Monday, November 1, 2010

Rasch Model IRT Demystified

Questionable No Child Left Behind (NCLB) test cut scores have put Arkansas, Texas, New York, and now Illinois in the news this year. How NCLB test cut scores are set is of concern. The traditional method of just counting right marks (known as classical test theory or CTT) is not used.

Instead the Rasch model item response theory (IRT), which estimates student ability and question difficulty, is used. It is an acceptable way to calibrate questions for computer assisted testing (CAT), where you only answer enough questions to determine pass or fail. This leaves in question how psychometricians, education officials, and politicians use the Rasch model on NCLB tests.
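For readers unfamiliar with the model, the core of a Rasch analysis is a single logistic equation: the probability of a right answer depends only on the difference between student ability and item difficulty, both measured in logits. A minimal sketch (the function name is mine, not Winsteps'):

```python
from math import exp

def rasch_p(ability, difficulty):
    """Rasch model: P(right answer) given ability and item difficulty (logits)."""
    return 1 / (1 + exp(-(ability - difficulty)))

# A student whose ability equals the item's difficulty has a 50% chance:
rasch_p(0.0, 0.0)  # 0.5
```

Everything Winsteps reports (measures, fit statistics, equating) is built on this one curve.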

How tests are scored should not be a mystery known only to those who benefit directly from “higher test scores” that may have no other meaning or use.  A detailed examination can also determine the Rasch model’s ability to make useful sense of classroom test results for instructional and student counseling purposes.

This blog will now pause a bit to relate the printouts from the Winsteps Rasch model IRT (student ability and item difficulty) with the Power Up Plus (right mark scoring or RMS) printouts in a new blog: Rasch Model Audit.

Power Up Plus (FreePUP) prints out two student counseling reports: 

Table 3. Student Counseling Mark Matrix with Scores and Item Difficulty contains the same student marks that Ministep (the free version of [Winsteps]) starts with when doing a Rasch model IRT test score analysis. The most able students with the least difficult items are in the upper left. The least able students with the most difficult items are in the lower right. The relationships between student, item, mark, and test are presented in a highly usable fashion for both students and teachers for student counseling.
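The ordering described above is straightforward to reproduce. A hypothetical sketch (the data and student names are invented for illustration; PUP's actual table carries more detail):

```python
# Hypothetical mark matrix: 1 = right, 0 = wrong, one row per student.
marks = {
    "Ann": [1, 1, 1, 0],
    "Bob": [1, 1, 0, 0],
    "Cyd": [1, 0, 0, 0],
}

# Item difficulty = how few students marked it right; easiest columns first.
item_totals = [sum(col) for col in zip(*marks.values())]
col_order = sorted(range(len(item_totals)), key=lambda j: -item_totals[j])

# Most able students first, so the upper left holds able students on easy items
# and the lower right holds the least able students on the hardest items.
rows = sorted(marks, key=lambda s: -sum(marks[s]))
matrix = [[marks[s][j] for j in col_order] for s in rows]
```

This same sorted matrix is what Ministep starts from when estimating its ability and difficulty measures.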

Table 3a. Student Counseling Mark Matrix with Mastery/Easy, Unfinished and Discriminating (MUD) Analysis re-tables the data to assist in the improvement of instruction and testing. Winsteps Rasch Model IRT quantifies each of the marks on these two tables. This is a most interesting and powerful addition to RMS. PUP Tables 3  and 3a will be used as working papers in this audit of the Rasch model.

On return, this blog will continue with the application of Knowledge and Judgment Scoring (KJS) and the Rasch model to promote student development (using all levels of thinking). We need accurate, honest and fair test results presented in an easy to understand and to use manner. KJS does this, as well as promotes student development. We also need to detect sooner when school and state officials are releasing meaningless test results (it took three years in New York). Both needs require some of the same insights.

Next: Rasch Model Audit

Friday, July 9, 2010

TAKS Qualms - Part 2

The last blog reported that the passing rates on the TAKS Social Studies Grade 8 tests for 2003 and 2004 were changed. That was a far greater deviation than the one between 2009 and 2010 that concerned Ericka Mellon in the Houston Chronicle.

Further study revealed that the passing rates on all four subjects (English Language Arts, Mathematics, Science, and Social Studies) were all changed on the Grade 10 tests for the years 2003 and 2004. Texas allowed their Rasch One Parameter IRT (ROPIRT) to roam the open range for two years.

In mathematics and science it returned impressive passing rates the first year and lower ones the second year. Without the changes, it would have taken seven years for the passing rates to exceed the initial 2003 benchmark values. That just did not look right.

By 2006 Texas had fenced in their ROPIRT. It thereafter produced values that looked right to education officials. By changing the passing rates for 2003 and 2004, the resulting curves looked very right: slow and continued progress toward an impossible goal of a passing rate of 100% by 2014.
The largest change was in science: 69% passing in 2003 was changed to 42%, a change of 27 percentage points or a lowering of the original figure by 39%.
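The arithmetic behind those two figures, for anyone checking along:

```python
original, revised = 69, 42  # science passing rates for 2003, percent

point_change = original - revised                 # 27 percentage points
relative_change = point_change / original * 100   # ~39% of the original figure
```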

IMHO what we are seeing here is the result of learning to use a new statistical tool that many would like to believe is a “standard statistical process”. It generates the numbers the states believed the federal government wanted. Secretary Duncan now considers such results “lying to our students”.

Unlike Knowledge and Judgment Scoring (KJS) that assesses at all levels of thinking and Confidence Based Learning (CBL) that assesses at the mastery level, the Texas ROPIRT has been fed right count scoring (RCS) data at the lowest level of thinking. The emphasis in assessment has changed, from earning high scores, to justifying the lowest cut point score. With KJS and CBL, the emphasis is on producing self-correcting high quality achievers who would in general find these tests, “a waste of time”.

The one striking observation in all four charts is that the average percent test score, in general, shows a gradual increase from year to year. If these tests were of comparable difficulty (with unchanging cut scores, they must be of comparable difficulty for 2005-2009), student performance on these tests was increasing prior to 2010.

As is, the English Language Arts and Social Studies tests are performing at the mastery level, above an average test score of 80% and passing rates at 90% and above. These tests now function as check lists of what experts in these fields consider necessary. Quibbling over a few points on test scores must now give way to serious concern about the quality of these tests to detect those students who will succeed in future schooling and on the job. Passing the test must be meaningful in the real world as well as in the edu-politic-money games currently being played.

We do not yet know how the ROPIRT does its work, but we can observe its behavior. Texas has seen three periods of different behavior: 2003-2004, when the wild passing rates were later tamed to look right; 2005-2009, when the average test score and the cut scores changed in unison; and 2010, when all four tests showed increased passing rates even though the 2010 average test scores were higher than 2009's on one test, the same on one test, and lower on two tests.

And the rate of change in passing was more than twice that of former years where the results looked right. The change was all in the same direction: up.

1.  The difference in behavior of the Texas ROPIRT model in 2010 was due to:

    a.  political influence.
    b.  Texas losing control again.
    c.  student performance.
    d.  all of the above.
    e.  none of the above.

Please be able to support your answer from your own experience or with information from trusted sources. (Good judgment is to omit if you cannot trust your mark to be right.)

Friday, June 25, 2010

Understanding and Trusting NCLB Test Standards - TAKS

After eight years there is still a problem with people not understanding and trusting NCLB standardized testing according to the Texas TAKS social studies grade 8 article, “Qualms arise over TAKS standards”, in the Houston Chronicle by Ericka Mellon, 7 June 2010.

‘State Rep. Scott Hochberg, vice chairman of the House Public Education Committee said in the Houston Chronicle, “You can get more than halfway to passing just by guessing”.’

A distribution of expected lucky scores from the test with 48 questions and 4-option answers shows this to be correct, on average.  

‘TEA Deputy Associate Commissioner Gloria Zyskowski said agency officials set the bar high enough so “students can’t pass the test by chance alone.”’

One very lucky student out of 100 needs to add 2 right marks to pass. One very unlucky student out of 100 needs to add 17 right marks to pass. Students cannot pass the test by luck alone. The unfairness of students starting the test with lucky scores ranging from 5 to 19 is not very important on this test, as 95 percent passed the test with an average score over 80%. [YouTube]
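Those figures follow from the binomial distribution for blind guessing on 48 four-option questions: the expected lucky score is 12 with a standard deviation of 3, so roughly the luckiest and unluckiest student in 100 land near 19 and 5. A sketch (the cut score of 21 is the 2010 value discussed below):

```python
from math import comb, sqrt

N, P = 48, 0.25  # 48 questions, 4 options each

def pmf(k):
    """Probability of exactly k lucky right marks from blind guessing."""
    return comb(N, k) * P**k * (1 - P) ** (N - k)

mean = N * P                # 12.0 expected lucky right marks
sd = sqrt(N * P * (1 - P))  # 3.0

# Chance of reaching the cut score of 21 by guessing alone:
p_pass_by_luck = sum(pmf(k) for k in range(21, N + 1))  # well under 1%
```

So Rep. Hochberg and Deputy Commissioner Zyskowski are both right: guessing gets you more than halfway to the bar, on average, but almost never over it.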

‘Sarah Winkler, the president of the Texas Association of School Boards, was shocked to find out Monday that the TEA doesn’t set the passing bar – called the cut score – until after students take the TAKS.’

This practice takes the TEA out of the game. They no longer have to make a bet on what the cut score should be (and to respond to all of the ramifications if they are wrong). They can bring all their expertise to bear on setting the most appropriate cut score. An operational test is not governed by research rules and hypothesis testing of average scores.

Operational testing is concerned with each student when the results determine passing or failing. In this case, and especially when low cut scores are used, it would be nice if students could also get out of the game of right count scoring (guess testing). The TEA can do this by using Knowledge and Judgment Scoring, which lets all students start the test at the same score and gives equal value to what they know and to the judgment needed to make use of what they know. It assesses all levels of thinking, an innovation ready for the next revision of NCLB. The social studies test yielded an average score over 80%. The TEA could also use Confidence Based Learning scoring that functions at the mastery level.

‘“We didn’t do anything differently than previous years,” said TEA spokeswoman Debbie Ratcliffe. “It wouldn’t be fair to kids if this test wasn’t at the same difficulty level from year to year.”’

The test characteristic curves, used by psychometricians, for the eight years bear this out. The curves for six of the eight years fall directly on top of one another with a cut score of 25. This is an outstanding piece of work. The year 2005 shows a slight deviation (cut score of 24) and 2010 a much greater deviation in difficulty (cut score of 21). The minute breaks in the scale scores at 2100 and 2400 are the standards for met and commended performance levels. (PLEASE NOTE that these curves descend to zero on tests that are designed to generate a lowest lucky score of 12 out of 48 questions, on average. This is no problem for true believers.)

‘TEA officials say the questions, for the most part, were harder this year, so they followed standard statistical process and lowered the number of items students needed to get correct.’ But were the questions harder or the students less prepared?

The TEA is faithfully following the operating rules that come with their Rasch one-parameter IRT model analyzer (ROPIRT). For the thoroughly indoctrinated true believer a ROPIRT works like a charm in a space with arbitrary dimensions. There is a mysterious interaction between the average score of a set of anchor questions embedded in each test, the average right count test score, the cut score, and the percent passing, on each test, and with the preceding test, within the ROPIRT. Only the last two or three are generally posted on the Internet. For the rest of us, we must judge its output by the results it returns to the real world.

The eight-year run of the social studies grade 8 test shows some informing behavior. Years 2003 and 2004 were originally assigned cut scores of 19 and 22. That yielded passing rates of 93% and 88%. Later, all years were assigned a cut score of 25 except for 2005 (24) and 2010 (21). Now to weave a story with these facts.

Starting in 2003 with a cut score set at 25, 77% passed the test with an average test score of 65.5%. The average test score increased by 5.2% in 2004 to 70.8%. This was not enough to trigger a change in the cut score. The passing rate increased to 81%.

The average test score remained stationary in 2005. This triggered a 4% change in the cut score by one count from 25 to 24. The ROPIRT decided that the test was more difficult this year so the passing rate should be adjusted up from 81 to 85%.

The average test score increased by 4.2% in 2006 to 75%. This triggered a 4% change in the cut score by one count from 24 back to 25. The ROPIRT decided that the test was too easy this year so the passing rate should be adjusted down from 85 to 83%.

The average test score increased by lesser amounts in 2007, 2008, and 2009 (3.1, 2.1, and 2.1%). These did not trigger an adjustment in the cut score.

In 2010, the average test score decreased by only 2.2% to 80.2%, the same average score as in 2008. The ROPIRT decided the test was way too difficult by changing the cut score by 4 counts from 25 to 21. This was a 16% adjustment in cut score for a 2.2% change in the average test score.
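The 16% figure is simply the relative change in the cut score; putting it next to the change in the average score makes the inconsistency plain (the 2009 average of 82.4% is reconstructed from the yearly changes quoted above, so treat it as approximate):

```python
cut_2009, cut_2010 = 25, 21
avg_2009, avg_2010 = 82.4, 80.2  # percent; 2009 value reconstructed

# Relative adjustment in the cut score:
cut_adjustment = (cut_2009 - cut_2010) * 100 / cut_2009  # 16.0%

# Change in the average test score, in percentage points:
score_change = avg_2009 - avg_2010  # ~2.2
```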

The amount of the adjustment is not consistent with the previous adjustments. The resulting passing rate for 2010 of 95% is not consistent with the passing rate for 2008 (90%) with the same average test score. The ROPIRT (which, to my knowledge, only looks back one test) is drifting away from previous decisions it has made. [YouTube]

The Texas data show four interesting things:

  1. If students do too well on a test, it is declared too easy and the cut score is raised to lower the pass rate even though they may have actually performed better.
  2. If students do too poorly on a test, it is declared too difficult and the cut score is lowered to raise the pass rate even though they may have actually performed poorly.
  3. If the above goes on long enough, the whole process drifts away from the original benchmark values and requires recalibration.
  4. A benchmark cut score can be revised based on the results of following years. This is consistent with the ROPIRT operating instructions to remove imperfect data until you get the right answer. 

Calibrating questions with a ROPIRT for use in time saving computer assisted testing (CAT) is valid. Using it to equate tests over a period of seven years is another matter. By design (an act of faith), a ROPIRT cannot err, as it lives in a perfect world of true scores (these error free true scores, the raw scores found on the raw score to scale score conversion tables, are generally considered to be on the same scale as the right count test scores, even though each student’s right count test score is influenced by a number of factors including item discrimination and lucky scores). Error occurs when imperfect data fed into a ROPIRT are not manually detected and removed. The blame game then ends with operator inexperience. Since Texas is using a Rasch Partial-Credit Model in a ROPIRT mode, it could use Knowledge and Judgment Scoring to reduce the error from traditional right count scoring.

For someone outside a State Department of Education to assess the operation of their ROPIRT, the investigator would need a minimum set of information for each year of the test: The mean of the anchor set of questions embedded in each test that is the primary determiner of the change in the cut score, the mean of the right count scored student tests, the cut score, and the percent passing. I have yet to find a state that posts or will provide all four of these values. Texas posts the last three. Arkansas posts the last two.

Are the test results politically influenced? From the data in hand, I don’t know enough to say. High scores (now high pass rates that are sensitive to low scores) are needed to meet federal standards. For several states, the shape of the gently, ever more slowly rising passing-rate curve appears more carefully choreographed than a direct result of student performance. I think a better question is: Is this from political influence or the result of a smoothing effect created when using (and learning to use) a ROPIRT? The revised passing rates for 2003 and 2004 on the social studies grade 8 test give us a mixed clue.

Wednesday, May 19, 2010

Wallpapering Traditional Multiple-Choice Tests

Wallpapering is preparing, in advance of the test, a mark pattern to be used when students do not have answers they can verify and trust. Students have three options after marking all the questions that can be used to report what is known or can be done:

  1. Turning in the answer sheet yields an accurate, honest, but unfair score unless omit or judgment is given a value equal to, or higher than, knowledge; as is done with Knowledge and Judgment Scoring (KJS) and Confidence Based Learning (CBL).

  2. Randomly marking the remaining questions gives judgment a value of zero. The score is less accurate, honest, and fair the lower it gets until it only reflects answer sheet marking ability. The test is a high anxiety academic casino game at the lowest levels (orders) of thinking.

  3. Wallpapering is a defensive measure. It reduces test anxiety. It increases fairness and test security. It shares the same good luck.

Being prepared reduces test anxiety. This includes how to make a forced-choice mark when you do not have a trusted answer. The age-old advice is to pick one option, such as C. Wallpapering adds one more step: Everyone in the class makes the same mark (with KJS and CBL everyone just omits).

A fair test requires a fair starting score (which exists with KJS and CBL).
The active starting score on traditional multiple-choice tests is about 33%, on average. That is a range of independent starting scores of about two letter grades. Wallpapering reduces this range.

Wallpapering produces a security code. The wallpaper marking-pattern can be made as elaborate as needed. Over half of the marks on an answer sheet can come from wallpapering when test scores drop below 50%. A set of answer sheets marked right and wallpapered, and with no erasures, indicates no tampering.

NCLB raw scores below 40% are now listed as Proficient in several states. The distribution of scores from marginal students with equal abilities follows the normal curve of error. The distribution widens as the test scores descend. It is gambling. Some pass. Some fail. This is not fair.

Wallpapering reduces this unfairness. All students in the group (class) mark the same answer when they cannot trust making a right mark. They do the same thing at the same time rather than individually trust to luck. This does not change their individual test scores, on average.

Wallpaper is an answer sheet created BEFORE seeing the test. Individual variation is markedly reduced. The simplest example is for all in the group to agree to mark the same letter when in doubt. More variable patterns can be created using mnemonics for easy memory. Short patterns can repeat every few questions. The Christmas tree repeats every 4 questions (A, B, C, D) on a 4-option test. Longer patterns can use poetry and music.
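A repeating pattern like the Christmas tree is trivial to generate in advance. A toy sketch (the pattern and test length are arbitrary examples):

```python
from itertools import cycle, islice

def wallpaper(pattern, n_questions):
    """Cycle a short mark pattern across the whole answer sheet."""
    return list(islice(cycle(pattern), n_questions))

sheet = wallpaper("ABCD", 48)  # the Christmas tree: A, B, C, D, A, B, ...
```

Any mnemonic-friendly string works the same way; the point is that everyone in the group carries the identical pre-agreed sheet into the test.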

Doing the same thing at the same time has evolved in birds as a means of protecting individual members of the flock from predators. The tight formation protects individual members and decreases the energy needed to fly. The same protection and energy savings applies in schools of fish.

Wallpapering has this effect for marginal students taking tests using RMS. It reduces the random lucky score variation in individual test scores. Wallpapering allows students to do the same thing at the same time with equal ease when marking a trusted right answer or marking the equivalent of omit using KJS or CBL. A few minutes of planning equal a few millennia of evolution in protecting marginal students from the vagaries of NCLB testing.


Wednesday, May 12, 2010

My Score Quality

Examiners can tell students, parents, and employers how a score relates to other examinees on a test. But how does it relate to everything else?

What does my score mean other than I passed the Arkansas Algebra I (AAI) end of course test? Am I ready for Algebra II? Have I mastered the general lifetime skills supported by learning Algebra? Did I take a lower-order thinking appreciation course or a higher-order thinking skills course? Did I just pass a graduation requirement and get a grade? Are the newspapers right that the course is not tough enough, that the passing cut score is too low?

Arkansas is one of five states to have a Statewide Uniform Grading Scale for classroom tests. This is one way of indicating quality. The final determiner is how students perform on their next unit, next semester or next job assignment.

Quality varies between states. The letter grade of “C” ranges from 70% to 77%. A classroom “D” is 60% in Arkansas and Florida and 70% in South Carolina and Tennessee. The quality of a test score is dependent upon a number of factors including scale scores. The AAI raw score equivalent to a classroom pass is 24%.

If the AAI test were all multiple-choice, every score falls in the shadow of the lucky scores. The score of 25 is nonsense. The cut scores of 21, 24 and 37 could be obtained by just marking the answer sheet without looking at the test. All cut scores would be shady “no quality” scores.

Replacing 40 of the multiple-choice questions with five open-response questions toughens up the test. The lucky scores on the AAI 4-option question test now cast a shadow over just half of the playing field, from the 15% to the 60% line. A score of 15 can be expected from lucky scores, down from 25, on average. Both 24 and 37 fall about half shaded. They have a quality score of less than 50%. Any score below 50% is a low quality score. Right mark scoring (RMS) holds students accountable for their luck on test day, as much as or more than, for what they know or can do.

Psychometricians were not on the side of the students when they included the five open-response questions. However, these questions are, in general, non-functional. The test designed for 100 points actually functions as a test based on 60 points. The functional passing scores are 40% (24) and 60% (37) out of 60 even though the designed passing scores are 24 and 37 out of 100. Few multiple-choice tests using RMS function as designed.

The AAI is designed for students to mark their best guess at the “best answer” on each question. Individual student test scores below 50% only have meaning after being averaged into a class or school score ranking. (RMS remains the least expensive way to obtain school rankings.) This research technique fails to apply to individual students. A test score of 37%, on a crippled multiple-choice test (no omit), is also a quality score of 37%. The test is not designed for students to report what they trust they know and can use as the basis for further learning and instruction. That requires the option missing on tests using RMS: omit (“I have yet to learn this”).

RMS and knowledge and judgment scoring (KJS) can be combined on the same test as a means of gently nudging students out of the habit of guessing, to reporting what they actually know. The test scores and student counseling matrixes guide students on the path from passive pupil to self-correcting high achiever. There is an additional dimension of information available that is not obtainable with RMS even when using the same test questions.

(Wallpaper has a third use with RMS. Along with reducing test anxiety, and the variation in lucky score starting positions, it allows KJS to extract ¾ of the quality information lost with RMS. A wallpaper key is added to the answer key and weight key.)

The learning cycle shortens as passive pupils become self-correcting high quality achievers. Boring classes become exciting adventures. A multiple-choice test that randomly passes and fails low performing students of equal abilities with RMS becomes a seek-and-find task to report what is meaningful and useful for each student with KJS and Confidence Based Learning (CBL).

When students elect to report what they know and trust with KJS or CBL, they receive a quantity score, a quality score and a test score. High quality students obtain individual confirmation that they do know what they know and that they are skilled at using this knowledge regardless of the quantity of right marks. Success is doing more of what each student is good at doing. This is in contrast to RMS where doing more of what low scoring students are doing (guessing right answers) is a continuation of failure (a practice in continually failing schools).

Assessment should produce high quality scores and promote the development of high quality students. CBL differentiates questions into informed, uninformed, misinformed and good judgment to omit, to question, and not make a serious error.  KJS sorts questions into expected, difficult, misconception and good judgment to not make a wrong mark and thus report what has yet to be learned. Quality is independent from quantity.

Secretary of Education Arne Duncan’s opinion: “At a time when we should be raising standards to compete in the global economy, more states are lowering the bar than raising it. We're lying to our children when we tell them they're proficient but they're not achieving at a level that will prepare them for success once they graduate.”

Thursday, May 6, 2010

Three Multiple-Choice Games

Three multiple-choice games can be played on the same field. Each has its own rules for scoring and grading. [YouTube]

The 2009 Arkansas Algebra I (AAI) end-of-course test has the game field designed with 100 points, the same number as yards on a football field. The field slopes from a swamp down at the left end where the guessers play up to dry land where the 100% goal posts stand.

The number of answer options for each multiple-choice question controls the difficulty of play related to luck. The more options per question, the more skilled the players must be to win and the fewer lucky winners. Anyone can play when right mark scoring (RMS) is used: students, employees, and animals (the target of the original complete multiple-choice test that included omit).

The Arkansas Uniform Grading Scale rules set the letter grades of D to A at 60 to 90 for traditional right mark scoring (RMS) on classroom tests. The static starting score is set to zero. The hidden active starting score is 25, on average.

The Arkansas Algebra I (AAI) end-of-course test replaces 40 multiple-choice with five 8-point open response questions. The hidden active starting score is reduced from 25 to 15, on average. The test is now ten points, or one letter grade, more difficult. A student cannot pass the test by guessing.

The active starting score, the lucky score, is hidden at the left end of the playing field in the foggy swamp where the guessers play among the lucky-score trees. The traditional classroom game starts here with lower order thinking skills. Students are encouraged to guess from 5, 4, 3, or 2 options. Only right marks count, as blank and omit have no value with RMS.

Confidence Based Learning (CBL) only plays on dry ground near the goal posts. It uses 3-option questions. It starts play at the 75% (25-yard) line for good judgment, far away from the swamp of shady scores. Mastery players receive points for both knowledge and their skill in using their knowledge (their judgment). They attempt to reach the 100% goal posts. They make few, if any, wrong marks.

Knowledge and Judgment Scoring (KJS) starts play at the 50% (50-yard) line for good judgment. Students functioning at lower levels of thinking can mark every question (which may put them back in the swamp with RMS). Students and employees functioning at higher (all) levels of thinking use the test to report what they trust. Their goal is to make the highest number of right marks with the fewest number, if any, of wrong marks. [YouTube]

A universal score board summarizes the rules for the three methods of scoring. Scoring is compared in passive, static, mode after the test is finished; and in active, dynamic, mode during the test. Scoring for KJS and CBL is usually expressed in the active, dynamic, mode as the scoring starts with the value given to perfect judgment, 50% or 75% (no wrong marks have been made at the start of the test).

Scoring for RMS is usually expressed in the passive, static, mode after the test paper has been turned in. This allows resetting the starting score (and the value of judgment) to zero. This has deceptive consequences. Students like the apparent “no risk” feature. They also like the help from lucky marks. What they do not realize is that every wrong mark reduces their lucky score.
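One plausible reading of the three scoring rules, sketched as a single function (the exact weights used by KJS and CBL software may differ; this just mirrors the starting scores described above):

```python
def score(marks, method="RMS"):
    """marks: list of 'right' / 'wrong' / 'omit'. Returns a percent score."""
    n = len(marks)
    right = marks.count("right")
    if method == "RMS":  # only right marks count; omit is worth nothing
        return 100 * right / n
    start = {"KJS": 50, "CBL": 75}[method]  # value of perfect judgment
    good_judgment = right + marks.count("omit")  # no wrong mark was made
    # knowledge credit for right marks plus judgment credit for right-or-omit
    return ((100 - start) * right + start * good_judgment) / n
```

Note that when every question is marked, all three methods collapse to the same right-count percent, which is why an all-marks guessing strategy scores identically under each rule.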

Changing from RMS to KJS or CBL is about the same as changing from a tricycle to a bicycle. It is changing from external control and correction to internal control and self-correction; from linear, low order, thinking to include high order, cyclical, thinking.  It takes practice; about three experiences.

It is scary to do something new. Who ever heard of getting one point for a right mark and one point for the good judgment to not make a wrong mark (omit)? It is done on every essay test where students report what they know and trust, and omit what they have yet to learn.

Students quickly like KJS as it saves them time not having to come up with “the best answer” to a question they cannot read or understand. They like to see the quality score confirm what they trust; what they really know and can build on.

They like the freedom to customize the multiple-choice test to match their preparation (a 90% quality score), as they do on most other assessments. This is effective formative assessment as students learn to question, to answer, and to confirm as they are learning in preparation for assessment. They are in charge as they develop from passive pupil to self-motivated high achiever.

Teachers benefit too. KJS and CBL differentiate misconceptions, where students think they know the answer but do not, from just guessing on difficult questions. Students are sorted by their level of thinking (teachable level) as well as by what they know. Each student presents a quantity, a quality, and a test score. You have accurate, honest and fair numbers to support your classroom observations.

Since the three methods of scoring are based on different skills, the Universal Cut Point Raw Score Grade Equalizer or other methods can be used to assign grades (2009 Arkansas End-of-Course Raw to Scale Score Conversion Table and State Law).

All three methods produce the same raw score when examinees fail to exercise good judgment and mark all questions in hope of getting a lucky passing score. An accurate and honest performance produces the highest score, on average.

Wednesday, April 28, 2010

My Lucky Score

Students and teachers are as interested in what the next test score will be as in the latest test score. Will it be at or above an expected score? What can be expected from luck? [YouTube]

The portion of the time each student will be lucky can be obtained from charts in the previous blog. These charts show the number of lucky scores obtained when the answer sheets were marked without looking at the test. 

The number of lucky scores becomes the expected frequency of lucky scores for each student. The bar graph becomes an uncluttered line graph.                                            

On 4-option questions, a student can expect to receive a lucky test score of 15 out of 60, about 1/8th of the time (0.12), by just marking the answer sheet without looking at the test.

Half of the time, the lucky test score is expected to be 15 or less, and half of the time 15 or more. Students can increase their luck by deleting one or more answer options. The average lucky score becomes 20 when one option is deleted on each question. 

Students can turn luck on and off by the decisions they make and the chances they take. The Arkansas Algebra I (AAI) test contains sixty 4-option multiple-choice questions. How students take the test determines how difficult it will be. If students think of options not on the test, they make the test more difficult, a 4-option question becomes a 5-option question or more. They are going in the wrong direction. 

Rather than picking a right answer, delete wrong answers and then guess. At the other extreme, if students can discard all but two options, on average, they can expect a lucky score of 30 out of the 60 questions, or 50%. [The higher order thinking skills needed to do this are promoted in the classroom by Knowledge and Judgment Scoring (KJS) and Confidence Based Learning (CBL). Students do not need to know “the right answers” to beat standardized tests. They need a practiced self-judgment.]
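The effect of eliminating options is just the expected value of guessing: the number of questions divided by the options left standing. A quick check on the AAI's 60 questions:

```python
n_questions = 60

# Expected lucky score as answer options are eliminated before guessing:
expected_lucky = {options: n_questions // options for options in (4, 3, 2)}
# 4 options -> 15, one deleted -> 20, only two left -> 30 (a 50% score)
```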

The expected average score is a stable value between 15 and 20. Where each student's (my) lucky score will fall around that average is not. There is no way to predict an individual student's lucky score. That is what makes luck enticing. We can, however, predict the average lucky score, and the range in which lucky scores will occur, very well. Students can always pass the test with proper preparation.
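That predictability can be made concrete. Under pure guessing, the lucky score follows a binomial distribution, so the average and the spread are fixed even though any one student's draw is not. A sketch, assuming the 60 four-option questions described above:

```python
import math

n, p = 60, 0.25                   # 60 four-option questions, pure guessing
mean = n * p                      # average lucky score: 15
sd = math.sqrt(n * p * (1 - p))   # spread of lucky scores, about 3.35

# Roughly 95% of lucky scores fall within two standard deviations of the
# mean, so the range is predictable even though individual scores are not.
low, high = mean - 2 * sd, mean + 2 * sd
print(f"average {mean:.0f}, spread {sd:.2f}, usual range {low:.0f} to {high:.0f}")
```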

The inability to predict individual student lucky scores is of little consequence with Confidence Based Learning (CBL), or with the ACT and SAT, as chance has little effect at the mastery level of learning and performing. It has a devastating effect when students of similar ability are selected to pass or fail a test with raw scores below 50%. Using an average score protects teachers and schools. It took the forced disaggregation of NCLB test scores to keep the low performance of groups smaller than about 30 students from being masked by the high performance of other students.

Fair means chance will distribute scores in a “bell shaped curve,” or under the “normal curve of error” (if there are enough questions on the test; the AAI, with 60 questions, has enough). The curve has the name “normal” because this is what happens when you know nothing on the test, or mark the answer sheet without looking at the test booklet. It could be called the “know nothing curve.”

On a multiple-choice test scored only by counting right marks, Right Mark Scoring (RMS), there are no qualification runs to put the best or the worst at the head of the pack. Instead, chance deals each student a secret handicap on test day: luck. The student with the least ability in your class may draw 20 points and the next student may draw only 10. This is fair under RMS rules, as both students have an equal opportunity to draw. [YouTube]

Some people believe that tests, especially high-stakes tests, should not be games of chance. The alternative is to let examinees report what they know, based on their own judgment. Both knowledge and judgment are then scored, just as on projects, essays, job assignments, and reports.

Knowledge and Judgment Scored (KJS) tests and Confidence Based Learning (CBL) tests give you a quantity, quality and test score. This form of testing and learning, in the classroom, promotes the student development needed for your students to be winners on any test based on high quality work.

Next, the three games played on a multiple-choice playing field, from traditional RMS (guess testing) to obtaining accurate, honest and fair scores.

Monday, April 26, 2010

Multiple-Choice Lucky Scores

The news headlines could have been, “Cheat or Chance” or “Trick or Teach,” this past year. The cut score for passing a multiple-choice test, scored by only counting right marks, continued to fall. The traditional multiple-choice test scoring method was being pushed over a credibility limit.

Aug 11: “City students are passing standardized tests just by guessing”
Aug 17: “Guessing My Way to Promotion”
Sep 14: “Botched Most Answers on New York State Math Test? You Still Pass”
Sep 16: “Is any test reliable? CRCT? SAT? NAEP? ACT? Pick one”
Oct 31: “Duncan: States ‘set bar too low’”
Jan 11: “As School Exit Tests Prove Tough, States Ease Standards”

The 100-point 2009 Arkansas Algebra I (AAI) end-of-course test, mentioned in the last article, is a good example to examine to see how standardized testing actually works:

  1. Items for new AAI versions are trial-tested, in a current operational test, rather than field-tested on a selected sub-sample at a different time.
  2. A statewide Uniform Grading Scale is monitored for inflation by comparing the pass rate in school with the pass rate on the AAI.
  3. Arkansas has had a nearly perfect yearly increase in the AAI test score for the past nine years (see page 24 of 28).
The multiple-choice portion of the test is played on the traditional field of varying quality. At the high end, everyone knows what the examinee knows or can do, including the examinee. The scoring in Confidence Based Learning (CBL) plays in this region, as do the SAT and ACT when used to pick top quality winners. 

Traditional Right Mark Scoring (RMS), used on the AAI, is played at the other, lower, end of the field. The examinee guesses and waits for the test score, and even then no one knows what the student knows or can do, including the examinee.

Knowledge and Judgment Scoring (KJS) permits students to individualize their test to match their preparation. They can opt for RMS or for KJS. They can opt for the teacher to tell them what they have right, or for reporting what they know and trust is right. They can opt for lower or higher-order thinking.

Chance plays almost no part in CBL. Chance is the main determiner of lucky scores on any test using RMS, including the SAT, ACT, and end-of-course tests. [YouTube]

The effects of unaltered pure chance can be seen on tests such as the AAI when:

  1. The answer sheets are marked randomly without looking at the test booklet.
  2. The answer sheets have no erasures.
  3. No marking pattern is used, such as wallpapering. Wallpapering reduces test anxiety: students agree, before the test, on how they will mark forced-choice guesses (once they have finished reporting what they know and trust, but may not omit items or leave blanks).
  4. Student judgment is absent or is given no value (RMS).
There are several ways to score the effects of chance on multiple-choice tests:

  1. Randomly mark 100 AAI answer sheets for the 60 multiple-choice questions.
  2. Use a quincunx board.
  3. Use the Excel function: BINOMDIST.
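For readers without Excel, BINOMDIST is easy to reproduce with the Python standard library. This sketch mirrors the cumulative=FALSE and cumulative=TRUE forms (the function names are mine):

```python
from math import comb

def binom_pmf(k, n, p):
    """Chance of exactly k lucky right marks on n questions
    (BINOMDIST with cumulative = FALSE)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binom_cdf(k, n, p):
    """Chance of k or fewer lucky right marks (cumulative = TRUE)."""
    return sum(binom_pmf(i, n, p) for i in range(k + 1))

# Sixty 4-option questions: a lucky score of exactly 15 turns up
# about 12% of the time, matching the chart described above.
print(round(binom_pmf(15, 60, 0.25), 2))
```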
The quincunx board allows you to see chance in action; that force behind what is called creativity in Arts, Letters, and Politics, and is also called error in Science, Math and Engineering. The quincunx board works well for normal classroom tests with about 25 students (balls) and 8 questions (9 bins). (Number each student. Run slowly. Have each student follow his/her ball as it falls into a bin. Repeat and compare results for an added effect.)

The Excel function BINOMDIST can be set for almost any number of students and questions. A set of 100 answer sheets produces a surprisingly uniform distribution even though the right answer is expected by chance only 1/4th of the time.

The graph of 4-option questions shows that no student can expect to pass the AAI by guessing. Classroom passing is set equal to 24 raw score points out of 100 points in Arkansas. The maximum lucky score on the sixty 4-option questions was 23, and that happened to only about 1 out of 100 students. The required passing cut score of 37 points for graduation in Arkansas is far beyond the reach of lucky scores. [YouTube]

But students can alter these results by exercising higher-order thinking skills. If students can, on average, discard one option on each question, they are then working with a 3-option question test. The classroom test equivalent of 24 raw score points can be passed with lucky scores. Some 17 (6 + 4 + 3 + 2 + 1 + 1) out of 100 students passed by guessing from the remaining three options. Students who do this are often referred to as “test wise.”
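These pass rates can be checked directly from the binomial tail. The sketch below re-derives both cases (pure guessing on 4 options, and guessing after one option is discarded); the helper function is mine:

```python
from math import comb

def tail(k, n, p):
    """Chance of a lucky score of k or more right marks out of n questions."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

n = 60
# Four options: the 24-point classroom cut is essentially out of reach,
# and the 37-point graduation cut is astronomically so.
print(f"4 options, score >= 24: {tail(24, n, 1/4):.4f}")
print(f"4 options, score >= 37: {tail(37, n, 1/4):.1e}")
# Three options (one discarded): roughly 17 in 100 pass by luck alone.
print(f"3 options, score >= 24: {tail(24, n, 1/3):.2f}")
```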

Students, teachers, test makers, and administrators can manipulate the effects of chance, for their benefit, in other ways.

Wednesday, January 20, 2010

Classroom and Standardized Test Grades

Does a grade, or cut point, tell us what happened or just the appearance, that a politician or an administrator wants to give, of what happened?

Even a simple question, “Why did I get the same grade on my math test as another student got on a government test when our test scores differed by more than ten percentage points?” has no simple answer.

A Universal Cut Point Raw Score Grade Equalizer helps put things into perspective:

Most teachers, who just count right marks, use a 10-point range scale, as it is easy to remember the cut points of 90, 80, 70, and 60%. Every student can earn any letter grade (all can be A’s if all have mastered the assignment).
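As a concrete illustration, the 10-point range scale is a simple lookup (cut points from the text; the function name is mine):

```python
def ten_point_grade(percent):
    """Letter grade on the 10-point range scale: cuts at 90, 80, 70, 60."""
    for cut, letter in ((90, "A"), (80, "B"), (70, "C"), (60, "D")):
        if percent >= cut:
            return letter
    return "F"

print(ten_point_grade(88))  # prints B
```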

Other teachers use the average test score to select a scale for assigning grades. Average test scores ranging from 70% up to 92.5% produce a range of grades for a raw score of 88%: it is an A on a 12-point range scale, a B on a 6-point scale, a C on a 4-point scale, and a D on a 3-point scale.

There are many ways to assign grades. In general, a test score below 80% means the student is not keeping up with the course and will not be prepared for the next course, whatever grade is assigned.

Right mark scoring (RMS) grades are easily manipulated by the selection of questions, question difficulty, and cut points in the classroom and on standardized tests. Lowering the cut point to 40% (a range scale of 15 points and a quality score of 40%) insures that a portion of students will pass by luck alone. There is no way to know what the student actually trusted as a basis for further learning and instruction.

Knowledge and Judgment Scoring (KJS) and Confidence Based Learning (CBL) value judgment (quality) independently from knowledge. The student is in charge of reporting what he can trust and what he has yet to master. KJS and CBL reward students for taking the responsibility to learn beyond the concrete level. They are rewarded for learning, anywhere and anytime, not just in class.  They ask questions, get help, and put in the time needed to master the assignment. It feels good to have mastered a clearly stated and understandable assignment.

KJS and CBL grades are not easily manipulated since there is a score for what is known and the degree to which it can be trusted. The grades reflect what self-motivated achievers are doing rather than how lucky passive pupils were on test day.

In my opinion, one of the main reasons schools show a marked increase in RMS standardized test scores one year and no further increase in the following years is that passive pupils can only be pushed so far in traditional classrooms. Student development that produces self-motivated achievers, functioning at all levels of thinking, is needed to go further. These are the graduates who are successful in what they do next, in school and beyond.

There are many ways for schools to promote mastery, and not just the appearance of mastery. KJS is a bridge to mastery. CBL guarantees mastery.