## Wednesday, October 9, 2013

### Multiple-Choice Test Analysis - Summary

The past 21 posts have explored how classroom and standardized tests are traditionally analyzed. The six most commonly used statistics are made fully transparent in Post 10, Table 15, the Visual Education Statistics Engine (VESE) [Free VESEngine.xlsm or VESEngine.xls]. One more statistic was added for current standardized tests. Numbers must be meaningful and understood to have valid, practical value.

•       Count: The count is so obvious that it should not be a problem. But it is a problem in education. Counting right marks is not the same as counting what a student knows or can do. Also, a cut score is often set by selecting a point in the range from 0% to 100%. A cut score of 50 means 50%. But a test administered as traditional multiple-choice starts each student near 25% with 4-option questions, the expected score from guessing alone. [There is no way to know what low-scoring students know, only their rank.]
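As a minimal sketch (not the VESE implementation), the guessing floor on a forced-choice test is just one over the number of options, so a 4-option test "starts" each student near 25%:

```python
# Expected percent-correct from pure random guessing on an
# n-option multiple-choice test (illustrative sketch only).
def guessing_floor(num_options: int) -> float:
    """With 4 options, random marking is expected to score 25%."""
    return 100.0 / num_options

print(guessing_floor(4))  # 25.0
```

This is why a cut score of 50% on a 4-option test represents far less than half of the usable score range.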

•       Average: Add up all of the individual student scores and divide by the number of students for the class or test average score. [There is no average student.] Classes or tests can be compared by their averages just as students can be compared by their counts or scores.

•       Standard Deviation (SD): Theoretically, 2/3 of the scores on a distribution are expected to fall within one SD of the average. A very well prepared (or very under prepared) class will yield a small SD. A mixed class, with both very high and very low scores (many A-B and D-F, with few C grades), will yield a large SD.

•       Item Discrimination: A discriminating question separates those who know (high-scoring students) from those who do not know (low-scoring students). Every classroom test needs about ten of these to produce a grade distribution where one SD is ten percentage points (a ten point range for each grade).
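One common way to put a number on this (a sketch with hypothetical data, not necessarily the VESE formula) is the upper-lower discrimination index: the proportion of the top-scoring group marking the item right, minus the proportion of the bottom-scoring group:

```python
# Upper-lower discrimination index for one item.
# 1 = right mark, 0 = wrong mark, one entry per student.
def discrimination_index(high_group, low_group):
    return sum(high_group) / len(high_group) - sum(low_group) / len(low_group)

# A discriminating item: most high scorers right, most low scorers wrong.
print(round(discrimination_index([1, 1, 1, 1, 0], [0, 0, 1, 0, 0]), 2))  # 0.6
```

An index near zero means the item does not separate the two groups; a negative index flags a flawed item.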

•       Test Reliability: A test has high reliability when the results are highly reproducible. Standardized tests, therefore, use only discriminating questions. They rarely ask a question that almost all students can answer correctly. Traditional multiple-choice, therefore, does not assess what students actually know and value. Traditional standardized tests can only rank students.
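A standard reliability estimate for right/wrong items is KR-20; the sketch below (with a hypothetical student-by-item matrix) shows how it is built from item difficulties and the spread of total scores:

```python
from statistics import pstdev

# KR-20 reliability for dichotomous (1 = right, 0 = wrong) items.
# rows = students, columns = items (hypothetical data).
def kr20(matrix):
    k = len(matrix[0])                     # number of items
    totals = [sum(row) for row in matrix]  # each student's score
    var_total = pstdev(totals) ** 2        # variance of total scores
    pq = 0.0                               # sum of p*(1-p) over items
    for j in range(k):
        p = sum(row[j] for row in matrix) / len(matrix)
        pq += p * (1 - p)
    return (k / (k - 1)) * (1 - pq / var_total)

marks = [[1, 1, 1],
         [1, 1, 0],
         [1, 0, 0],
         [0, 0, 0]]
print(round(kr20(marks), 2))  # 0.75
```

Items that almost everyone answers correctly contribute little score variance, which is why a test built for high reliability favors discriminating items.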

•       Standard Error of Measurement (SEM): Theoretically, if a student retook the same test many times, 2/3 of the scores would be expected to fall within one SEM of the average. The SEM value fits inside the range of the SD. “Jimmy, you failed the test, but based on your test score and your luck on test day, each time you retake the test, you have a 20% expectation of passing without doing any more studying.” The SEM precision is based on the reliability of the entire test.
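The link between the SEM, the SD, and test reliability is the classical formula SEM = SD * sqrt(1 - reliability); a sketch with hypothetical values:

```python
import math

# Classical whole-test SEM: SD scaled down by the test's unreliability.
def sem(sd: float, reliability: float) -> float:
    return sd * math.sqrt(1.0 - reliability)

# Hypothetical: SD of 10 points, reliability of 0.9 gives SEM of about 3.16.
print(round(sem(10.0, 0.9), 2))  # 3.16
```

The more reliable the test, the smaller the SEM, and the narrower the band of scores expected on a retake.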

•       Conditional Standard Error of Measurement (CSEM): The CSEM is based (conditioned) on each test score. This refinement in precision is a recent addition to traditional multiple-choice analysis. It has been a part of the Rasch model IRT analysis for decades.
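One simple conditional model (a sketch, not necessarily the formula used by any particular analysis engine) is the binomial-error CSEM, which depends only on the raw score x out of n items; note how it shrinks at the extremes and peaks mid-scale:

```python
import math

# Binomial-error CSEM, conditioned on each raw score x out of n items,
# rather than one SEM for the whole test.
def csem(x: int, n: int) -> float:
    return math.sqrt(x * (n - x) / (n - 1))

# Zero at the extremes, largest near the middle of the score range.
for x in (0, 25, 50, 75, 100):
    print(x, round(csem(x, 100), 2))
```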

Even the CSEM cannot clean up the damage done by forcing students to mark every question even when they cannot read or do not understand the question. Knowledge and Judgment Scoring and the partial credit Rasch model do not have this flaw. Both accommodate students functioning at all levels of thinking and all levels of preparation.  These two scoring methods are in tune with the objectives of the CCSS movement.

- - - - - - - - - - - - - - - - - - - - -

Free software to help you and your students experience and understand how to break out of traditional multiple-choice (TMC) and into Knowledge and Judgment Scoring (KJS) (tricycle to bicycle):