## Wednesday, November 6, 2013

### The Value and Meaning of a Mark

The bet in the title of Catherine Gewertz’s article caught my attention: “One District’s Common-Core Bet: Results Are In”. As I read, I realized that the betting that takes place in traditional multiple-choice (TMC) testing was being given arbitrary valuations to justify the difference between a test score and a classroom observation. If the two agreed, that was good. If they did not agree, the standardized test score was dismissed.

TMC gives us the choice of a right mark and several wrong marks. Each is traditionally given a value of 1 or 0. This simplification, carried forward from paper and pencil days, hides the true value and the meanings that can be assigned to each mark.

The value and meaning of each mark change with the degree of completion of the test and the ability of the student. Consider a test in which each question has one right answer and three wrong answers. This is now a popular format for standardized tests.

Consider a TMC test of 100 questions. The starting score is 25, on average. Every student knows this. Just mark an answer to each question. Then look over the test and change to right the few marks for answers you can trust you know. With good luck on test day, get a score high enough to pass the test.
If a student marked 60 correctly, the final score is 60. But the quality of this passing score is also 60%.

Part of that 60% represents what a student knows and can do, and part is luck on test day. A passing score can be obtained by a student who knows or can do less than half of what the test is assessing, that is, with a quality below 50%. This is traditionally acceptable in the classroom. [TMC ignores quality. A right mark on a test with a score of 100 has the same value, but not the same meaning, as a right mark on a test with a score of 50.]
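The arithmetic behind "less than half" can be sketched as follows. This is a minimal illustration, assuming a student answers the questions they know correctly and guesses purely at random among four options on the rest; the function names are hypothetical and not part of PUP.

```python
def expected_score(known: float, total: int = 100, options: int = 4) -> float:
    """Expected TMC score when `known` questions are answered from knowledge
    and every remaining question is a blind guess among `options` choices."""
    return known + (total - known) / options


def known_from_score(score: float, total: int = 100, options: int = 4) -> float:
    """Invert expected_score: how many questions must be known, on average,
    to end up with `score` under the same blind-guessing assumption."""
    chance = total / options            # the "starting score" from marking everything
    return (score - chance) / (1 - 1 / options)


print(expected_score(0))                # 25.0
print(round(known_from_score(60), 1))   # 46.7
```

Under these assumptions, a score of 60 corresponds to knowing about 47 of the 100 questions. Any ability to eliminate wrong options before guessing would lower that number further.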

A wrong mark can also be assigned different meanings. As a rule of thumb (based on the analysis of variance, ANOVA; a time-honored method of data reduction), if fewer than five students mark a wrong answer to a question, the marks on the question can be ignored. If fewer than five students make the same wrong mark, the marks on that option can be ignored. This is why Power Up Plus (PUP) does not report statistics on wrong marks, but only on right marks. There is no need to clutter up the reports with potentially interesting, but useless and meaningless information.
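The rule of thumb for a single question can be sketched in a few lines. This is an illustration only, not PUP's code; the function name, the option letters, and the sample marks are made up for the example.

```python
from collections import Counter


def functional_wrong_options(marks, key, min_marks=5):
    """Tally the wrong marks on one question and keep only the wrong options
    chosen by at least `min_marks` students (the rule-of-thumb threshold)."""
    wrong_counts = Counter(m for m in marks if m != key)
    return {opt: n for opt, n in wrong_counts.items() if n >= min_marks}


# Hypothetical class of 30 students; "A" is the right answer.
marks = ["A"] * 20 + ["B"] * 7 + ["C"] * 2 + ["D"] * 1
print(functional_wrong_options(marks, key="A"))  # {'B': 7}
```

Here only option B draws enough marks to be worth a second look; the marks on C and D fall below the threshold and can be ignored.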

PUP does include a fitness statistic not found in any other item analysis report that I have examined. This statistic shows how well the test fits student preparation. Students prepare for tests; but test makers also prepare for the abilities of test takers.

The fitness statistic estimates the score a student is expected to get if, on average, as many wrong options are eliminated as are non-functional on the test, before guessing; with NO KNOWLEDGE of the right answer. This is the best guess score. It is always higher than the design score of 25. The estimate ranged from 36% to 53%, with a mean of 44%, on the Nursing124 data. Half of these students were self-correcting scholars. The test was then a checklist of how they were expected to perform.
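One way to read the best guess score is sketched below. The exact formula PUP uses is not given here, so this is an assumption: average the number of non-functional wrong options per question, remove that many options, and guess among what remains. The function name and the sample counts are hypothetical.

```python
def best_guess_score(functional_wrong_counts, wrong_per_question=3):
    """Estimate the expected percent score from guessing with no knowledge of
    the right answer, after discarding the average number of non-functional
    wrong options. `functional_wrong_counts` holds, for each question, the
    number of wrong options that still attract marks."""
    n = len(functional_wrong_counts)
    avg_nonfunctional = sum(wrong_per_question - c for c in functional_wrong_counts) / n
    remaining = (wrong_per_question + 1) - avg_nonfunctional  # options left to guess among
    return 100 / remaining


# Every wrong option functional: the design score for four options.
print(best_guess_score([3, 3, 3, 3]))            # 25.0

# Hypothetical test where most wrong options attract no marks.
print(round(best_guess_score([1, 2, 1, 1]), 1))  # 44.4
```

The second call uses invented counts, not the Nursing124 data. It only shows how non-functional options push the guessing score above 25.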

With the above in mind, we can understand how a single wrong mark can be devastating to a test score. But a single wrong mark, not shared by the rest of the class, can be taken seriously or ignored (just as a right mark, on a difficult question, by a low scoring student).

Making sense of TMC test results requires both a matrix of student marks and a distribution of marks for each question (Break Out Overview). Evaluating only an individual student report gives you no idea whether a student missed a survey question that every student was expected to answer correctly or a question that the class failed to understand.

Are we dealing with a misconception? Or a lack of performance related to different levels of thinking in class and on the test; or related to the limits of rote memory to match an answer option to a question? [“It’s the test-taking.”] When does a right mark also mean a right answer or just luck on test day? [“This guy scored advanced only because he had a lucky day.”]

Mikel Robinson, as an individual, failed the test by 1 point. Mikel Robinson, as one student in a group of students, may not have failed. [We don’t really know.] His score just fell on the low side of a statistical range (the conditional standard error of measurement; see a previous post on CSEM). Within this range, it is not possible to differentiate one student’s performance from another’s using current statistical methods and a TMC test design (students are not asked if they can use the question to report what they can trust they actually know or can do).

We can say that if he retook the test, his probability of passing may be as high as 50%, or more, depending upon the reliability and other characteristics of the test. [And the probability of those who passed by 1 point, of then failing by one point on a repeat of the test, would be the same.]
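A rough sketch of the retake argument, under a simple assumption: the retake score is normally distributed around the observed score with a standard deviation equal to the standard error of measurement. The cut score of 60 and the error of 4 points are hypothetical values chosen for illustration, not figures from the article, and the model ignores test reliability and regression toward the mean.

```python
import math


def pass_probability(score: float, cut: float, sem: float) -> float:
    """Probability that a retake lands at or above `cut`, assuming the retake
    score is normal with mean `score` and standard deviation `sem`."""
    z = (cut - score) / sem
    return 0.5 * (1 - math.erf(z / math.sqrt(2)))  # 1 - Phi(z)


# Hypothetical values: cut score 60, standard error 4 points.
p_pass_after_failing = pass_probability(59, cut=60, sem=4)
p_fail_after_passing = 1 - pass_probability(61, cut=60, sem=4)

print(round(p_pass_after_failing, 2))  # 0.4
print(round(p_fail_after_passing, 2))  # 0.4
```

In this model the student who failed by one point and the student who passed by one point have the same chance of switching sides on a retake, which is the symmetry noted in the brackets above.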

These problems are minimized with accurate, honest, and fair Knowledge and Judgment Scoring (KJS). You can know when a right mark is a right answer using KJS or the partial credit Rasch model IRT scoring. You can know the extent of a student’s development: the quality score. And, perhaps more important, your students can trust what they know and can do too, during the test as well as after the test. This is the foundation on which to build further long-lasting learning. This is student empowerment.