## Wednesday, July 10, 2013

### Visual Education Statistics - Equating

18
The past few posts have shown that if two tests have the same student score standard deviation (SD) they are easy to combine or link. Both tests will have the same student score distribution on the same scale.

Equating is then a process of finding the difference between the average test scores and applying this value to one of the two sets of test scores. Add the difference in average test score to the lower set of scores, or subtract it from the higher set to combine the two sets of test scores.
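As a minimal sketch of this mean-shift step (the function name and the score lists are made up for illustration, not taken from the VESE tables), the difference in averages is simply added to every score on the lower test:

```python
from statistics import mean

def mean_equate(new_scores, reference_scores):
    """Shift new_scores so their average matches the reference average."""
    shift = mean(reference_scores) - mean(new_scores)
    return [score + shift for score in new_scores]

# Hypothetical scores: same spread, averages 10 counts apart.
reference = [20, 25, 30, 35, 40]
new_test = [10, 15, 20, 25, 30]

equated = mean_equate(new_test, reference)
print(equated)
```

This only makes sense when the two SDs are close enough to be treated as equal; the shift moves the distribution without changing its spread.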

This can be done whenever the SDs are within acceptable limits (considering all factors that may affect the test results, the expected results, and the intended use of the results). This is, IMHO, a very subjective judgment call to be made by the most experienced person available.

There are two other situations: the average test scores are the same but the SDs differ beyond acceptable limits, or both the average test scores and the SDs differ beyond acceptable limits. In both cases we need to equate the two different SDs, that is, the two different distributions of student scores.

Chart 48 is a re-tabling of Chart 44. The x-axis in Chart 48 shows the set Standard Deviation (SD) used in the VESE tables in prior posts. Equating a low SD test (10) to a high SD test (30) has different effects than equating a high SD test (30) to a low SD test (10). The first improves the test performance; the second reduces the test performance.

There is then a bias to raise the low SD test to the high SD test. “The test this year was more difficult than the test last year,” was the NCLB explanation from Texas, Arkansas, and New York. [It was not that the students this year were less prepared.]

The most frequent way I have seen mapping done (Livingston, 2004, figure 2, page 14) is to plot the scores of the test to be equated on the x-axis and the scores of the reference test on the y-axis. The equating line for two tests with similar average test scores and SDs is a straight line from zero through the 50% point on both axes (Chart 49).

If the average test scores are similar but the SDs are different, the equating line becomes tilted to expand (Chart 50) or contract (Chart 51) the equated values to match the reference test. Mapping from a low SD test to a higher SD test leaves gaps. Mapping from a high SD test to a low SD test produces clumping, in part from rounding errors.
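The gaps and the clumping are easy to see with a small sketch of a tilted straight-line mapping. This is a plain linear equating formula, not necessarily the one behind Charts 50 and 51; the function name, the shared average of 50, and the five sample scores are assumptions chosen for illustration, with the SDs of 10 and 30 borrowed from the text:

```python
def equate_line(score, new_mean, new_sd, ref_mean, ref_sd):
    """Map one score onto the reference scale by matching mean and SD."""
    slope = ref_sd / new_sd  # tilt of the equating line
    return round(ref_mean + slope * (score - new_mean))

scores = [48, 49, 50, 51, 52]  # five adjacent scores around the average

# Low SD (10) mapped to high SD (30): the line expands the scores.
expanded = [equate_line(s, 50, 10, 50, 30) for s in scores]

# High SD (30) mapped to low SD (10): the line contracts the scores.
contracted = [equate_line(s, 50, 30, 50, 10) for s in scores]

print(expanded)    # adjacent scores land 3 counts apart (gaps)
print(contracted)  # several scores round to the same value (clumping)
```

With a slope of 3, no student can land on the reference values in between; with a slope of 1/3, rounding piles three different scores onto 50.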

Mapping a new difficult test to an easier reference test with the same SD increases the values on the equating line as well as truncating it. Any new test scores over 30 on Chart 52 have no place to be plotted on the reference test scale.

Equating with an increase in both SD and average test score expands the distribution and truncates the equating line even more (Chart 52). A comparison of the two above situations as parallel lines (Chart 53) helps to clarify the differences.
Both increase the new, difficult test's average score of 20 counts to 30 counts on the reference scale. In this simple example based on a normal distribution, the remaining values increase uniformly, in equal units of 10 with the same SD and 15 when mapping to the larger SD.
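One way to read these numbers, offered as an interpretation rather than a reconstruction of the charts: both equating lines pass through the point (20, 30), and each 10-count step on the new test becomes a step of 10 on the reference scale with the same SD (slope 1.0) or a step of 15 with the larger SD (slope 1.5). The function name is hypothetical, and the reference scale maximum of 40 is an assumption inferred from the remark that new scores over 30 have no place on the reference scale.

```python
def equate_through_point(score, new_mean, ref_mean, slope, ref_max=None):
    """Straight equating line through (new_mean, ref_mean).

    Returns None when the equated value falls off the top of the
    reference scale (the truncated part of the line).
    """
    value = ref_mean + slope * (score - new_mean)
    if ref_max is not None and value > ref_max:
        return None
    return value

new_scores = [0, 10, 20, 30, 40]

# Same SD: every 10 counts on the new test is 10 counts on the reference.
same_sd = [equate_through_point(s, 20, 30, 1.0, ref_max=40) for s in new_scores]

# Larger SD (assumed slope 1.5): every 10 counts becomes 15 counts.
larger_sd = [equate_through_point(s, 20, 30, 1.5, ref_max=40) for s in new_scores]

print(same_sd)
print(larger_sd)
```

Under these assumptions the steeper line runs out of reference scale sooner, which is the extra truncation described above.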

The significance of this is that in the real world, test scores are not distributed in nice ideal normal distributions. The equating line can assume many shapes and slopes.

The unit of measure needed to plot an equating chart must include equivalent portions of the two distributions. Percentage is a convenient unit: equipercentile equating. [More on this in the next post.]
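A bare-bones sketch of the idea, assuming small made-up score lists and no smoothing or interpolation (operational equipercentile equating adds both): a score on the new test is matched to the reference score that sits at the same percentile rank. The function names are illustrative only.

```python
from bisect import bisect_right

def percentile_rank(score, scores):
    """Percent of scores at or below the given score."""
    ordered = sorted(scores)
    return 100.0 * bisect_right(ordered, score) / len(ordered)

def equipercentile_equate(score, new_scores, reference_scores):
    """Lowest reference score whose percentile rank reaches that of
    `score` on the new test."""
    target = percentile_rank(score, new_scores)
    ordered_ref = sorted(reference_scores)
    for ref_score in ordered_ref:
        if percentile_rank(ref_score, ordered_ref) >= target:
            return ref_score
    return ordered_ref[-1]

# Hypothetical, non-normal score sets of equal size.
new_test = [10, 12, 15, 18, 20, 22, 25, 30, 33, 40]
reference = [20, 24, 27, 30, 31, 35, 38, 45, 52, 60]

print(equipercentile_equate(20, new_test, reference))
```

Because the match is made percentile by percentile, the resulting equating line can bend wherever the two distributions differ in shape.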

Whether Test A is the reference test, or Test B is the reference test, or both are combined as one analysis is the difficult subjective call of the psychometrician. So much depends on the luck on test day related to the test blueprint, the item writers, the reviewers, the field test results, the test maker, the test takers and many minor effects on each of these categories.

This is little different from predicting the weather or the stock market, IMHO. [The highest final test scores at the Annapolis Naval Academy were during a storm with very high negative air ion concentrations.] The above factors also need to include the long list of excuses built into institutionalized education at all levels.

On a four-option item, chance alone injects an average 25% value (that can easily range from 15 to 35%) when students are forced to mark every item on a traditional multiple-choice (TMC) test. Quality is suppressed into quantity by only counting right marks: quality and quantity are therefore linked into the same value. TMC high test scores have higher quality than lower test scores, but this is generally ignored.
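The size of that chance band depends on the number of items, which is not stated here. As a rough sketch, assuming pure guessing and a 20-item test (my assumption), the binomial standard deviation puts one SD either side of 25% at roughly 15% and 35%:

```python
from math import sqrt

def chance_score_band(n_items, n_options=4):
    """Expected percent score from pure guessing, plus or minus one
    binomial standard deviation (also in percent)."""
    p = 1.0 / n_options
    expected = 100.0 * p
    sd = 100.0 * sqrt(p * (1.0 - p) / n_items)
    return expected - sd, expected, expected + sd

# Assumed test length of 20 items; longer tests give a narrower band.
low, expected, high = chance_score_band(20)
print(round(low, 1), expected, round(high, 1))
```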

It does not have to be that way. Both the partial credit Rasch model IRT and Knowledge and Judgment Scoring permit students to report accurately, honestly, and fairly what they trust they know and can do and what they have yet to learn. No guessing is required. Both paper tests and CAT tests can accept, “I trust I know or can do this,” “I have yet to learn this,” and if good judgment does not prevail, “Sorry, I goofed.” Just score 2, 1, and 0, respectively, rather than 1 for each right mark (made for whatever reason or by accident).

A test should encourage learning. The TMC at the lower scores is punitive. By scoring for both quantity and quality (knowledge and judgment) students receive separate scores, just as is done on most other assessments. “You did very well on what you reported (90% right) but you need to do more to keep up with the class” rather than “You failed again with a TMC score of 50%.”
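A small sketch of the 2, 1, 0 scoring and the separate quality score, assuming right marks earn 2, omitted items ("I have yet to learn this") earn 1, and wrong marks earn 0. The function name, the mark labels, and the 20-item example are illustrative, not the output of any particular scoring program:

```python
POINTS = {"right": 2, "omit": 1, "wrong": 0}

def kjs_report(marks):
    """Return (knowledge-and-judgment score %, quality %) for one student.

    Quality is the percent right among the items the student chose to mark.
    """
    n_right = marks.count("right")
    n_wrong = marks.count("wrong")
    marked = n_right + n_wrong
    total = sum(POINTS[m] for m in marks)
    score_pct = 100.0 * total / (2 * len(marks))
    quality_pct = 100.0 * n_right / marked if marked else None
    return score_pct, quality_pct

# Hypothetical student: marks 10 of 20 items, gets 9 right, omits the rest.
marks = ["right"] * 9 + ["wrong"] + ["omit"] * 10
score_pct, quality_pct = kjs_report(marks)
print(score_pct, quality_pct)
```

This student reports little but reports it well: a quality of 90% alongside a test score of 70%, two numbers that a single count of right marks would have folded into one.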

Classroom practice during the NCLB era tragically followed the style of the TMC standardized tests conducted at the lowest levels of thinking. The CCSS tests need to model rewarding students for their judgment as well as right marks. [We can expect the schools to again doggedly try to imitate.] It is student judgment that forms the basis for further learning at higher levels of thinking: one of the main goals of the CCSS movement. The CCSS movement needs to update its use of multiple-choice to be consistent with its goals.

Equating TMC meaninglessness does not improve the results. This crippled form of multiple-choice does not permit students to tell us what they really know and can do that is of value for further learning and instruction.

- - - - - - - - - - - - - - - - - - - - -

Free software to help you and your students experience and understand how to break out of traditional multiple-choice (TMC) and into Knowledge and Judgment Scoring (KJS) (tricycle to bicycle):