Wednesday, September 25, 2013

Visual Education Statistics - Frequency Estimation Equating


Frequency Estimation Equating involves conditioning on the anchor, a set of common items. This post reports my adventures in figuring out how that is done; I needed to understand it before completing the next post, on the conditional standard error of measurement (CSEM).

Two 24-student by 15-item tests, A and B, were drawn from the Nursing124 data. Each included a set of 6 common items that were marked the same on both tests (Table 20). Student scores differed between Test A and Test B based on their marks on the other, non-common items. The common items were sorted by their difficulty.
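The difficulty sort can be done directly from the mark matrix. Here is a minimal sketch in Python, assuming a hypothetical 0/1 mark matrix (24 students by 15 items) and hypothetical common-item positions; classical difficulty is taken as the proportion of right marks per item:

import numpy as np

# Hypothetical 24-student x 15-item matrix of 0/1 marks (1 = right mark).
rng = np.random.default_rng(0)
marks = rng.integers(0, 2, size=(24, 15))

# Classical item difficulty: proportion of right marks on each item.
difficulty = marks.mean(axis=0)

# Hypothetical positions of the 6 common (anchor) items, sorted by difficulty.
common_items = [0, 3, 5, 8, 11, 14]
common_sorted = sorted(common_items, key=lambda j: difficulty[j])
print(common_sorted)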

I then followed the instructions in Livingston (2004, pp. 49-51). The values in Table 20 were tabulated to produce “a row for each possible [student] score” and “a column for each possible score on the anchor [common items]” (Table 21). The tally is turned into frequencies conditioned on the common-item scores by dividing each cell by its column total; the conditional frequencies in each column then sum to 1.00.
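The tabulation and conditioning steps look like this in Python. This is a minimal sketch under assumptions: the tally matrix is hypothetical, rows stand for possible student scores, columns for possible anchor scores, and each column is divided by its own total so the conditional frequencies sum to 1.00:

import numpy as np

# Hypothetical tally in the shape of Table 21: rows = possible student
# scores, columns = possible anchor (common-item) scores; each cell
# counts the students with that combination of scores.
tally = np.array([
    [2., 1., 0.],
    [1., 3., 1.],
    [0., 2., 4.],
    [0., 1., 3.],
])

# Condition on the anchor score: divide each column by its total so the
# conditional frequencies in every column sum to 1.00.
cond_freq = tally / tally.sum(axis=0)
print(cond_freq.sum(axis=0))  # -> [1. 1. 1.]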

Next, the unknown population proportions are estimated by multiplying the conditional frequencies by the equal portion (1/6) each common item contributed to the test (Table 21). These values now represent the on-average expectations for each cell, based on the observed data. Summing across each row produces the estimated (best-guess) population distribution of student scores that could also have produced these on-average expectations. This was done for both Test A and Test B.
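The weighting and row sums continue the sketch above; the equal weight here is 1/3 only because the toy table has three columns (the post uses 1/6), and all values are hypothetical:

import numpy as np

# Conditional frequencies from the previous sketch (each column sums to 1.00).
cond_freq = np.array([
    [2/3, 1/7, 0/8],
    [1/3, 3/7, 1/8],
    [0/3, 2/7, 4/8],
    [0/3, 1/7, 3/8],
])

# Equal weight per column (1/6 in the post; 1/3 for this 3-column toy).
weights = np.full(cond_freq.shape[1], 1.0 / cond_freq.shape[1])

# On-average expectation for each cell; summing across each row gives the
# estimated population score distribution, which itself sums to 1.00.
expected = cond_freq * weights
est_distribution = expected.sum(axis=1)
print(est_distribution, est_distribution.sum())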

[This operation can be worked backward, in part, to recover the right-mark tally. Dividing the population proportions by the number of items in the sample yields the right-mark frequencies; multiplying the right-mark frequencies by the difficulty yields the right-mark tally. But there is no way to back up from the estimated population distribution to this set of population proportions, let alone to individual student marks. The right-mark tally is a property of the observed sample and of individual student marks. The estimated population distribution is a property of the unknowable, normal-curve-related population distribution, which can spawn endless sets of population proportions. Monte Carlo psychometric experiments can be kept clean of the many factors that affect classroom and standardized test results.]

Charts 59 and 60 show the effect produced by conditioning on the common items. The transformation from observed to on-average expectations appears to rotate each distribution about its average test score: 84% for Test A and 80% for Test B. It made a detectable increase in the frequency of high scores and a similar decrease in the frequency of low scores, raising the average scores to 86% and 84%, respectively. Is this an improvement or a distortion?

Livingston (2004) continues: “And when we have estimated the score distributions on both the new form and the reference form, we can use those estimated distributions to do an equipercentile equating, as if we had actually observed the score distributions in the target population.” I carried this out, as in the previous post, with nothing of importance to report.
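For completeness, the equipercentile step can be sketched as below. This is my own minimal version under assumptions (linear interpolation of cumulative proportions, hypothetical score distributions), not Livingston's worked procedure:

import numpy as np

def equipercentile(dist_new, dist_ref, scores):
    # Map each new-form score to the reference-form score that has the
    # same cumulative proportion (linear interpolation between scores).
    cum_new = np.cumsum(dist_new)
    cum_ref = np.cumsum(dist_ref)
    return np.interp(cum_new, cum_ref, scores)

# Hypothetical estimated distributions over scores 0..3.
scores = np.arange(4)
dist_new = np.array([0.10, 0.30, 0.40, 0.20])
dist_ref = np.array([0.05, 0.25, 0.45, 0.25])
print(equipercentile(dist_new, dist_ref, scores))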

So far in this series I have found that data reduction from student marks to a finished product is independent of the content actually on the test. The practice of using several methods and then picking the one that “looks right” has been promoted. Here an unknown population distribution is created from observed sample results. Here we are also given the choice of selecting Test A, Test B, or a combination of the results. As the years pass, it appears that more subjectivity is tolerated in getting test results that “look right” when using traditional, non-IRT, multiple-choice scoring. That charge was formerly directed at Rasch model IRT analysis.

It does not have to be that way. Knowledge and Judgment Scoring and partial-credit Rasch model IRT allow a student to report what is actually meaningful, useful, and empowering: to learn, and to apply what has been learned. This property of multiple-choice is little appreciated.

What traditional multiple-choice is delivering is also little understood: psychometricians guess to what extent sample (actual test) results match an unknowable standard distribution population, based on student marks that include forced guessing, on test items the test creators guess students will find equally difficult, items which, based on a field test, they guess will represent the current test takers, on average.

We still see people writing, “I thought this test was to tell us what [individual] students know.” Yet traditional, forced-choice multiple-choice can only rank students by their performance on the test. It does not ask them, or permit them, to individually report what they actually know or can do based on their own self-judgment: just mark every item (a missing mark is still considered more damaging to an assessment than failing to assess student judgment).

- - - - - - - - - - - - - - - - - - - - - 

Free software to help you and your students experience and understand how to break out of traditional multiple-choice (TMC) and into Knowledge and Judgment Scoring (KJS) (tricycle to bicycle):