Statistic Six: Item discrimination, the last statistic in this series of posts, captures the ability of an item to group students by what they know (and by what they have yet to learn, with Knowledge and Judgment Scoring or partial credit Rasch model scoring). Previous posts have indicated that this ability may be primary in selecting items for standardized tests. It is also important in the classroom. Discriminating items produce the spread of scores needed for setting grades in schools designed for failure.
I left this statistic for last because it differs from the others: it is more complex and more difficult to calculate. However, the standard error of measurement (SEM) engine, post 8, only needed one more step to have the numbers in hand to calculate the Pearson r estimate of item discrimination.
Pearson worked out his item discrimination in a manner that follows the previous posts. He did this by 1895, long before we had personal computers. As a consequence we now have two versions, called the original uncorrected estimate (Excel Pearson function) and the corrected estimate. There is also a shortcut for traditional multiple-choice (TMC) tests: the point biserial r (PBR) I consider at the end of this post.
A visual presentation of the Pearson item discrimination calculation follows (see Table 11 for the calculations).
First, the right marks in the Item 4 column on the Guttman table (Table 12) are counted (10 from 22 students), the average obtained (10/22 = 0.45), and the deviations from the mean obtained (Chart 20).
The same process is carried out on the student score columns (RT of 369 across 22 students, for a SCORE MEAN of 16.77; see Chart 21).
The deviations in each of these two charts sum to zero. This time the individual values are not squared to make them all positive as in Charts 22 (scores) and 23 (items). Instead each student's item deviation is multiplied by the same student's score deviation to produce positive and negative values (Chart 24 and Table 11) that sum to 13.27.
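The item discrimination is then the ratio of this sum of cross products to a denominator built from two sums of squares (SS), one for the item marks and one for the student scores. This operation is carried out for each item on the test.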
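Written out in symbols, this ratio takes the usual Pearson product-moment form. The notation is an assumption made for illustration and does not come from Table 11: x_i is the Item 4 mark (1 or 0) for student i, y_i is that student's score, the bars mark the column means, and N is the number of students. The last steps plug in the sums quoted in this post.

```latex
r = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{N} (y_i - \bar{y})^2}}
  = \frac{13.27}{2.34 \times 9.49}
  = \frac{13.27}{22.21}
  \approx 0.60
```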
Squaring the deviations makes every value in the two SSs positive, so multiplying their square roots in the denominator yields a positive grand SS (2.34 x 9.49 = 22.21). The sign of the ratio therefore comes from the numerator alone. The resulting ratio is the discrimination ability of the item. It can range from minus one to plus one. Values above 0.9 are characteristic of standardized tests. Values for classroom tests will be discussed later.
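The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the IDEngine spreadsheet logic: the function name `item_discrimination` and the six-student data are hypothetical, and only the last line uses numbers taken from this post.

```python
import math

def item_discrimination(marks, scores):
    """Pearson r between one item's marks (1 = right, 0 = wrong) and student scores."""
    n = len(marks)
    mark_mean = sum(marks) / n
    score_mean = sum(scores) / n
    # Deviations from the mean; each list sums to zero.
    mark_dev = [m - mark_mean for m in marks]
    score_dev = [s - score_mean for s in scores]
    # Numerator: sum of the paired cross products (not squared).
    cross = sum(a * b for a, b in zip(mark_dev, score_dev))
    # Denominator: square roots of the two sums of squares, multiplied.
    ss_item = sum(a * a for a in mark_dev)
    ss_score = sum(b * b for b in score_dev)
    return cross / (math.sqrt(ss_item) * math.sqrt(ss_score))

# Hypothetical six-student example (NOT the Nursing124 data).
marks = [1, 1, 1, 0, 1, 0]
scores = [20, 19, 17, 16, 15, 12]
r = item_discrimination(marks, scores)

# Arithmetic check using the sums quoted in the post for Item 4.
post_r = 13.27 / (2.34 * 9.49)
```

The function divides by zero if every student marks the item the same way (all right or all wrong), since such an item has no deviations and so no discrimination to measure.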
Table 12 contains an Item Discrimination Engine you can use to explore the discrimination ability of individual items. [Download free from http://www.nine-patch.com/download/IDEngine.xlsm or .xls]
The point biserial r (PBR) provides an additional glimpse into what is taking place (Table 13). The difference between the average scores of students with right marks and with wrong marks on the item (18.1 – 15.67 = 2.43) is standardized by dividing by the standard deviation of the scores (2.43/2.07 = 1.176). Multiplying this difference in standard units (1.176) by the square root of the product of the proportions of right and wrong marks (p and q), Sqrt(0.45 x 0.55) = Sqrt(0.2475) = 0.4975, yields the PBR item discrimination of 0.59.
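The PBR shortcut can be sketched the same way. The function name `point_biserial` and the six-student data are hypothetical. The sketch assumes the population standard deviation (dividing by N), which is the choice that makes the shortcut agree exactly with the Pearson r on 0/1 marks. A sample standard deviation (dividing by N - 1) gives a slightly smaller value.

```python
import math

def point_biserial(marks, scores):
    """PBR: standardized gap between right and wrong group means, times sqrt(p*q)."""
    n = len(marks)
    right = [s for m, s in zip(marks, scores) if m == 1]
    wrong = [s for m, s in zip(marks, scores) if m == 0]
    p = len(right) / n          # proportion of right marks
    q = 1 - p                   # proportion of wrong marks
    mean_right = sum(right) / len(right)
    mean_wrong = sum(wrong) / len(wrong)
    score_mean = sum(scores) / n
    # Assumption: population SD (divide by N), not the sample SD (N - 1).
    sd = math.sqrt(sum((s - score_mean) ** 2 for s in scores) / n)
    return (mean_right - mean_wrong) / sd * math.sqrt(p * q)

# Hypothetical six-student example (NOT the Nursing124 data).
marks = [1, 1, 1, 0, 1, 0]
scores = [20, 19, 17, 16, 15, 12]
pbr = point_biserial(marks, scores)

# Same steps with the rounded values quoted in the post for Item 4.
post_pbr = (18.1 - 15.67) / 2.07 * math.sqrt(0.45 * 0.55)
```

With the rounded inputs from the post, `post_pbr` lands between 0.58 and 0.59. The small gap from the printed 0.59 comes from rounding the intermediate values.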
The real value or meaning of an item discrimination rank seems to be a matter of tradition and advances in computing power. PUP 5.20 prints out corrected item discrimination values that I gave the following rankings for my classroom tests:
[The PBR only works for traditional multiple-choice, which only ranks students. PUP contains the Pearson r that is required for Knowledge and Judgment Scoring, an actual assessment of what students know and can do, that is meaningful and useful in future assignments.]
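This post does not define the corrected estimate. A common convention, assumed here and not confirmed by the source or by PUP, is to remove the item's own mark from each student score before correlating, so the item is not correlated with itself. The sketch below compares the two versions on a hypothetical mark matrix of five students by four items. All names and data are illustrative.

```python
import math

def pearson(xs, ys):
    """Plain Pearson r from deviations, cross products, and sums of squares."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    cross = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    ss_x = sum((x - x_mean) ** 2 for x in xs)
    ss_y = sum((y - y_mean) ** 2 for y in ys)
    return cross / math.sqrt(ss_x * ss_y)

# Hypothetical mark matrix: one row per student, one column per item.
matrix = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
]

item = 0
marks = [row[item] for row in matrix]
totals = [sum(row) for row in matrix]

# Uncorrected: item marks against the full student score.
uncorrected = pearson(marks, totals)

# Corrected (assumed meaning): item marks against the score without this item.
rest = [t - m for t, m in zip(totals, marks)]
corrected = pearson(marks, rest)
```

On this small matrix the corrected value is noticeably lower than the uncorrected one. The difference shrinks as the number of items grows, since one item then contributes less to the score.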
Item discrimination weights each right and wrong mark with the related student score. Different column mark patterns produce different results. Unlike test reliability, when calculating item discrimination the order, or pattern, of marks is important. Items of the same difficulty can have very different discrimination ability, for example, items 11, 14, 15, 16 and 18 with a difficulty of 91% and a range of item discrimination of -0.02 to 0.58 (Chart 25).
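The effect of mark pattern can be seen in a small sketch. The data are hypothetical: two items have the same difficulty (five of six right), but one is missed by the lowest-scoring student and the other by the highest-scoring student.

```python
import math

def pearson(xs, ys):
    """Plain Pearson r from deviations, cross products, and sums of squares."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    cross = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    ss_x = sum((x - x_mean) ** 2 for x in xs)
    ss_y = sum((y - y_mean) ** 2 for y in ys)
    return cross / math.sqrt(ss_x * ss_y)

# Hypothetical scores, listed from highest to lowest.
scores = [20, 19, 17, 16, 15, 12]

# Same difficulty, different pattern of right (1) and wrong (0) marks.
item_a = [1, 1, 1, 1, 1, 0]   # missed only by the lowest scorer
item_b = [0, 1, 1, 1, 1, 1]   # missed only by the highest scorer

difficulty_a = sum(item_a) / len(item_a)
difficulty_b = sum(item_b) / len(item_b)

r_a = pearson(item_a, scores)   # positive: the wrong mark falls on a low score
r_b = pearson(item_b, scores)   # negative: the wrong mark falls on a high score
```

Both items have the same difficulty, yet one discriminates strongly in the expected direction and the other discriminates in reverse, because each mark is weighted by the score of the student who made it.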
Selecting difficult items is not sufficient to maximize test reliability. The primary need is to write discriminating items. The Nursing124 data delivered discriminating items at all levels of difficulty from 45% to 91% (Chart 25).
The item discrimination results seemed to me to be as unpredictable as test reliability results. IMHO only a visual education statistics engine that combines all six statistics can readily display the interactions.
The standard error of student score measurement (SEM), the test reliability (KR20 and alpha), and the item discrimination (Pearson and PBR) have unpredictable interactions. The Test Performance Profile from PUP 5.20 brings these together in one table for easy use in the classroom by students and teachers (and other interested persons) but lacks the flexibility of a single-sheet spreadsheet engine.
[PUP 5.20 only prints the PBR ranks as an efficient aid for teachers. An additional aid is provided by sorting the discriminating items on PUP 5.20, sheet 3a. Student Counseling Mark Matrix with Mastery/Easy, Unfinished, and Discriminating (MUD) Analysis.]
- - - - - - - - - - - - - - - - - - - - -
Free software to help you and your students experience and understand how to break out of traditional multiple-choice (TMC) and into Knowledge and Judgment Scoring (KJS) (tricycle to bicycle):