I discussed the relationships between statistics in Post 11 based on changing individual answer marks. This post will start the exploration of the limits on these statistics based on student scores and item difficulties. How much change is produced by a specific strategy?
These statistics can all be used to describe what has happened in meaningful, useful ways. Here I begin to look at their ability to predict what may happen. This concern has practical value in being able to safely package descriptive and predictive statistics for maximum marketing effect. In other words, are testing companies really delivering what they claim?
Chart 32 relates observations from seven exploratory strategies:
- A perfect Guttman scaled table in a 21 by 21 cell field.
- A 21 by 20 Guttman table missing the item with a difficulty of zero.
- A 20 by 20 Guttman table also missing the student score of zero.
- A 20 by 20 normal curve distribution based on the SD of the above table.
- A 20 by 20 distribution based on a typical classroom, SD = 10.
- A 20 by 20 bimodal distribution at 35% and 65% student scores.
- A 20 by 20 perfect world distribution for the above table.
The three Guttman scaled tables produced similar results. Removing one or both zero values had little effect. The SD remained around 30% (Chart 32).
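The three Guttman tables can be rebuilt with a short sketch. This is not the spreadsheet behind Chart 32. It assumes the standard KR20 formula, SEM = SD x sqrt(1 - KR20), and the sample (n - 1) variance of raw scores. The names guttman and summarize are made up for this illustration.

```python
from math import sqrt
from statistics import variance


def guttman(scores, n_items):
    """One row per student: the first `score` items are right (1), the rest wrong (0)."""
    return [[1 if item < score else 0 for item in range(n_items)]
            for score in scores]


def summarize(table):
    """Return (SD in %, KR20, SEM in %) for a table of 0/1 marks."""
    n_students, n_items = len(table), len(table[0])
    raw = [sum(row) for row in table]
    var = variance(raw)  # assumption: sample variance (n - 1) of raw scores
    # Sum of item variances p * q, where p is the share of right marks on an item.
    pq = 0.0
    for i in range(n_items):
        p = sum(row[i] for row in table) / n_students
        pq += p * (1 - p)
    kr20 = n_items / (n_items - 1) * (1 - pq / var)
    sd = sqrt(var)
    sem = sd * sqrt(1 - kr20)
    return 100 * sd / n_items, kr20, 100 * sem / n_items


# Students by items, following the list above.
tables = {
    "21 by 21": guttman(range(0, 21), 21),  # scores 0..20, last item has no right marks
    "21 by 20": guttman(range(0, 21), 20),  # zero-difficulty item removed
    "20 by 20": guttman(range(1, 21), 20),  # zero score removed as well
}
results = {name: summarize(t) for name, t in tables.items()}
for name, (sd, kr20, sem) in results.items():
    print(f"{name}: SD = {sd:.2f}%  KR20 = {kr20:.2f}  SEM = {sem:.2f}%")
```

Under these assumptions all three tables give an SD near 30%, which agrees with the observation that removing the zero values changes little.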
The normal curve table was also similar to the 20 by 20 Guttman scaled table but with a reduced SD (25.26%). I was impressed that a normal curve distribution based on the 20 by 20 table SD had so little effect on the SEM and KR20. It did indicate that a shortened score range lowers the SD.
I then configured a table for a typical classroom SD. The SD = 10 table produced the smallest SEM (4.87%), with an SD of 9.25%. Clearly, to have a small SEM, you must have a small SD. But you also run the risk of low test reliability (0.72).
I next configured the table for a typical classroom distribution generally seen when using Knowledge and Judgment Scoring (KJS). Routine use of KJS produces students who have learned to be comfortable using all levels of thinking and sorts them out from the remaining passive pupils in the class, who continue to select traditional multiple-choice (TMC). The bimodal table produced an SEM of 5.98%, with an SD of 16.38%.
I then compared a bimodal table and a perfect world table (see Post 7 in this series). In a perfect world all students receive the same low score or the same high score. The modes were set at the same 35% and 65% locations on both tables. The perfect world table (Table 18) produced similar results: an SEM of 5.39% and an SD of 15.39%.
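The link between the three numbers in this paragraph can be checked with the standard relationship SEM = SD x sqrt(1 - reliability). The sketch below plugs in the rounded values reported above. The function name sem is only for illustration.

```python
from math import sqrt


def sem(sd_pct, reliability):
    """Standard error of measurement from the score SD and the test reliability."""
    return sd_pct * sqrt(1 - reliability)


classroom_sem = sem(9.25, 0.72)
print(round(classroom_sem, 2))  # 4.89
```

The result is close to the 4.87% reported for the table. The small gap is what I would expect from using a reliability rounded to two places.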
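The perfect world SD can be reproduced from the two modes alone. The sketch assumes the 20 students split evenly, 10 at each mode, and that the SD is the sample (n - 1) form. Neither assumption is stated in the text above.

```python
from statistics import stdev

# Assumption: 10 students at the 35% mode and 10 at the 65% mode.
scores = [35] * 10 + [65] * 10

perfect_world_sd = stdev(scores)  # sample standard deviation (n - 1)
print(round(perfect_world_sd, 2))  # 15.39
```

That the result lands on 15.39% suggests the even split and the sample SD are reasonable guesses for this table.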
(Free download of Table 18: http://www.nine-patch.com/download/VESEG3501.xlsm or .xls)
I explored the perfect world table further after seeing that the results for the perfect world table and the bimodal table were close at the 35%-65% mode locations. Chart 33 shows the effects of changing the range of the right and wrong modes on a perfect world table. Increasing the range increased the SEM (red) and test reliability (purple). The first effect is bad; the second effect is good. The price for a high test reliability is a reduced ability to tell if two student scores are significantly different.
Post 11 shows linear relationships when changing individual marks, except for item discrimination. This post, working with student scores and item difficulties, shows a linear relationship for item discrimination: AVG PBR of 0.1, 0.2, 0.3, 0.4 and 0.5 (Chart 33). As the two answer modes are moved farther apart, the average PBR and SD increase linearly, but appear curved on the log base 10 scale.
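The linear growth of the SD as the modes move apart can be seen in a small sweep. This sketch covers only the SD, not the average PBR. It assumes 10 students at each mode, modes placed symmetrically around 50%, and the sample (n - 1) SD. The name two_mode_sd is made up for this illustration.

```python
from statistics import stdev

N_PER_MODE = 10  # assumption: the 20 students split evenly between the two modes


def two_mode_sd(half_gap):
    """SD (in %) when the modes sit at 50 - half_gap and 50 + half_gap."""
    low, high = 50 - half_gap, 50 + half_gap
    return stdev([low] * N_PER_MODE + [high] * N_PER_MODE)


ratios = []
for half_gap in (5, 10, 15, 20, 25):
    sd = two_mode_sd(half_gap)
    ratios.append(sd / half_gap)
    print(f"modes {50 - half_gap}%-{50 + half_gap}%: SD = {sd:.2f}%  SD / half-gap = {sd / half_gap:.4f}")
```

The ratio of SD to the half-gap stays constant, so the SD is a straight line against the distance between the modes. It looks curved only when plotted on a log scale.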
Standardized tests tend to have high SDs. The score distributions tend to be flat, multi-modal. This situation is related to high test reliability and high SEM (Chart 33).
A standardized test must do better than this: under the best possible conditions, an SEM of about 5% is related to a test reliability of about 0.90, with the modes set at 35%-65% and an SD of 15%. It would then take a difference in score from one year to the next of 3 x 5 = 15%, or one and one-half letter grades, to support an acceptable improvement (difference) in student performance, and even that assumes an unattainable test performance.
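The arithmetic in this paragraph can be laid out step by step. The sketch assumes SEM = SD x sqrt(1 - reliability) and a letter grade worth 10 percentage points. The second assumption is inferred from the "one and one-half" figure. The variable names are for illustration only.

```python
from math import sqrt

sd_pct = 15.0        # SD with the modes at 35%-65%
reliability = 0.90   # approximate test reliability

sem_pct = sd_pct * sqrt(1 - reliability)  # about 4.74, rounded to 5 in the text
rounded_sem = round(sem_pct)

threshold_pct = 3 * rounded_sem           # 3 x SEM rule used in the text
POINTS_PER_LETTER_GRADE = 10              # assumption: 10 percentage points per grade
letter_grades = threshold_pct / POINTS_PER_LETTER_GRADE

print(f"SEM = {sem_pct:.2f}%, threshold = {threshold_pct}%, letter grades = {letter_grades}")
```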
The above tables were populated with items drawn from a Guttman scaled table with all items set at their maximum item discrimination. The results then represent the best obtainable, the maximum limit for a 20 by 20 table. We need more students and more test items.
- - - - - - - - - - - - - - - - - - - - -
Free software to help you and your students experience and understand how to break out of traditional multiple-choice (TMC) and into Knowledge and Judgment Scoring (KJS) (tricycle to bicycle):