12
In Post 11 I discussed the relationships between statistics
when individual answer marks are changed. This post starts the exploration
of the limits on these statistics set by student scores and item difficulties.
How much change does a specific strategy produce?
These statistics can all be used to describe what has happened in meaningful, useful
ways. Here I begin to look at
their ability to predict what may happen.
This concern has practical value in being able to safely package descriptive
and predictive statistics for maximum marketing effect. Or reworded, are
testing companies really delivering what they claim?
Chart 32 relates observations from seven exploratory
strategies:
 A perfect Guttman scaled table in a 21 by 21 cell field.
 A 21 by 20 Guttman table missing the item with a difficulty of zero.
 A 20 by 20 Guttman table also missing the student score of zero.
 A 20 by 20 normal curve distribution based on the SD of the above table.
 A 20 by 20 distribution based on a typical classroom, SD = 10.
 A 20 by 20 bimodal distribution at 35% and 65% student scores.
 A 20 by 20 perfect world distribution for the above table.
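As a minimal sketch of how the three Guttman scaled tables could be built and summarized, the Python below fills a table of right (1) and wrong (0) marks and computes the SD, KR20 and SEM. It is not the spreadsheet behind Chart 32. The function names are hypothetical, and the use of the population variance of the scores is an assumption; a spreadsheet that uses the sample variance will give slightly different values.

```python
import math
from statistics import pvariance


def guttman_table(n_students, n_items, drop_zero_score=False):
    """Rows are students, columns are items; 1 = right, 0 = wrong.

    A student with score s gets the s easiest items right and nothing else.
    Scores start at 0, or at 1 when the zero score is dropped.
    """
    start = 1 if drop_zero_score else 0
    scores = range(start, start + n_students)
    return [[1 if item < s else 0 for item in range(n_items)] for s in scores]


def summarize(table):
    """Return SD and SEM (as percent of the item count) and KR20."""
    n = len(table)        # number of students
    k = len(table[0])     # number of items
    scores = [sum(row) for row in table]
    var = pvariance(scores)  # assumption: population variance
    # Item difficulty p = proportion of students marking the item right.
    p = [sum(row[j] for row in table) / n for j in range(k)]
    sum_pq = sum(pj * (1 - pj) for pj in p)
    kr20 = (k / (k - 1)) * (1 - sum_pq / var)
    sd = math.sqrt(var)
    sem = sd * math.sqrt(1 - kr20)
    return {"sd_pct": 100 * sd / k, "kr20": kr20, "sem_pct": 100 * sem / k}


if __name__ == "__main__":
    tables = {
        "21 by 21": guttman_table(21, 21),
        "21 by 20": guttman_table(21, 20),
        "20 by 20": guttman_table(20, 20, drop_zero_score=True),
    }
    for name, table in tables.items():
        print(name, summarize(table))
```

In this sketch the 21 by 21 table holds scores 0 through 20 on 21 items, so the last item is marked right by no one (a difficulty of zero). Dropping that item, and then the student with a score of zero, gives the other two tables.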
The three Guttman scaled
tables produced similar results. Removing one or both zero values had
little effect. The SD remained around 30% (Chart 32).
The normal curve
table was also similar to the 20 by 20 Guttman scaled table but with a
reduced SD (25.26%). I was
impressed that a normal curve distribution based on the 20 by 20 table SD had so little
effect on the SEM and KR20. It did indicate that a shortened score range lowers
the SD.
I then configured a table for a typical classroom SD. The SD = 10 table produced the smallest SEM
(4.87%), with an observed SD of 9.25%. Clearly, to have a small SEM, you must have a small SD.
But you also run the risk of low test reliability (0.72).
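The link between these three numbers can be checked with the standard classical test theory relation SEM = SD x sqrt(1 - reliability). The sketch below applies it to the values reported for the classroom table. The function name is hypothetical, and the relation is assumed to be the one the spreadsheet uses.

```python
import math


def sem_from_sd(sd_pct, reliability):
    """Standard error of measurement from the SD and test reliability."""
    return sd_pct * math.sqrt(1.0 - reliability)


if __name__ == "__main__":
    # Values reported above for the SD = 10 classroom table.
    print(round(sem_from_sd(9.25, 0.72), 2))
```

With SD = 9.25% and a reliability of 0.72 this gives about 4.9%, close to the reported 4.87%. The small gap is consistent with the reliability having been rounded to two places.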
I next configured the table for a typical classroom
distribution generally seen when using Knowledge
and Judgment Scoring (KJS). Routine use of KJS sorts out those students
who have learned to be comfortable using all levels of thinking from the
remaining passive pupils in the class who continue to select traditional
multiple-choice (TMC). The bimodal table
produced an SEM of 5.98% with an SD of 16.38%.
I then compared the bimodal table with a perfect world table (see Post 7 in this series). In a perfect world
all students receive the same low score or the same high score. The modes
were set at the same 35% and 65% on both tables. The perfect world table (Table
18) produced similar results: SEM = 5.39%, SD = 15.39%.
(Free download of Table 18: http://www.ninepatch.com/download/VESEG3501.xlsm
or .xls)
I explored the perfect world table further after seeing that
the results for the perfect world table and the bimodal table were close at the
35% and 65% mode locations. Chart 33 shows the effects of changing the range of the
right and wrong modes on a perfect world table. Increasing the range increased the SEM (red) and test reliability
(purple). The first effect is bad, the second effect is good. The price for a high test reliability is a reduced ability to tell if two student scores are significantly different.
Post 11 shows linear relationships when changing individual
marks, except for item discrimination. This post, working with student scores
and item difficulties, shows a linear relationship for item discrimination: AVG PBR of 0.1, 0.2, 0.3, 0.4 and 0.5 (Chart 33). As the two answer modes are moved farther apart, the average PBR and SD increase linearly, but appear curved on the log base 10 scale.
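For readers who want to see how an average PBR can be computed, here is a minimal sketch. It treats the PBR (point-biserial r) of an item as the Pearson correlation between the item's right/wrong marks and the students' total scores. The function names and the small table are hypothetical, and including the item itself in the total score is an assumption; the spreadsheet may do this differently.

```python
import math


def point_biserial(item_marks, total_scores):
    """Pearson correlation between 0/1 item marks and total scores."""
    n = len(item_marks)
    mean_x = sum(item_marks) / n
    mean_y = sum(total_scores) / n
    cov = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(item_marks, total_scores))
    var_x = sum((x - mean_x) ** 2 for x in item_marks)
    var_y = sum((y - mean_y) ** 2 for y in total_scores)
    if var_x == 0 or var_y == 0:
        # The correlation is undefined when everyone has the same mark or
        # the same score; reporting 0.0 here is an assumption.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def average_pbr(table):
    """Mean point-biserial r over all items (rows = students)."""
    scores = [sum(row) for row in table]
    n_items = len(table[0])
    pbrs = [point_biserial([row[j] for row in table], scores)
            for j in range(n_items)]
    return sum(pbrs) / n_items


if __name__ == "__main__":
    # Hypothetical 4-student by 3-item table, for illustration only.
    demo = [[0, 0, 0], [1, 0, 0], [1, 1, 0], [1, 1, 1]]
    print(average_pbr(demo))
```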
Standardized tests tend to have high SDs. The score
distributions tend to be flat and multimodal. This situation is related to high
test reliability and high SEM (Chart 33).
A standardized test must do better than this: under the best
possible conditions, an SEM of about 5% goes with a test reliability of about
0.90 when the modes are set at 35% and 65% and the SD is 15%. It would then take a
difference in score from one year to the next of 3 x 5 = 15%, or one and
one-half letter grades, to support an acceptable improvement (difference) in
student performance, and even that assumes an unattainable test performance.
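The arithmetic above can be written out as a short sketch. The function names are hypothetical. The multiplier of 3 is the rule used in this post, and the 10 percentage points per letter grade is an assumption inferred from equating 15% with one and one-half letter grades.

```python
def min_meaningful_difference(sem_pct, multiplier=3):
    """Smallest score difference (in percent) treated as a real change."""
    return multiplier * sem_pct


def in_letter_grades(difference_pct, grade_width_pct=10):
    """Express a score difference in letter grades (assumed 10% each)."""
    return difference_pct / grade_width_pct


if __name__ == "__main__":
    diff = min_meaningful_difference(5)   # SEM of about 5%
    print(diff, in_letter_grades(diff))
```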
The above tables were populated with items drawn from a
Guttman scaled table with all items set at their maximum
item discrimination. The results then represent the best obtainable, the maximum
limit for a 20 by 20 table. We need more students and more test items.
                   

Free software to help you and your students
experience and understand how to break out of traditional multiple-choice (TMC)
and into Knowledge and Judgment Scoring (KJS) (tricycle to bicycle):