Standardized test makers use statistics to predict what may happen; classroom statistics describe what has happened. Classroom tests involve two or three dozen students; standardized test making requires several hundred. Classroom tests are given to find out what a student has learned and what has yet to be learned. Standardized tests are generally given to rank students against a benchmark test sample. Classroom and standardized tests have other significant differences, even though they may use many of the same items.
I took the two classroom charts (37 and 38 in a previous post) and extended the standard deviations (SD) from 5-10% to 10-30%, a more realistic range for standardized tests (Chart 44). At a 70% average score and 20% SD, the normal curve plots of 40 students by 40 items started going off scale. I then reversed the path back to the original average score of 50% as the SD rose from 20% to 30%.
The test reliability (KR20) continued to rise with the SD for these normal distributions set for maximum performance. The item discrimination (PBR) rose slightly. The relative SEM/SD value decreased (improved) from 0.350 to 0.157 as test reliability increased (improved).
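The KR20 and SEM calculations behind these values are simple enough to sketch. Below is a minimal illustration in Python, using a small hypothetical 6-student by 5-item mark table (a perfect Guttman pattern for clarity, not the VESE tables themselves); the variable names are my own.

```python
import math

# Hypothetical 0/1 mark table: 6 students x 5 items (rows = students).
# A perfect Guttman pattern -- not the VESE tables, just an illustration.
marks = [
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]

k = len(marks[0])                      # number of items
scores = [sum(row) for row in marks]   # student total scores
n = len(scores)
mean = sum(scores) / n
var = sum((s - mean) ** 2 for s in scores) / n   # population variance

# KR20 = (k / (k - 1)) * (1 - sum(p * q) / score variance)
pq = 0.0
for j in range(k):
    p = sum(row[j] for row in marks) / n   # item difficulty (proportion right)
    pq += p * (1 - p)
kr20 = (k / (k - 1)) * (1 - pq / var)

sd = math.sqrt(var)
sem = sd * math.sqrt(1 - kr20)         # standard error of measurement
print(f"KR20 = {kr20:.3f}, SD = {sd:.3f}, SEM = {sem:.3f}")
```

As the SD of the score distribution grows (with the item pq values fixed), the variance term in the denominator grows, so the KR20 rises, which is the pattern the chart shows.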
The two tests with average test scores of 50% yielded very different test reliability and item discrimination values for SD values of 10% and 30% on Chart 44; the greater the distribution spread, the higher the KR20 and PBR values. [I plotted the N – 1 SD to show how close the visual education statistics engine (VESE) tables were to their expected normal curves.]
The SD is thus a key indicator of test performance; the spread of the student score distribution is the main goal for standardized test makers. It is also very sensitive to extreme values. The 30% SD plot was made by teasing the VESE table that I had set for 30% SD. The original SD value was near that of a perfect Guttman table (each student score and each item difficulty appears only once), about 28%. Moving four pairs of marks, near the extreme ends of the distribution, one count further toward the ends raised the SD to 30%. That is, moving four pairs of marks out of 400, one count each, changed the SD by 2%.
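This sensitivity to extreme values is easy to demonstrate. The sketch below uses a small hypothetical set of percent scores (not the actual VESE table): moving just the two most extreme scores one step further out raises the SD by several points.

```python
import math

def sd(xs):
    """Population standard deviation."""
    m = sum(xs) / len(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))

# Hypothetical symmetric percent-score distribution; not the VESE table.
scores = [20, 30, 40, 50, 50, 60, 70, 80]
before = sd(scores)
print(f"SD before: {before:.2f}")

# Move only the two most extreme scores one step (10 points) further out.
scores[0] -= 10
scores[-1] += 10
after = sd(scores)
print(f"SD after:  {after:.2f}")
```

Two of eight scores moved, and the SD jumps by about four points; the squared deviations in the SD formula weight the extremes heavily.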
The standard error of measurement (SEM) under optimum normal test conditions remained about 4.4% (Chart 44). So, 4.4 x 3 = 13.2%. A difference in a student's performance of more than 13.2% would be needed to accept the scores as representing a significant improvement with a test reliability of 0.95. None of the above mark patterns were mixed, which is an unrealistically optimum performance.
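The arithmetic behind this threshold can be checked directly from the relationship SEM = SD x sqrt(1 - reliability). A minimal sketch using the approximate chart values (SD of 20%, KR20 of 0.95); the exact VESE values may differ slightly:

```python
import math

# SEM = SD * sqrt(1 - reliability), using approximate Chart 44 values.
sd = 20.0      # percent
kr20 = 0.95
sem = sd * math.sqrt(1 - kr20)

print(f"SEM = {sem:.2f}%")          # about 4.47%, close to the 4.4% in Chart 44
print(f"3 x SEM = {3 * sem:.2f}%")  # gain needed before calling an improvement real
```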
I looked again at the effect of mixing right and wrong marks in an item mark pattern with a higher SD value than found in the classroom (Chart 45). The change from an SD of 10% to 20% was much smaller than I had anticipated. The effect of deeper mixing was again linear.
Average item difficulty sets limits on the maximum PBR that can be developed (Chart 46). In a perfect world where all items are marked either all right or all wrong, the maximum PBR is 1.0 for individual items.
Looking back at prior posts, I found lower values on a perfect Guttman table (0.84) and a normal curve table set at 30% SD (0.85). The PBR declined along with the SD set to 20% and 10% (Chart 46).
These values hold for tests with average test scores that range from 50% to 70%.
There is now enough information to construct the playing field upon which psychometricians play (Chart 47). I chose two scoring configurations: Perfect World and Normal Curve with an SD of 20%. The area in which standardized tests exist is a small part of the total area that describes classroom tests. The average student score and item difficulty were set at 50%.
An item mark pattern at 50% difficulty can produce a PBR of 1.0 in a perfect world (blue). All right marks are together and all wrong marks are together. The PBR drops to zero with complete mixing (Table 20). It falls to -1.0 when all right marks are together at the lower end of the mark pattern.
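These three cases can be illustrated with the point-biserial calculation itself, which is the Pearson correlation between an item's 0/1 mark pattern and the student total scores. The data below are hypothetical, not from the VESE tables:

```python
import math

def pearson(xs, ys):
    """Pearson correlation; with 0/1 xs this is the point-biserial."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return cov / (sx * sy)

# Hypothetical student total scores, sorted low to high (not the VESE tables).
totals = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

unmixed   = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]  # top half right: strong positive PBR
mixed     = [0, 1, 1, 0, 0, 1, 1, 0, 0, 1]  # right and wrong interleaved: PBR near zero
reversed_ = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]  # bottom half right: strong negative PBR

for name, item in [("unmixed", unmixed), ("mixed", mixed), ("reversed", reversed_)]:
    print(f"{name:9s} PBR = {pearson(item, totals):+.2f}")
```

With real, finite score distributions the unmixed pattern lands below 1.0 (here about +0.87), which is consistent with the 0.84-0.85 maximums I found on the Guttman and normal curve tables.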
The area for the normal curve distribution (red) with a SD of 20% fits inside the perfect world boundary. This entire area is available to describe classroom test items. Items that are easier or more difficult than 50% reduce the maximum possible PBR. They have shorter mark patterns. And here too, fully mixed patterns drop the PBR to zero.
We can now see the problem psychometricians face in making standardized tests. The standardized test area is about 1/8th of the classroom area. Standardized tests never use negative items (which almost excludes misconceptions, since they cannot be distinguished from difficult items with traditional multiple-choice scoring, as they can be with Knowledge and Judgment Scoring).
Chart 44 indicates an average PBR of over 0.5 is needed for the desired test reliability of over 0.95 under optimum conditions (no mark pattern mixing). With just ¼ mixing, the window for usable items becomes very small. The effect of mixing right and wrong marks in an item mark pattern varies with item difficulty. A test averaging 75% right with unmixed items would be the same as a test averaging 50% right with partially mixed items.
A 2008 paper from Pearson, by Tony D. Thompson, confirms this situation. “This variation, we argue, likely renders non-informational any vertical scale developed from conventional (non-adaptive) tests due to lack of score precision” (page 4). “Non-informational” means not useful, not valid, does not look right, and does not work, IMHO. “Conventional” means, in general, paper tests and the fixed form tests being developed by PARCC for online delivery for the Common Core State Standards (CCSS) movement.
This comment may be valid for “many educational tests” (page 14). “Also, if an individual’s observed growth is much larger than the associated CSEM, then we may be confident that the individual did experience growth in learning.” This indicates that using simulations within the playing field, as Thompson did, confirms my exploration of the limits of the playing field. [And the CSEM, which is applied to each score, is more precise than the SEM based on the average test score.]
“While a poorly constructed vertical scale clearly cannot be expected to yield useful scores, a well-defined vertical scale in and of itself does not guarantee that reported individual scores will be precise enough to support meaningful decision-making” (page 28). This cautionary note was written in 2008, several years into the NCLB era.
The VESE tables indicate that the “best we can do” is not good enough to satisfy marketing department hype (claims). Testing companies are delivering what politicians are willing to pay for: a ranking of students, teachers, and administrators based only on a test producing scores of questionable precision. Any additional use of these test scores is problematic.
An unbelievable situation is currently being challenged in court in Florida. Student test scores were used to “evaluate” a teacher who never had the students in class! It reveals the mindset of people using standardized test scores. They clearly do not understand what is being measured and how it is being measured. [I hope I do by the end of this series.] Just because something has been captured in a number does not mean that the number controls that something.
Scoring all the data that answer sheets can capture would provide the information (repeatedly sought but ignored in traditional multiple-choice) needed to guide student, teacher, and administrator development. Schools designed for failure (“Who can guess the answer?”), fail. Schools designed for success have rapid, effective feedback, with student development (judgment) held as important as knowledge and skills. Judgment comes from understanding, a goal of the CCSS movement.
- - - - - - - - - - - - - - - - - - - - -
Free software to help you and your students experience and understand how to break out of traditional multiple-choice (TMC) and into Knowledge and Judgment Scoring (KJS) (tricycle to bicycle):