Adding more items of the same difficulty to a perfect world test on the VES Engine (Table 18) did not change the average student score (50%), the standard deviation (SD of 15.39%), the standard error (SE of 3.44%), or the average item discrimination (PBR of 0.30). The test reliability (KR20) improved, but the standard error of measurement (SEM) showed a marked improvement (Chart 35).
This makes sense. The more items on a test, the greater the test reliability; the greater the test reliability, the smaller the range into which repeated scores from the same student can be expected to fall. Doubling the number of items twice, from 20 to 80 items, cut the SEM in half, from 5.39% to 2.64%. Doubling twice more, to 320 items, again cut the value in half, to 1.32%.
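The pattern above can be sketched in a few lines of Python. This is a minimal sketch assuming the SEM shrinks with the square root of test length (so quadrupling the item count halves the SEM); the function name and the 1/sqrt(n) model are mine, not the VES Engine's. The computed values land close to, but not exactly on, the 2.64% and 1.32% reported above, since the engine computes KR20 directly rather than scaling.

```python
import math

def sem_for_items(sem_base, n_base, n):
    """Scale a known SEM to a new test length, assuming SEM falls
    with the square root of the number of items (1/sqrt(n) model)."""
    return sem_base * math.sqrt(n_base / n)

# Starting from the 20-item test's SEM of 5.39% reported above:
for n in (20, 80, 320):
    print(f"{n:>3} items: SEM ~ {sem_for_items(5.39, 20, n):.2f}%")
```

Each quadrupling of the item count divides the predicted SEM by two, which is the halving pattern described in the paragraph above.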
The Common Core State Standards (CCSS) movement is now bringing into practice testing with an average difficulty of 50%. This optimizes test performance but bullies students.
A class of 20 students, IMHO, can produce usable results if eight 40-item tests are used during the course. With an SEM of 1.32%, scores from the same student would only need to be 3 x 1.32% = 3.96% apart to show acceptable improvement in performance.
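The 3 x SEM rule of thumb is easy to check. A quick sketch (the variable names are mine; the 1.32% SEM comes from the 320-item case above):

```python
# SEM of 1.32% from eight 40-item tests (320 items total), as above.
sem = 1.32

# Two scores from the same student must differ by more than about
# three SEMs before the change can be trusted as real improvement.
threshold = 3 * sem
print(f"Minimum gain to show real improvement: {threshold:.2f}%")  # 3.96%
```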
Testing companies can then market a single test, with a total of 80 to 160 items, that will rank students and teachers with acceptable precision based on test scores. Each student will have to read every item on paper. Computer adaptive testing (CAT) will generally require fewer items than that, which means CAT students will not take the same test.
Again, testing is optimized for the testing companies, who are only being required to rank students. They can calibrate items on a group of representative students. They can then present different items, comparable only in difficulty, as equivalent items. This only makes sense if every student has the same general background and preparation and is an average student with average luck on test day. The practice reduces individuality and eliminates creativity. It does not have to be that way.
Armed with the above ability to rank students, testing companies are also marketing more tests: formative, summative, and, in between, “submative” (neither formative nor summative). The same items can be used on all three. The difference is that the formative process takes place in such a timely manner that the student learns (in seconds to minutes at higher levels of thinking and in minutes to days at lower levels of thinking). The summative test measures what has happened, not what is being learned at the moment.
The “submative” test falls in between as a subtest, but again measures the past. IMHO it also hints that buying such a test is better, in the short term, for school administrators than letting a good teacher assess in a normal classroom. Relying on short-term, lower-level-of-thinking tests that only rank students does not promote the development students need to become successful, self-educable, high-quality achievers. (CCSS movement multiple-choice test questions may be highly contrived, requiring considerable problem-solving skills, but are still scored easier than a bingo operation: good luck on finding the right answer, with 1/4 free instead of 1/25 free.)
It does not have to be that way. The very same items can be scored to promote student development, function as formative experiences, and provide immediate guidance for teaching. Just because testing companies can deliver high-quality rankings does not mean we should limit the return on the time and money invested (by students, teachers, and taxpayers) to just ranking. This cripples schooling. The decade of NCLB experience presents the evidence here.
As suggested in the previous post, we need more than 20 test items and, IMHO, a test scored for what students trust they actually know and can do, such as Power UP Plus by Nine-Patch Multiple-Choice, the partial-credit Rasch model by Winsteps, and Amplifire by Knowledge Factor.
- - - - - - - - - - - - - - - - - - - - -
Free software to help you and your students experience and understand how to break out of traditional multiple-choice (TMC) and into Knowledge and Judgment Scoring (KJS) (tricycle to bicycle):
[I again checked the test reliability values with the Spearman-Brown prophecy formula (Table 19). At this high end of the range, they closely matched the results from the VES Engine. The test with 20 items made four predictions that were increasingly close to the observed (x1) test reliability.]
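For readers who want to repeat the check, the Spearman-Brown prophecy formula predicts the reliability of a test lengthened by a factor k from the reliability r of the original test: predicted r = k*r / (1 + (k - 1)*r). A minimal sketch, using a hypothetical starting reliability of 0.70 (the Table 19 values are not reproduced here):

```python
def spearman_brown(r, k):
    """Predicted reliability when a test with reliability r
    is lengthened to k times its original number of items
    (Spearman-Brown prophecy formula)."""
    return k * r / (1 + (k - 1) * r)

# Hypothetical example: a 20-item test with reliability 0.70,
# lengthened to 2x, 4x, 8x, and 16x its original length
# (40, 80, 160, and 320 items).
for k in (2, 4, 8, 16):
    print(f"x{k:>2}: predicted reliability = {spearman_brown(0.70, k):.3f}")
```

As the paragraph above notes, the predictions climb toward 1.0 as the test gets longer, which is why the SEM keeps shrinking even though the SD stays put.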