The information that needed to be related in post 7 became too long for one post. Post 7 contains the SEMEngine: all five of the related statistics on one spreadsheet. This post relates a collection of material that gives those statistics additional meaning, a bit of understanding needed to use them properly.
The SEMEngine, in the previous post, can produce the unpredictable statistics relevant to classroom tests and standardized tests. But a full understanding of these statistics requires a discussion of a second standard error and the two methods of scoring multiple-choice (traditional multiple-choice, TMC, and Knowledge and Judgment Scoring, KJS): partial and full disclosure of what a student knows and can do.
The standard deviation (SD) of the group test score and the standard error of measurement (SEM) of the average student test score provide guidance in constructing standardized tests as predictive inputs. These statistics are also helpful in describing classroom test results. The first refers to test results from the class or group taking the test, the average group score; the second, to the average student score in the class. They are two different perspectives of the same average score. They have different uses.
There is a second standard error, the standard error of the mean (SE), that permits comparison between group test scores. [I am belaboring this topic because the two standard errors (of the mean and of measurement), the abbreviation (SEM), and even the SD can get confused (see Standard error and Standard error vs. Standard error of measurement).]
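To keep the two standard errors apart, here is a minimal sketch using the textbook definitions: SEM = SD x sqrt(1 - reliability) for one student's score, and SE = SD / sqrt(N) for the class average. The function names and the input values (SD, reliability, class size) are assumptions made for illustration. They are not the Cantrell data, and the SEMEngine spreadsheet may arrive at its values by a different route.

```python
import math


def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """SEM = SD * sqrt(1 - reliability): uncertainty in one student's score."""
    return sd * math.sqrt(1.0 - reliability)


def standard_error_of_mean(sd: float, n_students: int) -> float:
    """SE = SD / sqrt(N): uncertainty in the class (group) average score."""
    return sd / math.sqrt(n_students)


# Hypothetical inputs chosen for easy arithmetic (NOT the Cantrell data).
sd = 20.0           # standard deviation of the test scores, in percent
reliability = 0.75  # test reliability coefficient
n_students = 25     # number of students in the class

sem = standard_error_of_measurement(sd, reliability)
se = standard_error_of_mean(sd, n_students)
print(f"SEM = {sem:.2f}%  SE = {se:.2f}%")  # SEM = 10.00%  SE = 4.00%
```

Both values start from the same SD. The SEM shrinks as the test becomes more reliable; the SE shrinks as more students take the test.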
Chart 18 shows how the SEM of the average student score is reduced as more equivalent items are added to the Cantrell data of 14 items. A 50-item test is expected to yield a SEM of 5.15%. This is less than 1/3 the range of the SD. But even this would require an improvement of 3 x 5.15 = 15.45% for a significant increase in performance from one year or one test to the next. That is 1.5 times a traditional letter grade. To my knowledge, very few standardized tests use 50 items in any topic or skill area.
Chart 19 shows how the SE of the classroom or group test score is reduced as more equivalent items are added to the Cantrell data. The SE has a finer resolution than the SEM. On a 50-item test, a significant difference between the scores of two different classes, or of one class at two different times, would require an improvement in class performance of 3 x 2.57 = 7.71%, only about 3/4 of a letter grade. It is therefore easier to show a significant difference between the average scores from two tests than between two scores from the same student.
[The above can be generalized to support the traditional score range of 10% per letter grade.]
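The arithmetic behind the two charts can be checked with a short sketch. It uses the figures quoted above (SEM of 5.15% and SE of 2.57% on a 50-item test, a multiplier of 3, and 10% per letter grade); the function and constant names are my own, chosen only for this illustration.

```python
LETTER_GRADE_WIDTH = 10.0  # percent per traditional letter grade
MULTIPLIER = 3.0           # the "3 x error" rule used in the text


def significant_change(standard_error_pct: float) -> float:
    """Smallest score change (in percent) treated as significant."""
    return MULTIPLIER * standard_error_pct


def in_letter_grades(change_pct: float) -> float:
    """Express a change in percent as a number of letter grades."""
    return change_pct / LETTER_GRADE_WIDTH


# Values quoted in the text for a 50-item test.
for label, error_pct in [("SEM, one student", 5.15), ("SE, class average", 2.57)]:
    change = significant_change(error_pct)
    print(f"{label}: {change:.2f}% = {in_letter_grades(change):.2f} letter grades")
```

The student-level threshold comes out at about one and a half letter grades, the class-level threshold at about three quarters of one.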
I retitled this post as “Teacher Effectiveness” after looking at the above two charts (18 and 19). These statistics provide a means of measuring teacher effectiveness, or at least of ranking it. To measure teacher effectiveness, the proportion of students electing TMC or KJS on the test would also have to be included.
[A class selecting mostly TMC is in a lower level of thinking classroom environment populated with passive pupils conditioned to mark an answer to every item. A class selecting mostly KJS is in a higher (all) level of thinking classroom environment populated with self-motivated, self-correcting high quality achievers who are mature enough to distinguish between what they have yet to learn and what they know and can do that can serve as the basis for further learning and instruction.]
Student development is as important as knowledge or skill. The CCSS movement promotes this idea too but without the simplicity of multiple-choice (in time and money).
These visualized statistical models of the real world have been found to have practical value in making predictions (a most expected mid-point on a range of possibilities). However, what we feed into these statistics determines the validity and usefulness of the results. The concrete reality that you got a score of 50% on a classroom test becomes transformed into an abstract prediction: within ± 1 SD, that score (and your next score on an equivalent test) just might have been anywhere between 30% and 70% on an equivalent standardized test. And further, using the SEM, the range may be reduced to between 45% and 55% (generalized from Chart 18).
Test scores (and these first five reviewed statistics) are easily manipulated by the selection of questions on the test and how the test is scored. The traditional multiple-choice test (forced-choice test) is a game with a built-in handicap of over 20%. This manipulation of scores is so traditional (so hardened to change) that little thought is given to it, with the exception of when elementary school students take their first multiple-choice tests.
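The two ranges for a 50% score can be reproduced with a small sketch. The SD of 20% and SEM of 5% are the round values implied by the ranges quoted above, not exact figures from the charts, and `score_band` is a name invented for this illustration.

```python
def score_band(score_pct: float, error_pct: float, n_errors: float = 1.0):
    """Return (low, high) for score +/- n_errors * error, clipped to 0..100%."""
    low = max(0.0, score_pct - n_errors * error_pct)
    high = min(100.0, score_pct + n_errors * error_pct)
    return low, high


# Assumed round values implied by the text: SD = 20%, SEM = 5%.
sd_band = score_band(50.0, 20.0)
sem_band = score_band(50.0, 5.0)
print("50% +/- 1 SD: ", sd_band)   # (30.0, 70.0)
print("50% +/- 1 SEM:", sem_band)  # (45.0, 55.0)
```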
Learning to lie is difficult for serious students; they know a best guess is not a reflection of their abilities. It is just sugar coating and a distraction from the ugly truth. Students with equal abilities, but receiving lower test scores, rightly feel cheated by their poor luck on test day. In time, these students just mark, finish the test, and then get back to their world where they do have some control. Since there is no way of knowing if a right mark is a right answer or a lucky answer, there is no need to take the test seriously except for where their score falls in the class distribution (their rank).
[This practice is institutionalized when their class rank is provided in college admission documents.]
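The size of the forced-choice handicap, and of the luck of the draw on test day, can be estimated with a simple binomial model. This is a rough sketch under stated assumptions, not a description of how any real test behaves: the student truly knows a fixed fraction of the items and marks the rest blindly, with each option equally likely. The function name and the example values (50 items, 4 options) are mine.

```python
import math


def forced_choice_score(known_fraction: float, n_items: int, n_options: int):
    """Return (expected score %, SD of score %) when every unknown item is guessed."""
    n_guessed = n_items * (1.0 - known_fraction)  # items marked blindly
    p = 1.0 / n_options                           # chance of a lucky right mark
    expected_hits = n_guessed * p
    sd_hits = math.sqrt(n_guessed * p * (1.0 - p))  # binomial standard deviation
    expected_pct = 100.0 * (n_items * known_fraction + expected_hits) / n_items
    sd_pct = 100.0 * sd_hits / n_items
    return expected_pct, sd_pct


# A student who knows nothing still expects 20% to 25% from guessing alone.
print(forced_choice_score(0.0, 50, 5)[0])  # 20.0
print(forced_choice_score(0.0, 50, 4)[0])  # 25.0

# A student who knows half of a 50-item, 4-option test.
expected, spread = forced_choice_score(0.5, 50, 4)
print(f"expected {expected:.1f}%, luck alone varies it by about {spread:.1f}%")
```

In this model, two students with identical knowledge can land several percentage points apart through luck alone, and nothing in the score separates the right answers from the lucky marks.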
The traditional multiple-choice test (TMC) is fast, cheap, and marketed way beyond its valid ability to rank students IMHO. It is, as my students put it, Dumb testing. The statistics are not an accurate, honest and fair reflection of their individual abilities.
TMC IMHO drives students away from developing into self-motivated, self-correcting, high quality achievers. Statistics will not change the outcome. There is a better (alternative) method of multiple-choice assessment, KJS, at no additional cost that will guide their development. An effective teacher motivates students to be ready to learn and to want to learn.
A multiple-choice test can be used to permit students to report what they actually know, understand, and find useful as the basis for further learning and instruction. All that is required is an extraction of student judgment (something that is considered an essential part of almost all alternative and authentic assessments and soon the elaborate CCSS assessments). Please check out Smart testing: Knowledge and Judgment Scoring, partial credit Rasch model, and Confidence Based Assessment, for example. All three promote student development that yields high test scores, long term, and with a minimum of review.
- - - - - - - - - - - - - - - - - - - - -
Free software to help you and your students experience and understand how to break out of traditional multiple-choice (TMC) and into Knowledge and Judgment Scoring (KJS) (tricycle to bicycle):