Two recent news items highlight the problems produced by faulty communication among the assessment, education, and political communities. A test may be inadequate to deliver the requested information, and this has played out in two different ways: in the state of Washington, the test was used anyway; in Texas, a test was replaced with a Projection Measure that “was so silly that it was killed” after a brief use.
Large amounts of money are involved in such exercises. A satisfied customer in this area must be able to understand the limits of what is being purchased. We do not want to show up for dinner at 7:00 pm only to find it was served at 12:00 noon. “Dinner” and “lunch” can refer to the same meal or to different meals, depending on the culture.
Psychometricians have been lax in communicating what they do, in an understandable form, to the cultures that finance them and to those who attempt to make valid use of their work. In 50 years of experience, I have not found a unified expression of common education statistics, or a way of accomplishing that feat, that is meaningful and therefore useful. The personal computer, the interactive spreadsheet, and the Internet should now make this possible.
This set of posts is designed so that anyone interested in the topic of multiple-choice testing can see inside six commonly used education statistics. The series will also include Excel what-if engines to animate them. You only understand after you have experienced. It is only when several statistics are combined that their interactions and limits become visible. Combining statistics interactively also simplifies the naming of variables, as only one name is needed where several may be used otherwise.
I will attempt to produce an understandable graphic for each of six common education statistics that I have encountered being used with traditional multiple-choice tests (TMC):
- right mark counts or student scores and item difficulties
- average or mean
- standard deviation of the mean or the spread of the distribution of scores
- test reliability or the ability to reproduce the same scores
- standard error of measurement or the range in which a student’s score may fall
- item discrimination or the ability of a question to group students into one group that knows (and is lucky) and one group that does not know (and is unlucky).
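The statistics listed above can all be computed from a single 1/0 mark matrix. Here is a minimal sketch in Python, using a tiny hypothetical 4-student by 3-item matrix; it assumes the population form of the variance, KR-20 as the reliability estimate, and the point-biserial correlation as the discrimination index (the post does not commit to these particular formulas, so treat them as one common choice among several):

```python
import math

# Hypothetical 4-student x 3-item matrix of right (1) / wrong (0) marks.
# Items that everyone got right or wrong are assumed already removed
# (the point-biserial is undefined for them).
marks = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 0, 0],
]

n_students = len(marks)
n_items = len(marks[0])

# Statistic 1: right mark counts -> student scores (rows), item difficulties (columns).
scores = [sum(row) for row in marks]
difficulties = [sum(row[j] for row in marks) / n_students for j in range(n_items)]

# Statistic 2: average or mean score.
mean = sum(scores) / n_students

# Statistic 3: standard deviation of the scores (population form).
variance = sum((s - mean) ** 2 for s in scores) / n_students
sd = math.sqrt(variance)

# Statistic 4: test reliability, here the KR-20 estimate.
sum_pq = sum(p * (1 - p) for p in difficulties)
kr20 = (n_items / (n_items - 1)) * (1 - sum_pq / variance)

# Statistic 5: standard error of measurement.
sem = sd * math.sqrt(1 - kr20)

# Statistic 6: item discrimination, here the point-biserial correlation
# between each item's 1/0 marks and the total scores.
def point_biserial(j):
    item = [row[j] for row in marks]
    mi = sum(item) / n_students
    cov = sum((x - mi) * (s - mean) for x, s in zip(item, scores)) / n_students
    return cov / (math.sqrt(mi * (1 - mi)) * sd)

discriminations = [point_biserial(j) for j in range(n_items)]

print(mean, sd, kr20, sem)
print(discriminations)
```

With this toy matrix the student scores are 2, 1, 3, and 0, giving a mean of 1.5 and a KR-20 of 0.75; the spreadsheet engines later in the series let you vary the marks and watch these numbers move.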
If you are comfortable with traditional education statistics, you may want to skip to the first spreadsheet: Test Reliability Engine. If you are interested in the findings summary of this audit, skip to [to be posted]. If you are interested in the details as I work through this project, please read on.
Your comments will be appreciated, especially corrections of errors and omissions (corrections are easily made on a blog). I want the facts to be readily seen and understood rather than you relying on me as one more authority (“trust me”, from The Jungle Book, and any number of commercial, education, and political organizations).
Please practice with your students using Break Out (free) to learn the difference between traditional multiple-choice (TMC) and Knowledge and Judgment Scoring (KJS). The Common Core State Standards (CCSS) movement demands that passive pupils become engaged, active, self-correcting, high-quality achievers.
The student mark data from the Nursing124.ANS file contains the right marks by 22 students on 21 questions. Extreme scores and difficulties (100%) were eliminated from the 24 by 24 matrix when I was working on my audit of the Rasch model.
Statistic One: Right mark counts yield student scores (rows) and item difficulties (columns). The value of each student score mark (1 or 0) is not affected by item difficulty or by the level of thinking used in making the mark. The value of each item difficulty mark, or item score (1 or 0), is not affected by student score or student ability. A right mark is a right mark (1). “The more right marks you get, the better” is meaningful to everyone using traditional multiple-choice (TMC).
[The above remarks are prompted from my audit of the Rasch IRT model. The claim (see Number of IRT Parameters) is made that student abilities are independent from item difficulties and item difficulties are independent from student abilities using the one-parameter IRT model. I am willing to believe that theory but I have yet to see it. I do not know or understand it based only on how estimates of student ability and item difficulty are made.]
Counts are typically listed in a mark, or item score, table. Student scores are entered at the end of rows. Item difficulties are listed at the lower end of columns. This looks very clean and simple (1 and 0), especially when compared with what is actually being measured. A mark of 1 or 0 may result from many factors related to the item, to the student, or to factors indirectly related to the test environment (race, religion, parenting, etc.).
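The mark table layout just described takes only a few lines to reproduce. A sketch with a hypothetical 3-by-3 matrix: the only arithmetic is counting 1s across rows (student scores) and down columns (right marks per item):

```python
# Hypothetical 1/0 mark table: student scores at the end of each row,
# right mark counts per item at the foot of each column.
marks = [
    [1, 0, 1],
    [1, 1, 1],
    [0, 0, 1],
]

for row in marks:
    print(' '.join(str(m) for m in row), '|', sum(row))  # one student per row
print('-' * 9)
difficulties = [sum(col) for col in zip(*marks)]         # column totals
print(' '.join(str(d) for d in difficulties), ' (right marks per item)')
```

The clean 1s and 0s are exactly what makes the table deceptive: the printout carries no trace of the many factors behind each mark.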
A good analogy is a test plot of corn kernels from several ears of corn (rows) planted in several types of soil (columns). The scoring is based on the seedlings. Several factors can be scored: color; development of leaf, stem, and roots; size of plant, stem, and root; sturdiness; etc. But in education, with traditional multiple-choice (TMC), there would be but two scores: 1 for a seedling, and 0 for none. A 1 would be recorded for both a corn seedling and a weed seedling. A weed corresponds to good luck in marking a right answer. All the other factors that influence student marks are ignored.
Even in Table 2 all right answers have been replaced with a single symbol to make the chart easier to view. That symbol will become a 1 using TMC. Each wrong mark, regardless of the answer option, will become a 0.
But one factor, other than right/wrong, can be obtained directly from the answer sheets. That factor is student judgment. Student judgment is as important as knowing and doing, in moving students from lower to higher levels of thinking. The CCSS movement demands the development of student judgment.
Counting right marks is simple. However, each mark is not reporting the exact same thing. Forcing students to mark “the best answer” and counting right marks produces a quantitative score locked to a qualitative score (that is why only one score is reported using TMC, as the two scores are identical). That deficiency is easily corrected by the Rasch IRT partial credit model (PCM) or by Knowledge and Judgment Scoring (KJS).
KJS yields independent scores of quantity (1 or 0) and of quality (a score of the student’s judgment that reports what is actually known or can be done, the basis for further learning and instruction). Weeds can be differentiated from corn.
With KJS both teachers and high quality students know what is known and can be done during the test as well as afterwards. By scoring for knowledge and judgment (quantity and quality) we can reduce the weeds in the corn. We can identify and correct misconceptions. Instruction can be more effective.
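The two scores can be illustrated with a minimal sketch. The post does not spell out the KJS arithmetic, so the rule below (quantity as right marks out of all items; quality as right marks out of only the items the student chose to mark) is an assumption for demonstration, not the published KJS scoring method:

```python
# Illustrative sketch only: the quantity/quality rule here is an assumed
# simplification, not the official KJS formula.
# 'R' = right, 'W' = wrong, None = omitted (the student declined to guess).
def kjs_scores(answers):
    marked = [a for a in answers if a is not None]
    right = sum(1 for a in marked if a == 'R')
    quantity = right / len(answers)                    # how much was reported
    quality = right / len(marked) if marked else 1.0   # judgment: accuracy of what was reported
    return quantity, quality

careful = ['R', 'R', None, 'R', None]  # omits two uncertain items
guesser = ['R', 'R', 'W', 'R', 'W']    # marks everything

print(kjs_scores(careful))  # -> (0.6, 1.0)
print(kjs_scores(guesser))  # -> (0.6, 0.6)
```

The two students report the same quantity, but only the quality score separates the careful reporter from the guesser, the corn from the weeds.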
The most important thing that can be said at this point is that what you count, and how you count, determines the value of everything that follows. TMC, with right mark scoring, extracts the least information with the least value from a multiple-choice test. You get the least return for the time and money invested: a ranking.
Tradition seems to be the main reason TMC is still used. KJS and the PCM both shift the responsibility for learning and reporting from the teacher to the student. This shift is now a key element in the Common Core State Standards (CCSS) movement. It promotes the change from a classroom of passive followers to an active classroom of self-correcting high quality successful achievers. Assessing judgment may now become acceptable, and even required, when using multiple-choice tests (as it is in most other assessments).
Students like to be free to report what they trust they know and can do. But this must be experienced to be understood, appreciated, and accepted. After two tests, over 90% of my 3,000 students switched from guessing at answers on a multiple-choice test to using it to report what they trusted they knew or could do. Teachers also need to experience it before they understand (scoring judgment with multiple-choice tests is still a new professional development topic).
The CCSS movement demands doing, not just talking and listening. To make the most of this series of posts, download Break Out. (It is entirely free, open-source code.) Use it to help break out of an antiquated, failing tradition that emphasizes one right answer instead of the CCSS requirement of developing the ability and mindset to apply what is known to a range of questions or tasks.