The tools psychometricians favor are most sensitive when a question divides the class into two equal groups of right and wrong. This situation only exists when scoring traditional multiple-choice (TMC) at one point in a normal score distribution: at an item difficulty of 50%.
The invention of item response theory (IRT) made it possible to extend this situation (half right and half wrong) to the full range of item difficulties. IRT also allows expressing item difficulty and student ability on the same logit (log odds) scale.
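The logit relationship can be sketched with the one-parameter (Rasch) IRT model. This is only an illustration; the function name and values are mine, not from any testing vendor:

```python
import math

def rasch_probability(ability, difficulty):
    """Probability of a right response under the Rasch (1PL) IRT model.

    Ability and difficulty sit on the same logit (log-odds) scale;
    when they are equal, the probability is exactly 1/2.
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# A student whose ability equals the item's difficulty
# answers correctly half of the time.
print(rasch_probability(0.0, 0.0))        # 0.5
print(rasch_probability(1.0, 0.0) > 0.5)  # an easier item for this student
```

Putting both quantities on one scale is what lets a test select the item whose difficulty sits right at a student's estimated ability.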
IRT-calibrated items make computer adaptive testing (CAT) possible. Items are banked by estimated difficulty and matched to the estimated student ability at which a right response occurs about 1/2 of the time.
Typically, students must select one of the given options. Omit, or “I have yet to learn this”, is not an included option. The failure to include student judgment is a legacy from TMC (see previous posts).
Traditional CAT is therefore limited to ranking examinees. It is a very efficient way to determine if a student meets expectations based on a group of similar students. It is the solitary academic version of Family Feud.
The game is simple. Answer the first question. If right, you will be given a bit more difficult question. If wrong, you will be given a bit less difficult question.
If you are consistently right, you finish the test with a minimum of questions. The same can be said for being consistently wrong.
In between, the computer seeks the level of question that you get right half of the time. If an adequate number of responses falls within an acceptable range, you pass and the test ends. Otherwise the test continues until a time limit or item count is reached, and you fail.
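The game described above can be sketched as a minimal adaptive loop. This is a toy illustration of the up-or-down rule, not any vendor's actual CAT algorithm; the step size is an assumption:

```python
def adaptive_step(difficulty, answered_right, step=0.5):
    """After each response, move the next item's difficulty up or down.

    A right answer earns a slightly harder item; a wrong answer earns
    a slightly easier one (step is an assumed amount, in logits).
    """
    return difficulty + step if answered_right else difficulty - step

# A consistently right examinee climbs the difficulty scale quickly.
d = 0.0
for _ in range(4):
    d = adaptive_step(d, answered_right=True)
print(d)  # 2.0
```

Real systems estimate ability statistically rather than stepping by a fixed amount, but the homing-in behavior is the same: the difficulty settles where the examinee is right about half the time.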
If doing paper tests for NCLB was considered the biggest bully in the school, CAT increases the pressure. You must answer each question as it is presented.
You are not permitted to report what you know. You are only given items that you can mark right about 1/2 of the time. You are in a world far different from a normal classroom. It is more like competing in the Olympics.
You are now CAT food. Originality, innovation, and creativity are not to be found here. Your goal is to feed the CAT the answer your peer group selected for you as the right answer 1/2 of the time (that is right, they did not know the right answer 1/2 of the time either).
Playing the game at the 1/2 right level is not reporting what you trust you know or can do. It is playing with rules set up to maximize the desired results of psychometricians. Your evaluation of what you know does not count.
Your performance on the test is not an indication of what you trust you know and can do, but it is generally marketed as such. This is not a unique regulatory situation.
Sheila Bair, Chairman of the Federal Deposit Insurance Corporation, 2006-2011, described the situation in NCLB in terms of bank regulators, “They confuse their public policy obligations with whether the bank is healthy and making money or not.” (Charlie Rose, Wed 10/31/2012 11:00pm, Public Broadcasting System)
Psychometricians confuse their public obligation to measure what students know and can do with their concern for item discrimination and test reliability. This has perpetuated TMC, OMC, and CAT using forced-choice tests. The emphasis has been on test performance rather than on student performance.
[Local and state school administrators further modify the test scores to produce an even more favorable end result, expressed as percent improvement and percent increase by level of performance, and at the same time they suppress the actual test scores. Just like big bankers gambling with derivatives!]
IRT bases item calibration on a set of student raw scores. Items are then selected to produce an operational test of expected performance, from which expected student scores can be mapped. These expectations generally fail. Corrections are then needed to equate the average difficulty of tests from one year to the next.
The Nebraska and Alaska data show that the exact location of individual student ability is also quite blurred. An attempt to extract individual growth (2008) therefore understandably failed on a paper test, but showed promise using CAT.
CAT is now (2010) being promoted as a better way than using paper tests to assess individual growth far from the passing cut score. [Psychometricians have traditionally worked with group averages, not with individuals.]
Forced-choice CAT, at the 1/2 right difficulty level, is the most vicious form of naked multiple-choice. Knowledge Factor uses an even higher standard, but clothes its items in an effective instructional system. Also, all of its items assess student judgment.
The claims that CAT can pinpoint exactly what a student knows and does not know are clearly false. CAT can only rank a student with respect to a presumably comparable group.
To actually know what a student knows or can do, you must devise a way for the student to tell you. There is a proliferation of ways to do this, most of which require subjective scoring. Most are compatible with the classroom.
My favorite method was to visit with (listen to) a student answering questions on a terminal. It is only when fully engaged students share their thinking that you can observe and understand their complete performance. This practice may soon be computerized and even made interactive given the current development of voice recognition.
Judgment multiple-choice (JMC) allows an entire class of students to tell you what they trust they know and can do without subjective scoring. JMC can be added to CAT. This would produce a standardized accurate, honest, and fair test compatible with the classroom.