The history of accountability, from the school level up to the state department of education, has been quite varied. Only after running this preposterous natural experiment for ten years is it being challenged in ways that may be effective either in bringing it to an end or in correcting its excesses. Congress created this absurd monstrosity by setting an impossible goal for all students to meet. It then reneged on its oversight responsibility to act in good faith to avoid many of the unintended consequences that followed (it is now five years past the time it should have acted on needed changes).
Self-regulation is a lofty idea. It has failed miserably for mortgages, for Wall Street derivatives, and in the futures market (all of which were presumably being regulated). The same can be expected of state departments of education that must come up with acceptable numbers to obtain federal funding. The two consortia (PARCC and SBAC) promoting the Common Core State Standards hold the promise of serving as checks on one another. This is an expensive and ambitious political solution that may have its own downside, depending on implementation. If all states release actual student test scores, there will be a way to determine how creative states are in setting passing rates.
The passing rate has been politically exploited in several states. New York is the prime example. Diane Ravitch posted, 21 February 2012, “Whence came this belief in the unerring, scientific objectivity of the tests? Only 18 months ago, New York tossed out its state test scores because the scores were unreliable. Someone in the state education department decided to lower the cut scores to artificially increase the number of students who reach proficient. No one was ever held responsible.”
Michael Winerip posted, 10 June 2012, “Though this may be the worst breakdown in 15 years of state testing, it does not appear that Florida politicians have any interest in figuring out who was responsible. The commissioner? Department officials? Someone at Pearson, the company that scored the writing tests?” Winerip further reports that “The audit referred to lowering the passing score to 3 as ‘equipercentile equating’”. That is, the score was lowered until the same proportion of students passed this year as passed last year. [As I am writing this, the commissioner resigned.]
As is the case with mortgages, derivatives, and futures, it is difficult, in most cases, to say whether a crime has been committed or just very poor judgment exercised until the chain of events is carefully studied, including the false “belief in unerring” test scores. My own explanation is that research results and application results are not the same thing. In research you predict acceptable results. In application you examine the results for meaningful, useful relationships (equipercentile equating to obtain the desired pass rate; any relationship between the ranking on the test and what students actually know or can do is mostly coincidental near the cut score).
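To make concrete what this kind of cut-score manipulation amounts to, here is a minimal sketch of matching this year's pass rate to last year's. The function name and the score data are my own inventions for illustration; this is not how any state or Pearson actually implemented the adjustment.

```python
# Hypothetical illustration of "equipercentile" cut-score adjustment:
# lower (or raise) the cut score until the same proportion of students
# passes this year as passed last year. All names and data are invented.

def matched_pass_rate_cut(this_year_scores, last_year_pass_rate):
    """Return the cut score that reproduces last year's pass rate."""
    ranked = sorted(this_year_scores, reverse=True)
    # Number of students who must pass to match last year's rate.
    n_pass = round(last_year_pass_rate * len(ranked))
    if n_pass == 0:
        return ranked[0] + 1  # cut above the top score; nobody passes
    return ranked[n_pass - 1]  # lowest passing score becomes the cut

scores = [12, 35, 41, 47, 52, 58, 63, 70, 77, 88]  # invented raw scores
cut = matched_pass_rate_cut(scores, 0.60)          # 60% passed last year
print(cut)  # 52: exactly 6 of the 10 students clear this cut
```

Notice that the cut score is derived entirely from the rank order of this year's scores and last year's political target; what any individual score means about student knowledge never enters the calculation.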
Has a crime been committed? In the case of New York, and other states that must now “explain” why student test scores are dropping on new tests, I would say, “Yes.” For states that choreographed an almost perfect, ever-slowing increase in the pass rate over the past 8 to 10 years, the answer is problematic. It can range from outright cheating to self-deception, from equipercentile equating to selecting test items that produce the desired results (the standard practice for classroom test score management).
On 18 May 2012, Valerie Strauss posted the white paper released by the Central Florida School Board Coalition. This lengthy paper details the unintended consequences and the downright sloppy test items used on their standardized tests. My own software, Power Up Plus (PUP), can pick such items out in a matter of minutes when run on a notebook computer. I am amazed that such items are used, considering the millions of dollars spent on the development and administration of these tests. I strongly question their development process.
Cory Doctorow posted, “The Test Item Specifications are the guidelines that are used to write the test questions. If the Science FCAT test is reviewed by the same Content Advisory Committee that reviewed the Test Item Specifications, then it probably has similar errors.” From my experience, a valid test item must assess exactly what it says (concrete level of thinking; what you see is what you get) or be an indicator of knowledge or skill of things in the same class (1 + 3 = 4 to assess addition of integers). Questions whose right answers differ by level of thinking, socio-economic status, state, religion, politics, ethnicity, or current political correctness are not to be used. That is, stick to the topic, not to what the topic (agenda) or skill may be used for. Where a question is on topic but has different answers related to the above, it should stand; this is part of the broadening effect of education. A recent example is the Missouri constitutional amendment, voted on yesterday, to protect the religious rights of school children.
In the case of Florida, once again, faulty predictions were made based on some type of research. The entire system (instruction, learning, and assessment) was not fully understood or coordinated, with disastrous results. And again, another state education official has resigned. Was this a crime, or just a waste of millions of dollars and millions of instructional and learning hours?
On 24 April 2012, Valerie Strauss posted the National Resolution Protesting High-Stakes Standardized Testing, which is based on the Texas Resolution Concerning High Stakes, Standardized Testing of Texas Public School Students. These two resolutions, combined with the Florida white paper, make a strong political protest that may take years to obtain results. The desired results are not specifically stated. We are back to the days of “alternative assessment”: do something different; but doing something different must be at least as good as what we have, or we lose again, as with the authentic assessment and portfolio movements.
As well intentioned as all the people working on this assessment problem are, most are still riding their safe, steady tricycle: the traditional forced-choice multiple-choice test, scored at the lowest levels of thinking, that they were exposed to long before it was used for NCLB testing. It actually worked fairly well back then, when the average test score was 75% or higher. It has failed miserably when NCLB cut scores dropped below 50% (a region where luck of the day determines the rank of pass or fail). Only when these people are willing to get off their old tricycles will they have any interest in getting on a bicycle (where students can actually report what they know and can do).
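A back-of-the-envelope calculation (my own illustration, not from any of the posts above) shows why luck dominates near a sub-50% cut score. Assume a student marks the items they truly know and guesses blindly on the rest of a four-option forced-choice test; the exact binomial probability of clearing the cut is then easy to compute. The item counts below are invented.

```python
# Why a forced-choice test with a cut score below 50% becomes a lottery:
# exact probability that known answers plus lucky guesses clear the cut.
# Scenario and numbers are assumptions for illustration only.

from math import comb

def pass_probability(items, known, cut, options=4):
    """P(known answers + lucky guesses >= cut) with blind guessing."""
    guesses = items - known
    need = max(0, cut - known)          # right guesses still required
    p = 1 / options                      # chance of a lucky guess
    return sum(comb(guesses, k) * p**k * (1 - p)**(guesses - k)
               for k in range(need, guesses + 1))

# 60 items, cut score 30 (50%): two students, neither of whom knows
# half the material, yet both have a real chance of "passing".
print(round(pass_probability(60, 24, 30), 2))  # knows 40% of the items
print(round(pass_probability(60, 18, 30), 2))  # knows 30% of the items
```

Under these assumptions the student who knows only 40% of the material passes roughly nine times out of ten, and the student who knows 30% passes about a third of the time: near the cut score, the pass/fail rank is mostly a report on the luck of the day.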
Knowledge and Judgment Scoring allows students to individualize standardized tests, to select the level of thinking they will use: to guess at right answers, or to use the questions to accurately report what they actually know and can do. We only need to change the test instructions from “Mark an answer on each question, even if you must guess” to “Mark an answer only if you can use the question to report something you trust you know or can do.” Change the scoring from “two points for each right mark and zero for each wrong mark” to “zero for each wrong mark, one point for each omit (good judgment not to guess and make a wrong mark), and two points for each right mark (good judgment and a right answer)”.
We can now give students the same freedom given with an essay, project, or report to tell us what they actually know and can do. We also have the option to commend students, as on other alternative tests: “You did a great job on the questions you marked. Your quality score of 90% is outstanding.” This quality score is independent of the quantity score. You can now honestly encourage traditionally low-scoring students for what they can do rather than berate them for what they cannot do (or for their bad luck on traditional forced-choice tests).
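The scoring change described above can be sketched in a few lines. This is my own minimal illustration, not PUP's actual implementation; the response encoding ('R' for right, 'W' for wrong, None for omit) is an assumption.

```python
# A minimal sketch of Knowledge and Judgment Scoring as described above:
# 2 points per right mark, 1 point per omit, 0 per wrong mark. Not PUP's
# code; the response encoding ('R', 'W', None) is my own assumption.

def kj_score(responses):
    """Return (quantity %, quality %) for a list of 'R', 'W', or None."""
    points = sum(2 if r == 'R' else (1 if r is None else 0)
                 for r in responses)
    quantity = 100 * points / (2 * len(responses))
    marked = [r for r in responses if r is not None]
    # Quality: percent right among the questions the student chose to mark.
    quality = 100 * marked.count('R') / len(marked) if marked else 100.0
    return quantity, quality

# A student marks 10 of 20 questions (9 right, 1 wrong) and omits 10.
quantity, quality = kj_score(['R'] * 9 + ['W'] + [None] * 10)
print(quantity, quality)  # 70.0 90.0: modest quantity, high quality
```

The two numbers separate cleanly: the quantity score reflects how much the student reported, while the 90% quality score reflects how well the student judged what to report.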
The NCLB monster may be controlled by legal action (see prior post), political action, or by just changing its temper from a frightening bully to an almost friendly companion. Just select your breed: Knowledge and Judgment Scoring in PUP, Partial Credit Model in Winsteps, or Amplifier by Knowledge Factor.
Why continue just counting right marks, making unprepared liars out of lucky winners and misclassifying unlucky losers for remediation? We know better now. It no longer has to be that way. “Whence came this belief in the unerring, scientific objectivity of the [dumb, forced-response, guess] tests?” We need to measure what is important. We do not need to make an easy measurement (forced multiple-choice and essay at the lowest levels of thinking) and then try to make the results important (at higher levels of thinking). There is a difference in student performance between riding a tricycle and riding a bicycle. We cannot hold students responsible for bicycling if they only practice and test on tricycles.