Wednesday, August 24, 2011

Standardized Testing - Structure, Function, and Operation

The structure, function and operation of standardized testing must all be considered when evaluating the usefulness of test results. Standardized test results are not always what they are claimed to be. When mixed with politics, they usually have even less value, as will be discussed near the end of this post.

Standardized testing involves test score distributions (statistical tea leaves). Their two most easily recognized characteristics are the average score, or mean, and the spread of the distribution, or standard deviation (SD).
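As a quick illustration, these two characteristics fall out of a score distribution with a few lines of arithmetic (a minimal Python sketch with made-up right counts):

```python
# A minimal sketch: the mean and standard deviation of a
# hypothetical score distribution (right counts on a 25-item test).
scores = [12, 15, 17, 18, 18, 19, 20, 21, 22, 25]

mean = sum(scores) / len(scores)
sd = (sum((s - mean) ** 2 for s in scores) / (len(scores) - 1)) ** 0.5

print(f"mean = {mean:.2f}, SD = {sd:.2f}")
```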

Two methods of obtaining score distributions are now in use. The traditional method, counting right marks on a multiple-choice test, is the same as used on most classroom tests. The Rasch model method, used by many state education departments, converts test results to estimated measures of student ability and of item difficulty.
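The core of the Rasch model is a single formula: the odds of a right answer depend only on the difference between student ability and item difficulty, both expressed in logits. A minimal sketch of that formula (not Winsteps itself, just the dichotomous model it estimates):

```python
import math

def p_right(ability, difficulty):
    """Rasch probability that a student of the given ability (logits)
    marks a right answer on an item of the given difficulty (logits)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# A student one logit above an item's difficulty succeeds about 73% of the time.
print(round(p_right(1.0, 0.0), 2))  # 0.73
```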

The value of multiple-choice test results depends upon how the test is administered. Both methods allow for two modes: forced choice, in which students must mark an answer to every item, and student choice, in which students answer only the items they trust, reporting what is useable as the basis for further learning and instruction.

The above four combinations map onto two software programs, the fixed, reproducible structures that produce score distributions: Power Up Plus (PUP) for classroom use and Winsteps for standardized testing. In PUP, forced choice is scored as Right Mark Scoring (RMS) and student choice as Knowledge and Judgment Scoring (KJS); in Winsteps, forced choice is scored with the dichotomous Rasch model and student choice with the partial credit Rasch model.

Three of the four modes produce traditional right count quantitative score distributions: Quantity Scores. KJS adds a quality score that is comparable to the full credit mode measure distribution.
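The difference between a quantity score and a quality score can be sketched in a few lines. This is only an illustration of the idea, not PUP's exact formula: quantity counts right marks against all items; quality counts right marks against only the items the student chose to answer:

```python
def quantity_and_quality(marks):
    """marks: one entry per item, 'right', 'wrong', or 'omit'.
    Quantity score: percent right of all items (the traditional count).
    Quality score: percent right of the items the student chose to answer
    (a sketch of the idea behind KJS, not PUP's exact formula)."""
    answered = [m for m in marks if m != 'omit']
    right = marks.count('right')
    quantity = 100 * right / len(marks)
    quality = 100 * right / len(answered) if answered else 0.0
    return quantity, quality

# A student who answers 15 of 20 items and gets 14 right:
marks = ['right'] * 14 + ['wrong'] * 1 + ['omit'] * 5
print(quantity_and_quality(marks))  # (70.0, 93.33...)
```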


The distribution of scores from a traditional multiple-choice test can be a good indicator of classroom performance (teacher and student). On a standardized test, only counting right marks places as much emphasis on test performance as on student performance, if not more. Items are carefully selected to produce a predicted score distribution, which is expected to match some subjectively set standard (cut score) such as grade level or job readiness. But how the test is administered changes the value and meaning of key functions: forced choice and student choice produce two different views of the same students.


For many historical reasons, including tradition and short-term accountability, NCLB has used the forced choice mode, which only assesses and promotes the lowest levels of thinking. It is fast, cheap, and ineffective. Testing, and unfortunately as a result teaching, limited to the lowest levels of thinking becomes more counterproductive the longer students are exposed to it. This may be an underlying factor in the poor showing made by high school students, in general, in relation to the lower grades (the spread between the level of thinking required and that which seniors possess may contribute to the current emphasis on senior attitude).

When students are allowed to report what they trust as a basis for further learning and instruction, a wealth of information becomes available for counseling to direct student development. PUP allows students to switch from forced choice to reporting what they know when they are comfortable doing so. Knowledge Factor offers a patented instructional/assessment system that guarantees mastery learners. Development to use all levels of thinking is critical to success in school and in the workplace.


Many ways of operating standardized testing have been used in assessing students for NCLB. Multiple-choice was derided at first and then returned as the primary method. Almost everything that is not assessed by actual performance can be usefully measured with multiple-choice (A, B, C, D and omit). Traditional multiple-choice was crippled by dropping the option to omit (don’t know) early on. Just counting right marks was easier and gave a useable ranking for grading. How the rank relates to what a student knows or can do is still an open debate. Knowledge and Judgment Scoring settles this matter with a quality score.

A test maker (teacher or standardized item author) has all of the above structure and function options to consider when creating an operational test. The value of the final test results depends upon how the options are mixed and handled (a simple ranking or an assessment of what is known and can be done along with the judgment to use it well).

Test banking can be very simple. It can be a list of 25 questions that is edited each semester. The test is then scored by any one of the above four modes; the choice depends upon the use of the results. RMS ranks students and permits comparing your success from year to year. KJS and the partial credit Rasch model explore which students are still lingerers, which are followers, and which are self-directed learners. The quality score can point out what each student knows or can do as the basis for further learning and instruction, regardless of the test score.

Test banking can also be very complicated, time-consuming, and expensive. Winsteps appears to be about the least complicated, least time-consuming, and least expensive way to run standardized testing; it has been used by many states.

A test bank is created from items that have been calibrated by Winsteps. A high scoring sample will produce items with low difficulty; a low scoring sample will produce items with high difficulty. Equating, using a set of common items, can bring these calibrations onto one scale if the two samples are believed to be from the same population. Winsteps does not do this on its own; when and how to equate requires an operational decision.
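One common scheme is mean-mean equating: shift the new calibrations by the mean difference observed on the common items. The sketch below illustrates the idea with made-up logit values; it is not Winsteps' own anchoring procedure:

```python
def mean_mean_equate(new_calibrations, bank_calibrations, common_ids):
    """Shift newly calibrated item difficulties onto the bank's scale
    using the mean difference on the common items (mean-mean equating).
    Difficulties are in logits; dicts map item id -> difficulty."""
    shift = sum(bank_calibrations[i] - new_calibrations[i]
                for i in common_ids) / len(common_ids)
    return {item: d + shift for item, d in new_calibrations.items()}

# The common items calibrated 0.5 logits easier in a higher scoring
# sample, so every new calibration is shifted up by 0.5 logits.
bank = {'c1': 0.0, 'c2': 1.0}
new = {'c1': -0.5, 'c2': 0.5, 'n1': -1.2, 'n2': 0.8}
print(mean_mean_equate(new, bank, ['c1', 'c2']))
# {'c1': 0.0, 'c2': 1.0, 'n1': -0.7, 'n2': 1.3}
```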


However the operations are carried out, human intervention is needed at the start and at about every other step thereafter. Standardized testing is still a mix of art, science, and politics.

A benchmark test is selected from the test bank. A range of item difficulties is selected to match the population to be assessed, and a small common item set is included. The mean and standard deviation of the predicted distribution are calculated. Time and money permitting, the benchmark test is administered one or more times, so that a known mean and standard deviation are in hand for the distribution. This ends the research phase.
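One way to calculate a predicted mean and standard deviation, given calibrated item difficulties and an assumed ability distribution, is to simulate the test. The sketch below uses hypothetical items and an assumed normal ability distribution; actual operational procedures vary:

```python
import math, random

random.seed(1)

def p_right(ability, difficulty):
    """Rasch probability of a right answer (logits)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def predict_distribution(difficulties, n_students=10000,
                         ability_mean=0.0, ability_sd=1.0):
    """Simulate raw scores for an assumed normal ability distribution
    and return the predicted mean and SD of the score distribution."""
    scores = []
    for _ in range(n_students):
        b = random.gauss(ability_mean, ability_sd)
        scores.append(sum(random.random() < p_right(b, d) for d in difficulties))
    mean = sum(scores) / len(scores)
    sd = (sum((s - mean) ** 2 for s in scores) / (len(scores) - 1)) ** 0.5
    return mean, sd

# 25 hypothetical items spread evenly from -2 to +2 logits:
items = [-2 + 4 * i / 24 for i in range(25)]
print(predict_distribution(items))
```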

An application test is administered to the full population: every Algebra I student in the state, for example. This operational test also contains a set of common items used in creating the benchmark test. Winsteps scores the application test.

Resolution of the test results is not the same as equating items for a test bank. Winsteps can be used here in the same manner as in test banking, but the environment is now very different. A pre-application public declaration of cut scores is no longer recommended due to newly found (Feb 2011) sources of score instability. If the operational test has not performed as expected, the needed adjustment can favor the desired result for the average score, the cut score, the scaled score, the percent passing, or the percent improvement. Public exposure of average scores has been requested by the Center on Education Policy (CEP) in its open letter to the member states of PARCC and SBAC, May 3, 2011. Everyone could then know the starting point for whatever resolution adjustments are made. This would help reestablish public trust and increase the value of test results.

Test banking data can be liberally culled to obtain the best fit of data to the Rasch model because of the unique properties of the model. That same liberal attitude is, in my opinion, not justified when manipulating the operational test results.

The final step for Winsteps is the conversion of measures to expected raw scores. When the test results are not manipulated, the conversion is only a matter of changing logit measures back into raw score units; no human judgment is required. A normal bell curve distribution is again created.
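The working rule here is the test characteristic curve: the expected raw score for a measure is the sum of the Rasch success probabilities over all items on the test. A sketch with hypothetical calibrations:

```python
import math

def expected_raw_score(measure, difficulties):
    """Convert a logit measure to an expected raw score by summing the
    Rasch probability of success over every item on the test
    (the test characteristic curve)."""
    return sum(1.0 / (1.0 + math.exp(-(measure - d))) for d in difficulties)

items = [-1.5, -0.5, 0.0, 0.5, 1.5]  # hypothetical calibrated difficulties
for m in (-2.0, 0.0, 2.0):
    print(m, round(expected_raw_score(m, items), 2))
```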

This brings to an end this series of posts related to the high jinks exposed in several state education departments. Over the past few years several states, including Texas and Illinois, have displayed marked deficiencies in their short-term competition for federal money and adequate yearly progress (AYP); this was part of the motivation for this year-long investigation into Rasch model IRT test analysis. During this last year New York presented the worst example I know of. In my opinion the recent cheating scandals in Georgia will have done less damage to students, teachers, and schools than the manipulation of New York state test results by state officials.

Arkansas, on the other hand, has posted almost perfect examples of AYP on NCLB tests over a ten-year period: 2001-2011 End-of-course Comparison.

(The percent combined proficient and advanced is a derived value. Average test scores, and related cut scores, are based directly upon student marks on the test.)

This demonstrates exceptional skill in managing test performance. Such a performance therefore invites the suspicion that the test has become standardized more on test performance (the test score) than on student performance (what students know and can do). Were that true, it would make Arkansas a good case of successful, well-intentioned self-deception, created by instruction (curriculum), learning (level of thinking), and assessment (test items) being optimized for NCLB test results. These doubts are probably not valid given the awards won and the leadership demonstrated by Arkansas. Comparison with NAEP also shows that two different views of the same students can vary a great deal; both views may be validated with sufficient student performance information to clarify what each test is testing. Arkansas has also equated classroom and state test scores as part of its management of grade inflation (again, two views of the same students).

Replacing the national academic lottery conducted with right-count scored tests with tests that actually assess what students know and can do, as the basis for further learning and instruction, is one way of clarifying this situation (Knowledge and Judgment Scoring and the partial credit Rasch model, for example). The same tests now used for ranking can also be used when upgrading classroom testing (to assess both quantity and quality) to better prepare students for whatever forms of questions are used on the new NCLB tests. The result is a great increase in useful information for students and teachers to direct classroom assignments and activities at all levels of thinking. Or replace the classroom with a complete instruction/assessment package like Amplifire.

The spread of certified competency-based learning may help bring about the needed change in assessment methods. A test must measure what it claims to be measuring, and the results must not be subject to a variety of secretive factors that only delay the inevitable full disclosure. “You can fool part of the people (including yourself) part of the time, but not all of the people all of the time.” The software packages are honest; it is how they are used that is open to question.
