Wednesday, November 28, 2012

A Balanced Common Core State Standards Assessment


It is time for psychometricians, teachers, and students to get on the same track with the same unit of measurement (not motorcycles, bicycles, and tricycles). Psychometricians have been top dog, feared, secretive and their judgment unquestioned. Teachers have worked hard, but to my current knowledge, only in a case like Nebraska has their judgment made a meaningful improvement in test results. Students have been treated as inanimate commercial commodities.

Optimum test results can only be obtained when the playing field is leveled for all three stakeholders. It is currently optimized from the view of psychometricians who have been strongly influenced, at times, by political power, and more often silenced by golden handcuffs. The “anomalies” that have become public and then retracted (more than once in Florida) show us the fruits of one-stakeholder rule in student performance assessment.

And now we have the Common Core State Standards tests. Students would like an honest, accurate and fair test. Teachers and students would like to know what each student knows and can do and what each one has yet to learn. Psychometricians would like highly reproducible test results, which do not require (present the opportunity for) equating test results (exposing error in selecting test items of equal difficulty) from year to year, but do present the appearance of equal difficulty.

And then we have the secondary level stakeholders who demand (and who fund with millions of dollars) that the test results be only in the form of a ranking that shows improvement each year. They also want to do this at the lowest cost. To date the secondary level stakeholders have held the field.

Why things are as they are is then not too difficult to understand if you ignore the marketing that often overstates what is actually being done. Assessments carried out as forced activities cannot produce a valid indicator of what students actually know and can do. Such tests can produce a valid statistical ranking for satisfying a state or federal law. And that is why and how the tests have been funded.

The Common Core State Standards movement suggests that the judgment of all three primary stakeholders is included and respected. No one party is to triumph over or manipulate the other two parties. This demands some changes in the way they interact.

Students should be given the option of exercising their judgment in responding to test elements. This is inherent in classroom folders. It is also present when students have the option to respond to 5 essay items out of 7 to 10 suggested on a test. And in the alternative form of multiple-choice (quantity and quality scoring), students select the questions on which to report what, in their judgment, they trust they know or can do.

Teachers should be given the option of exercising their judgment in writing test items that provide insight into what students are learning from what they are teaching. This includes both subject matter and skills, and student development. Teachers should be able to report, based on their judgment, which group each student best fits (below, meets, or exceeds standards), as in Nebraska. Taken together, these inputs capture in numbers the climate of the classroom.

Psychometricians must respect the needs of the other two stakeholders. The oversimplification of data collection and data reduction to obtain the highest possible (but questionable) test reliability needs to become a part of the history of a natural experiment (NCLB) that has gone on too long. What works nicely in the safety of the research laboratory cannot be directly applied to individual student performances and obtain meaningful results (other than a ranking).

IMHO the Common Core State Standards movement demands the inclusion of more of the classroom climate (instruction, learning, feedback) than forced student test performances yield. The student must be given the option to report what is meaningful, useful and empowering. The mechanics are simple for the student: know and don’t know; can or can’t do. Mark an option, select a question, or perform a task when in your mind you can trust what you are doing (and that this can be used as the basis for further learning and instruction).

Students want to succeed. Teachers want them to succeed. Psychometricians need to capture what students and teachers have accomplished by letting students report knowledge, skills, and judgment. Quantity and quality scoring captures all three. Forced performances capture only part of knowledge and skills.

This has been a long introduction to three charts that summarize the psychometrician’s view of a standardized test. The first view is the result of oversimplifying the classroom environment. Only right marks are counted on multiple-choice tests, or right stuff (generally restricted to rubrics) is counted on other forms of assessment. A raw score distribution is divided into three to five parts with cut scores. This is purely a statistical concept that works with any sample of anything. Once you have it in hand, the next job is to ascribe meaning to it based on each psychometrician’s judgment. The data from Alaska indicate that about 1/4 of the time students of equal abilities switch categories from year to year. This is a sizable measurement error related to right mark scoring.

The second view includes teacher judgment (see Nebraska posts). The single distribution is now teased apart into three. The average test score is no longer 50% but near 70%. The three score regions (below, meets, and exceeds standards) now have meaning based on teacher judgment (standard deviation of 20%, for example).
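
To make the cut-score idea concrete, here is a minimal sketch in Python. The cut values (60 and 85) and the function name are made up for illustration; they are not taken from Nebraska, Alaska, or any actual testing program. It only shows the mechanics: a score falls into whichever region its cut scores bracket.

```python
from bisect import bisect_right

# Hypothetical cut scores in percent (assumption for illustration only).
CUTS = (60, 85)
LABELS = ("below", "meets", "exceeds")

def classify(score, cuts=CUTS, labels=LABELS):
    """Return the label of the score region that contains `score`.

    A score equal to a cut score is placed in the higher region.
    """
    return labels[bisect_right(cuts, score)]

# A student scoring just under a cut one year and just over it the next
# changes category even though the two scores are nearly the same.
year_one = classify(59)
year_two = classify(61)
```

With these made-up cuts, a two-point change from 59 to 61 moves a student from "below" to "meets", which is the kind of category switching described in the Alaska data.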


The third view includes student judgment to report what is actually known and can be done that is the trusted basis for further learning and instruction. This is what the Common Core State Standards movement states is now needed. This chart is speculative. I have no actual data for it. I do know from working with over 3000 students that the portion of a test score distribution below 50% almost vanishes with quantity and quality scoring. Also the variation (the standard deviation) is lower, giving better separation of students grouped by performance (standard deviation of 10%, for example).


The psychometrician’s view is simple, cheap and often illusory. The teacher’s view becomes more meaningful. The student’s view completes a balanced assessment system.

In summary, the Common Core State Standards movement now demands far better test scoring and analysis than was used in the past. In the case of multiple-choice tests, the switch from right count scoring to quantity and quality scoring only involves a change in test instructions that permits each student to elect which method should be used to score the test (see prior posts). The test then yields results that students, teachers and psychometricians can, all together, agree look right.
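
The difference between the two scores can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the scoring used by Winsteps or Power Up Plus: it assumes the quantity score is right marks over all questions, the quality score is right marks over the questions the student chose to mark, and an omitted question is recorded as `None`. The function name and the sample answer key are hypothetical.

```python
def quantity_and_quality(responses, key):
    """Score one answer sheet two ways.

    responses: the student's marks; None means the question was omitted
               (the student elected not to report on it).
    key:       the correct options, in the same order.
    Returns (quantity_pct, quality_pct). Quality is None if nothing was marked.
    Assumption: quantity = right / total, quality = right / marked.
    """
    if len(responses) != len(key):
        raise ValueError("responses and key must have the same length")
    marked = [(r, k) for r, k in zip(responses, key) if r is not None]
    right = sum(1 for r, k in marked if r == k)
    quantity = 100 * right / len(key)
    quality = 100 * right / len(marked) if marked else None
    return quantity, quality

# Hypothetical 10-question test: the student marks five questions,
# gets four of them right, and omits the other five.
key = list("ABCDABCDAB")
responses = ["A", "B", "C", "D", "B", None, None, None, None, None]
quantity, quality = quantity_and_quality(responses, key)
```

In this made-up case the quantity score is 40% while the quality score is 80%: the student knows less than half of the test, but can be trusted on what was reported.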

Software to do this has been in existence for over two decades. Winsteps (partial credit Rasch model IRT) and Power Up Plus (Knowledge and Judgment Scoring) are two examples. Winsteps has been a popular program for state departments of education during the NCLB decade (they only need to change test instructions to assess student judgment).

Power Up Plus (PUP) is a classroom friendly program developed to provide students a means to frequently report accurately, honestly, and fairly what they actually knew and could do that was of value to themselves. They used the test results to guide further learning. I used the test results to guide my instruction and their development (passive pupil to self-correcting high achiever).

What all of this comes down to is an inversion of the present hierarchy:
  1. Let students have the opportunity to earn a quality score of 80-90% regardless of the quantity score. Let students report what they really know and can do.
  2. Let teachers submit questions that have been shown in the classroom to meaningfully group students by their understanding, ability, skill, and development. These are questions that measure something important: mastery, misconceptions, reasoning errors, etc. Also let teachers estimate student test performance (below, meets, and exceeds standards) as a part of each standardized test.
  3. Let psychometricians do their best with counts that are based on real students and classrooms rather than conducting an academic game show. The current statistical concept for ranking students is IMHO an even less perfect match to the Common Core State Standards movement than to the NCLB standards.
This is one way to produce a balanced assessment system. The standardized test items grow from all learning experiences. Students are free to make an accurate, honest, and fair report. Psychometricians are free to moderate a meaningful assessment process.


Please encourage Nebraska to allow students to report what they trust they know and what they trust they have yet to learn. Blog. Petition. We need to foster innovation wherever it may take hold.
