Wednesday, October 31, 2012

An Assessment Worthy of the Common Core State Standards

The Common Core State Standards go beyond just knowing, believing, and guessing. They demand an assessment that includes the judgment of psychometricians, teachers, and students. For the past decade, psychometricians have dominated, making judgments from statistical information alone. The judgment of teachers was given equal weight in 2009 in Nebraska (see prior post).

The power of student judgment needs to be discussed, along with a way of adding students as the third primary stakeholder in standardized testing. Currently the old alternative and authentic assessment movements are being resurrected into elaborate, time-consuming exercises. The purpose is to allow students to display their judgment in obtaining information, in processing it, and in making an acceptable (creative and innovative) report.

Traditional multiple-choice scoring, which only counts right marks, is correctly not included. Students have no option other than to mark. A good example is a test administered to a class of 20 students marking four-option questions (A, B, C, and D). On one question, five students mark each option. That question has 5 right out of 20 students, or a difficulty of 25%. There is no way to know what these students know. A marking pattern with an equal number of marks on each answer option indicates they were marking because they were forced to guess. They could not use the question to report what they actually trusted they knew. Student judgment is given no value in traditional right-count scored multiple-choice testing.

The opposite situation exists when multiple-choice is scored for quantity and quality. Student judgment has a powerful effect on an item analysis by producing more meaningful information from the same test questions. Student judgment is given equal weight to knowing by Winsteps (partial credit Rasch model IRT, the software many states use in their standardized testing programs) and by Power Up Plus (Knowledge and Judgment Scoring, a classroom oriented program). Scoring now includes A, B, C, D, and omit.

Continuing with the above example, eight different mark patterns related to student judgment are obtained, rather than the two obtained from traditional multiple-choice scoring. The first would be to again have the same number of marks and omits (4 right, 4 wrong, 4 wrong, 4 wrong marks, and 4 omits). This again looks like a record of student luck on test day. I have rarely seen such a pattern in over 100 tests and 3,000 students. Experienced students know to omit for one point rather than to guess and get zero points when they cannot trust using a question to report what they actually know or can do.
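The arithmetic behind that choice is easy to sketch. Assuming a point scheme in which a right mark earns 2 points, an omit earns 1, and a wrong mark earns 0 (these weights are my illustration, not necessarily those of any particular program), a blind guess on a four-option item is worth less, on average, than an omit:

```python
# Expected points from a blind guess vs. an omit on one four-option item.
# Assumed point scheme (for illustration only): right = 2, omit = 1, wrong = 0.
RIGHT, OMIT, WRONG = 2, 1, 0
OPTIONS = 4

expected_guess = (1 / OPTIONS) * RIGHT + (1 - 1 / OPTIONS) * WRONG
expected_omit = OMIT

print(expected_guess)  # 0.5 points per blind guess, on average
print(expected_omit)   # 1 point per omit, guaranteed
```

Under these assumed weights, guessing every unknown item is a losing strategy relative to omitting, which is exactly the behavior experienced students adopt.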

The next set of three patterns omits one of the wrong options (4 right, 4 wrong, 4 wrong, and 8 omits). Students know that one option is not right, but they cannot distinguish between the other two wrong options (B & C, B & D, and C & D). By omitting, they have uncovered this information, which is hidden in traditional test scoring where only right marks are counted.

In the second set of three patterns, students know that two options are not right and can distinguish between the remaining right and wrong options. Instead of a meaningless distribution of marks across the four options, we now know which wrong option students believe to be a right answer (B or C or D). [Both student judgment and item difficulty are at 50%, as they have equal value.]

The last answer pattern occurs when students either mark a right answer or omit. There is no question that they know the right answer when using the test to report what they trust they know or can do.

In summary, quantity and quality scoring allows students of all abilities to report and receive credit for what they know and can do, and also for their judgment in using their knowledge and skill. The resulting item analysis then specifically shows which wrong options are active. Inactive wrong options are not buried under a random distribution of marks produced by forced-choice scoring.

All four sets of mark patterns contain the same count of four right marks (any one of the options could be the right answer). Both scoring methods produce the same quality score (student judgment) when all items are marked (25%). When student judgment comes into play, however, the four sets of mark patterns require different levels of student judgment (25%, 33%, 50% and 100%).

Right count scoring item difficulty is obtained by adding up the right (or wrong) marks (5 out of 20 or 25%). Quantity and quality scoring item difficulty is obtained by combining student knowledge (right counts, quantity) and student judgment (quality). Both Winsteps and Power Up Plus (PUP) give knowledge and judgment equal value. The four sets of mark patterns then indicate item difficulties of 30%, 40%, 50% and 60%.
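The figures above can be checked with a few lines of arithmetic. Here the quality score is taken as right marks over marks made, and item difficulty as right marks plus half credit for each omit, over all students; this formulation is my assumption, chosen because it reproduces the values quoted in the text:

```python
# (right, wrong, omit) counts for the four pattern sets, 20 students each.
patterns = [(4, 12, 4), (4, 8, 8), (4, 4, 12), (4, 0, 16)]

for right, wrong, omit in patterns:
    students = right + wrong + omit
    quality = right / (right + wrong)             # judgment: right marks / marks made
    difficulty = (right + 0.5 * omit) / students  # knowledge and judgment, equal weight
    print(f"quality {quality:.0%}  difficulty {difficulty:.0%}")
```

This yields quality scores of 25%, 33%, 50%, and 100%, and item difficulties of 30%, 40%, 50%, and 60%, matching the four sets of mark patterns.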

[Abler students always make questions look easier. Measuring student quality makes questions look easier than just counting right marks and ignoring student judgment. With Rasch model IRT using Winsteps, knowledge and judgment are combined into one term: the location on a logit scale, the natural log of the ratio of right to wrong marks for person ability (and of wrong to right marks for item difficulty). The normal scale of 0 to 50% to 100% is replaced with a logit scale of about -5 to zero to +5.]
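The percent-to-logit conversion in that bracketed note can be sketched directly from the definition it gives:

```python
import math

def ability_logit(p_right):
    """Person ability in logits: natural log of the ratio of right to wrong marks."""
    return math.log(p_right / (1 - p_right))

for p in (0.01, 0.25, 0.50, 0.75, 0.99):
    print(f"{p:.0%} right -> {ability_logit(p):+.2f} logits")
```

A score of 50% sits at zero logits, and scores of 1% and 99% land near -4.6 and +4.6 logits, which is where the "about -5 to zero to +5" range comes from.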

Quantity and quality scoring provides specific information about which answer options are active, the level of thinking students are using, and the relative difficulty of questions that have the same number of right marks. IMHO this qualifies it as the method of choice for scoring Common Core State Standards multiple-choice items (and for preparation for such tests).

Forced guessing is no longer required to obtain results that look right. Experienced students prefer quantity and quality scoring. It is far more meaningful than playing the traditional role of an academic casino gambler.

Wednesday, October 24, 2012

Nebraska Student Assessment Four Star Rating

Nebraska is now poised for a fifth star. It has a standardized assessment that ranks, that measures, that produces meaningful student test scores, and is deeply influenced by the judgment of experienced classroom teachers. Nebraska is the first state I have found that has transparently documented that it is this close to an accurate, honest, and fair assessment.

Five critical features can rate the standardized tests used by state departments of education over the past ten years. These tests have evolved from just ranking students, teachers, and schools based solely on the judgment of psychometricians to including the judgment of teachers.

A five star rating would include the judgment of students to report what they know accurately, honestly, and fairly instead of guessing at the best answer to each multiple-choice question.

A test can earn three stars based on the judgment of psychometricians, one star on the judgment of teachers, and one star on the judgment of students. These are the three main stakeholders in a standardized multiple-choice test.

There are other stakeholders who make use of and market the test results. These secondary stakeholders often do not market the true nature of the standardized test in hand. Their claims may not match the test results.

ONE STAR: Any standardized multiple-choice test earns one star. The norm-referenced test compares one student with another. Raw test scores are plotted on a distribution. A judgment is then made where to make the cut scores. Many factors can be used in making this judgment. It can be purely statistical. It can attempt to match historical data. It can be a set portion for passing or failing. It can be whatever looks right. The cut score is generally marketed with exaggerated importance.

TWO STARS: A criterion-referenced test earns two stars. This test contains questions that measure what needs to be measured. It does not compare one student with another. It groups students with comparable abilities. Nebraska uses below standard, meets standard, and exceeds standard. This divides the score distribution into three regions. Cut scores fall at the point where a student has an equal chance of falling into either region. The messy nature of measuring student knowledge, skill, and judgment is transparent. Passing is preparing to meet the standard set for the median of the meets-standard region, not just preparing to be one point above the cut score.

THREE STARS: The much-decried right-count scored multiple-choice test performs better at higher test scores than at lower ones. Right marks on tests scored below 60% are questionable. Tests scored below 50% are as much a product of luck on test day as they are of student ability. We can, however, know what students do not know. Psychometricians like test scores near 50% as they lend stability to the test data. Nebraska designed its test for an average test score of 65%, plus the questions needed to cover the blueprint requirements for a criterion-referenced test. The Nebraska standardized 2010 Grade 3 Reading test produced an average score of 72%. Nebraska can know what students do know about 3/4 of the time: three stars.

FOUR STARS: Nebraska earns a fourth star for including teacher judgment in writing questions, in reviewing questions, and in setting the criterion-referenced standards. The three regions (below, meets, and exceeds standards) have meaning beyond purely statistical relationships. It was teacher judgment that moved the test design from an average score of 50% to 72%. The scores now look very much like those produced by any good classroom test. They can be interpreted and used in the same way.

FIVE STARS: Nebraska has yet to earn a fifth star. That requires student judgment to be included in the assessment system. When that is done, Nebraska will have an accurate, honest, and fair test that also meets the requirements of the Common Core State Standards.

Most right marks will also represent right answers instead of luck on test day (less churning of individual test scores from year to year). The level of thinking used by students on the test and in the classroom can also be obtained. All that is needed is giving students the option to continue guessing or to report what they trust they know.

*   Mark every question even if you must guess. Your judgment of what you know and can do (what is meaningful, useful, and empowering) has no value.
** Only mark to report what you trust you know or can do. Your judgment and what you know have equal value (an accurate, honest, and fair assessment).

Including student judgment will add student development (the ability to use all levels of thinking) to the Nebraska test. Students need to know and do, but they also need experienced judgment in applying knowledge and skills in situations different from those in which they learned.

Routine use of quantity and quality scoring in the classroom (be it multiple-choice, short answer, essay, project, or report) promotes student development. It promotes the sense of responsibility and reward needed to learn at all levels of thinking (passive pupils become active, self-correcting learners). IMHO, if students fail to develop this sense of responsibility, the Common Core State Standards movement will also fail.

Software to do quantity and quality scoring has been available for over two decades. Nebraska is already using Winsteps. Winsteps contains the partial credit Rasch model routine that scores quantity and quality. 

Power Up Plus (PUP) scores multiple-choice tests by both methods: traditional right count scoring and Knowledge and Judgment Scoring. Students can elect which method they are most comfortable with in the classroom and in preparation for standardized tests.

Since 2005, Knowledge Factor has had a patented learning system that guarantees student development. High-quality students generally pass standardized tests. All three programs promote the sense of responsibility and reward needed to learn at all levels of thinking, a requirement to meet the Common Core State Standards.


General References:

Roschewski, Pat (June 2004) History and Background of Nebraska’s School-based Teacher-led Assessment and Reporting System (STARS). Educational Measurement: Issues and Practice, Volume 23, Issue 2, pages 9-11. (accessed online 6 Oct 2012)

Rotherham, Andrew J. (July 2006) Making the Cut: How States Set Passing Scores on Standardized Tests. (accessed online 6 Oct 2012)

Wednesday, October 17, 2012

Your Standardized Consortium Test

Two consortia (PARCC and SBAC) are again working on tests that go beyond simple questions that can be answered at all levels of thinking. The questions will go through the usual calibration, equating, and bias-review processes. And, to the best of my knowledge, they will continue to be right-count scored, at the lowest levels of thinking.

Trying to assess 21st century skills (bicycling) with the same old tricycles (forced choice tests) seems rather strange to me. And more so when the test is to assess college and job preparedness. These tests are to do more than create a ranked scale on which a predetermined portion will pass or fail as has been used in past years. These tests are supposed to actually measure something about students rather than produce just a ranked performance on a test.

Trying to raise the level of thinking required on a test at the beginning of NCLB resulted in a lot of very clever questions. I have no idea if one could actually figure out why or how students answered the questions with respect to why they were on the test. On a forced-choice test you just mark. On a quantity and quality scored test, student responses fall into Expected, Guessing, Misconception, and Discriminating because students only mark when they trust they do know or can do – an accurate, honest, and fair test is obtained with no forced guessing required.

Higher levels (orders) of thinking involve metacognition: the ability to think about one’s own thinking, the ability to question one’s own work, and the ability to be self-correcting. These abilities are assessed with quantity and quality scoring of multiple-choice tests. The quality score indicates the degree of success each student has in developing these abilities (when learning and when testing). The quantity score measures the degree of mastery of knowledge and related skills. It is not that they know the answer but that they have developed the sense of responsibility to function at all levels of thinking and can therefore figure out the answers.

My own experience has been that students learn metacognitive test-taking skills quickly when routinely assessed with a test scored for knowledge and judgment. (Over 90% voluntarily switched from guessing at answers [traditional right count scoring] to quantity and quality scoring after two experiences with both.) It took them three to four times as long to apply these skills to learning: to reading or observing, with questions; to building a set of relationships that permitted them to verify that they could trust what they knew; and to applying their knowledge and skill to answering questions they had not seen before.

The two consortia have to make a choice between beefing up traditional forced-choice multiple-choice tests or simply changing the test instructions so students can continue with multiple-guess or switch to reporting what they trust they know (quantity and quality scoring). I am not convinced that beefing up traditional forced-choice questions will produce the sought-after results. The new questions must still be guarded against guessing, as students are still forced to guess. The guessing problem is solved by letting students report what they trust they know using quantity and quality scoring – no guessing required.

Two sample items from SBAC show how attempts are being made to improve test items. A careful examination indicates that again, we are facing clever marketing.

“Which model below best represents the fraction 2/5?”

“Even if students don’t truly have a deep understanding of what two-fifths means, they are likely to choose Option B over the others because it looks like a more traditional way of representing fractions. Restructuring this problem into a multipart item offers a clearer sense of how deeply a student understands the concept of two-fifths.”

The word “best” is a red flag. Test instructions often read, “Mark the best answer for each item.” It means: Guess whenever you do not know, do not leave an item unmarked. Your test score is a combination of what you know and your luck on test day. Low ability students and test designers are well aware of this as they plan for each test.

“Best” in the item stem is also a traditional lazy way of asking a question. A better wording would be “is the simplest representation of”. There would then be just one right answer for the right reason: “the simplest representation” rather than “a more traditional way of representing”. Marketing. I agree that the item needs to be restructured or edited.

“For numbers 1a-1d, state whether or not each figure has 2/5 of its whole shaded.”

“This item is more complex because students now have to look at each part separately and decide whether two-fifths can take different forms. The total number of ways to respond to this item is 16. ‘Guessing’ the correct combination of responses is much less likely than for a traditional four-option selected-response item.”

The comment states that students must now “look at each part separately and decide” each of four yes/no answers. The item may be more complex to create with four answers but the answering is simpler for the student. Marketing.

Grouping four yes/no answers together to avoid the chance score of 50% is clever. The 2x2x2x2 (16) ways would become 3x3x3x3 (81) ways using quantity and quality scoring (if students were to mark at the lowest levels of thinking)! The catch here is that the possible ways and the probability of those ways are not the same thing. It is the functional ways, the response patterns that draw at least 5% of the marks, that matter. If only four ways were functional on the test, then all of the above reduces down to a normal four-option item. Scoring the test for quantity and quality eliminates the entire issue, as forced guessing is not required when students have the opportunity to report what they trust accurately, honestly, and fairly. If you do not force students to guess, you do not need to protect test results from guessing.
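The two counts are easy to verify. Forced marking gives each of the four parts two possible responses; allowing omit gives each part three:

```python
from itertools import product

# All possible response patterns across the four yes/no parts.
forced = list(product(["yes", "no"], repeat=4))             # forced marking
with_omit = list(product(["yes", "no", "omit"], repeat=4))  # omit allowed

print(len(forced))     # 16
print(len(with_omit))  # 81
```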

As I understand how this item will be scored, it condenses four items into one for a reason that is not entirely valid: guessing control. The statement that “students now have to look at each part separately” is presented in such a way that it implies they would not “have to look at each part separately” on the first example. Marketing again. Since there is no way to predict how an item will perform, we need actual test data to support the claims being made.

These two examples are not unique in striving to assess higher levels of thinking by combining two or more simple items into a more complex item. I dearly love the TAKS question that Maria Luisa Cesar included in her San Antonio Express-News article, 4 December 2011, 1B and 3B. Two simple questions along the lines of, “Is this figure: A) a hexagon, B) an octagon, C) a square, D) a rectangle?” have been combined.

I was faced with this kind of question on my first day in school: “Color each of the six circles with the correct color.” I did not know my colors. I had six circles and six crayons. I lined up the crayons on the left side of my desk. After coloring a bit of each circle with a crayon, I put it on the right side of my desk. I had colored each circle with the correct color.

The same reasoning would get a correct answer here without knowing anything about hexagons or octagons: the figures are not the same. That leaves 7 sides and 5 vertices. Seven sides is not correct. So 5 vertices must be correct, whatever a “vertice” is.

The STAAR question figures are composed of vertices (4, 6, 5, 6), faces (5, 6, 5, 4), and edges (5, 9, 8, 9). A simple count of each yields a match only with option C. No knowledge of the geometric figures is required at the lowest levels of thinking.

The problem here is that the question author was thinking like a normal adult teacher. It took me a couple of years using quantity and quality scoring (PUP) to make sense of the thinking students use when faced with a test question. I divided the sources of information that students used into two parts. One part is what students learned by just living: robins have red breasts and blue eggs. The other part is what they have learned in a reasoned, rational manner. These are roughly lower and higher levels of thinking, recall and formal, or passive and active learning.

On top of this is the human behavior of acting on what one believes rather than on what one knows. Here we are at the source of misconceptions that are very difficult to correct in most students and adults. (Teachers and teacher advocates have a pathological bias against free enterprise even when it generates the funds for their employment [and solves problems the educational bureaucracy fails to solve]. They also have an inability to relearn to use a multiple-choice test to assess what students actually know rather than using it just to rank students.)

In summary, improving assessment by taking the old tricycle and adding dual wheels with deeper tread (multitasking and multiple-part items) is really not enough. It is time to move on to the bicycle, where the student is free to report what is trusted as the basis for further learning and instruction (spontaneous student judgment replaces that passive third wheel – waiting for the teacher to perform and correct).

And even more important is to create the environment in which students acquire the sense of responsibility needed to learn at higher levels of thinking. Scoring classroom tests for knowledge and judgment (PUP and the partial credit Rasch model) does this: it promotes student development as well as knowledge and skill. Only when struggling students actually see and can believe they are receiving credit for knowing what they know, rather than for their luck on test day, have I seen them change study and test-taking habits.

Kaitlyn Steigler sums it up nicely in an article by Jane Roberts: “It used to be, I do, we do together, now you do.” “Now, the kids will take charge. The teaching will be based on what we figure they know or don’t know.” PUP scores multiple-choice tests both ways, so students can switch to reporting what they trust when they are ready. Then self-correcting students, as well as their teachers, will know what they know when they are learning, during the test, and as the basis for further learning.

Wednesday, October 10, 2012

Your Standardized State Test

A standardized state test is created in the same way as a standardized classroom test (see prior post) with a few exceptions: 1. The initial questions are field-tested and calibrated. 2. The mastery questions are rejected (a standardized state test is not concerned with what students actually know because of the following exception). 3. Only discriminating items are selected for the operational test, to make the operational test as powerful as possible with the fewest items, for ranking schools, teachers, and students.

State test results can be parsed by inspection of what happened, at all levels of thinking, in the same way as classroom test results, using scores and portions of scores. However, state test results are usually parsed based on standardized expected scores and portions of scores. A set of common items is sprinkled through the distribution of each test. If the common items perform the same on both tests, then the two tests are declared of equal difficulty. Unfortunately this practice does not work like pixie dust. The common items sometimes fail. Florida suffered a marked increase in 2006 (the highest value ever reported on the Grade 3 FCAT Reading SSS) followed by a marked decrease in 2007.

How state test results are reported to the public has, therefore, evolved from risky raw scores, to percent passing, to increases over last year, to fairly safe equipercentile equating. (This creativity carries on into how states inflate their educational progress based on their standardized test results.) A method of transition equipercentile equating was initiated with the Grade 3 FCAT 2.0 Reading (2011) test to help solve problems created by ranking students on traditional forced-choice, right-count scored tests when introducing a new form of a test. It is rather clever marketing.
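The equating step itself is not mysterious: a score on the new form is mapped to the old-form score that sits at the same percentile rank. A minimal sketch, using made-up score lists and none of the distribution smoothing a real equating program would apply:

```python
def percentile_rank(scores, score):
    """Fraction of scores at or below the given score."""
    return sum(s <= score for s in scores) / len(scores)

def equipercentile_equate(new_scores, old_scores, score):
    """Map a new-form score to the old-form score with the same percentile rank."""
    target = percentile_rank(new_scores, score)
    for s in sorted(set(old_scores)):
        if percentile_rank(old_scores, s) >= target:
            return s
    return max(old_scores)

new_form = [10, 12, 14, 15, 16, 18, 20]  # hypothetical new-form scores
old_form = [12, 14, 16, 17, 18, 20, 22]  # hypothetical old-form scores
print(equipercentile_equate(new_form, old_form, 15))  # 15 on the new form maps to 17
```

Whatever percentage of students passed on the old form will, by construction, pass on the new one, which is exactly why the method produces the stable, politically safe results described below.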

·      2010 Last FCAT test reported in old achievement levels.
·      2011 First FCAT 2.0 test reported in old achievement levels.
·      2011 First FCAT 2.0 test reported in new achievement levels.
·      2012 Second FCAT 2.0 test reported in new achievement levels.

[Table: FCAT Reading – Sunshine State Standards; % of students at Achievement Levels 3-5, by year and achievement-level definition: 2010 old, 2011 old, 2011 new, 2012 new.]

“The scores are being reported in this way to maintain consistent student expectations during the transition year.” There is no delay in publishing results in the transition year of 2011; just parse the 2011 scores with the 2010 achievement levels. The Department of Education then has one year to get all interested parties together to create the new achievement levels.

 “Although the linking process does not change the statewide results for this year [2011], it does provide different results for districts, schools, and students.” The 2012 results confirm the 2011 results. This looks good. This stability is highly prized as an indicator that the Department of Education is doing a good job in a difficult situation.

However when equipercentile equating was used on the Grade 4 writing test it created a furor. When announced in advance with lots of time for all interested parties to maneuver, equipercentile equating was acceptable on the Grade 3 reading test. When applied as a stopgap measure on the Grade 4 writing test, it failed. The rankings from test scores are therefore a very political matter: the right portion must pass and fail rather than the test being a measure of some identified student ability.

The Center on Education Policy (CEP) sent an open letter to the member states of SBAC and PARCC, 3 May 2012 suggesting: “Routinely report mean (average) scores on your assessments for students overall and for each student subgroup at the state and local levels, as well as across the consortium. This should be done in addition to reporting the percentages of students reaching various achievement levels.” We need creative teaching, not creative manipulation of test results.

In conclusion, standardized state tests are now much closer to standardized classroom tests. Reasonable attempts are made to select questions that will produce a workable distribution for ranking students, teachers, and schools. The classroom teacher is replaced with committees of experts. The test results are then inspected to see what happened by another set of committees of experts just as a teacher would inspect classroom results at all levels of thinking. (The state has one year to do what a classroom teacher does in one hour.)

The largest remaining failure in all of this, IMHO, is that all of this work is being done using a scoring method that functions at the lowest levels of thinking: the right count scored multiple-choice test. Although examiners are now giving themselves the opportunity to use their best judgment, at all levels of thinking, in interpreting test scores (as classroom teachers always have), they have yet to give students the opportunity to use their best judgment, at all levels of thinking, to mark answers they trust as the basis for further learning and instruction.

To obtain accurate, honest and fair results, students must be given the opportunity to report what they trust – no guessing required. It only takes a change in test instructions. PUP, Winsteps, and Amplifire can score a multiple-choice test at all levels of thinking. If we want students to be skillful bicycle riders, we must stop testing them only on tricycles.

Wednesday, October 3, 2012

Your Standardized Classroom Test

A standardized classroom test makes a neat model for state and consortium standardized tests. All you need is an easy way to produce multiple-choice questions and the proper test scoring software.

I used True Test Writer (Version 2.06, copyright 2002-2004, still available free on request). It had the ability to randomize both answer options and test items, and to select one of two right answers. Several more advanced test writer programs that include Internet features are now listed at the Educational Software Cooperative, a non-profit.

Multiple-choice questions are easily based on a standardized paragraph:

  •  Introductory sentence.
  • Three or more descriptive sentences, charts, tables, pictures, sketches (what is and is not related). 
  • Summary sentence.

        These become:

  • Edited to be the first right answer.
  • Edited to be wrong answers (what is not acceptable).
  • Edited to be the second right answer.

Now there is no way to predict how any question will perform or if students will answer it for the reason you put it on the test. As a rule of thumb, about 2/3 of your questions will be of value in determining what students know and can do and of value in assigning grades. I fielded (placed on the test) about 50 questions for a “one hour” test.

Score the test, for quantity and quality, knowledge and judgment, for the most useful information. Examine the Mastery, Unfinished, and Discriminating items.

Discard items that failed to perform well and keep good items. You need enough mastery items to produce an average score of about 75% and to survey what all of your students really know or can do. You need about 10 discriminating items to yield a good grade spread (a mean of 75% with a standard deviation of about 10%) and to separate student groups that know from student groups that do not. Now eliminate unfinished items unless an item is one that students really should have been able to answer, based on your judgment and experience with the class.
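The sorting described above can be sketched as a simple rule. The category names follow the post; the numeric cut points and the use of a discrimination index are my own illustrative assumptions, not PUP's actual rules:

```python
def classify_item(p_right, discrimination):
    """Illustrative sort into Mastery / Discriminating / Unfinished.

    p_right: fraction of students answering the item correctly.
    discrimination: how well item success tracks total score (e.g. point-biserial).
    Both thresholds below are assumed for illustration only.
    """
    if p_right >= 0.80:
        return "Mastery"
    if discrimination >= 0.30:
        return "Discriminating"
    return "Unfinished"

print(classify_item(0.90, 0.10))  # Mastery: nearly everyone gets it right
print(classify_item(0.55, 0.45))  # Discriminating: separates groups that know
print(classify_item(0.40, 0.05))  # Unfinished: hard and uninformative
```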

With Power Up Plus (PUP), edit the deletions on Sheet 2, and click Score for standardized scores. If you use active testing (rather than passive testing where once done there are no changes) you can edit deletions on Sheet 2 and click Score again after discussing the Discriminating and Unfinished items in class. PUP has the provision to give every student in the class a point when the class discussion seems very productive or the item was a terrible waste of time.

PUP provides information that assesses your teaching, the test questions, and student performance. You need only print out and post the final scores after deletions and adjustments. You decide what needs to be re-taught to the class and which students need individual attention. Scoring for quantity and quality (PUP, Winsteps and Amplifire) gives you the advantage of knowing the level of thinking students are using and their academic maturity, their development in assuming responsibility for learning. PUP scores both traditional right count, and quantity and quality. Each student can make that choice. [Most will give up their old tricycle (guessing) once they have some experience with their new bicycle (reporting what they trust).]

The above sequence of question writing, student response, selecting the “true test” from within the fielded questions, and adjusting the cut score is a good model of how state and national standardized tests are conducted. There is one big difference: students have about a one-letter-grade advantage taking your tests over tests created by other writers. You are more on target, and you are using the vocabulary your students are familiar with. So beware! Your students need at least a one-letter-grade handicap. Just passing in class will be just failing on standardized tests.