The Alaska Reading Standards Based Assessments contain three features worthy of a star. In 2011, they show a matched comparison analysis that provides insight into the dynamic nature of student assessment. In 2001, they also contain traditionally set cut scores and questions that are easy enough to provide an actual measurement of what students know and can do.
ONE STAR: In a matched comparison analysis, Alaska recorded whether each student's reading score increased, decreased, or stayed the same (stable) from one year to the next for 2008-2009, 2009-2010 and 2010-2011. The charts present static and dynamic views.
The portion of students in the Far Below Proficient and Below Proficient Stable group remained the same for all three comparisons. The portion of students in the Proficient and Advanced Stable group shows a very small decline from year to year. The portion of students showing a decrease in performance matched the portion showing an increase in performance. This is a static view.
The dynamic view shows much more is going on in this assessment system. The two Stable groups above held steady because about as many students moved from Below Proficient last year to Proficient this year (improved in proficiency) as moved from Proficient last year to Below Proficient this year (decreased in proficiency).
This balanced exchange also took place between Proficient and Advanced levels of performance. In total, about 26% of all students changed proficiency levels each year (about 6% of the students crossed each of the two cut scores in both directions).
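The difference between the static and dynamic views can be sketched in a few lines of Python. This is only an illustration: the ten matched records are hypothetical, not Alaska data, and the function names are made up for the example. It shows how the number of students at each level can be identical in both years (static view) while a sizable share of individual students change levels (dynamic view).

```python
from collections import Counter

# Proficiency levels in ascending order (names taken from the text).
LEVELS = ["Far Below Proficient", "Below Proficient", "Proficient", "Advanced"]


def level_counts(pairs, year_index):
    """Static view: how many students sit at each level in one year."""
    return Counter(pair[year_index] for pair in pairs)


def churn_summary(pairs):
    """Dynamic view: share of matched students who stayed, rose, or fell."""
    rank = {name: i for i, name in enumerate(LEVELS)}
    counts = Counter()
    for last_year, this_year in pairs:
        change = rank[this_year] - rank[last_year]
        if change > 0:
            counts["up"] += 1
        elif change < 0:
            counts["down"] += 1
        else:
            counts["stable"] += 1
    total = len(pairs)
    return {key: counts[key] / total for key in ("stable", "up", "down")}


# Hypothetical (last_year, this_year) records for ten matched students.
example_pairs = [
    ("Below Proficient", "Proficient"),      # crossed the lower cut score upward
    ("Proficient", "Below Proficient"),      # crossed it downward
    ("Proficient", "Advanced"),              # crossed the upper cut score upward
    ("Advanced", "Proficient"),              # crossed it downward
    ("Far Below Proficient", "Far Below Proficient"),
    ("Below Proficient", "Below Proficient"),
    ("Proficient", "Proficient"),
    ("Proficient", "Proficient"),
    ("Proficient", "Proficient"),
    ("Advanced", "Advanced"),
]

static_last = level_counts(example_pairs, 0)
static_this = level_counts(example_pairs, 1)
summary = churn_summary(example_pairs)

print(static_last == static_this)  # the static view looks unchanged
print(summary)                     # the dynamic view shows the exchange
```

In this made-up sample the level counts match exactly from one year to the next, yet four of the ten students changed levels, two up and two down.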
There are several reasons for this churning. The most obvious is variation in student preparation from year to year (any one set of questions will match one portion of the students better than the rest of the examinees). Another is how lucky each student was on test day. This brings up test design.
TWO STARS: The Alaska test compares student performance (norm-referenced). This is the most common and least expensive way to create a standardized test. It also forces students to mark answers even when they cannot read or understand the questions. This is called right count scoring, the traditional way of scoring classroom tests. It produces a score that can be used to validly rank student performance.
THREE STARS: The 2001 Alaska Technical Report, page 18, shows the average test scores for Reading ranged from 67% to 72% for grades 3, 6, and 8. Scores above 60% can indicate what students actually know and can do rather than their luck on test day. (The publication of average raw test scores is now considered essential to permit validation of the test results and comparison with other states using the same Common Core State Standards test.) [The Spring 2006 Alaska Standards Based Assessments, Chapter 8, did not list the average raw test scores: no star.]
SCORE VARIATION: The 2001 report, page 25, also shows the standard error of measurement (SEM), an estimate of where each student’s score would land on the cut-score-divided distribution if the student could repeat the test. The example for Reading grade level 3 shows that 2/3rds of the time the repeated test scores of student “A” would fall between 388 and 442 scale score units (415 original score ± 27 SEM). That is 27/351 or 7.7% of the test mean, or 27/600 or 4.5% of the full-scale score. (The SEM is derived from the test reliability and the standard deviation in scale score units. A smaller, more desirable SEM can be produced by a higher test reliability and a lower standard deviation.)
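The relationship between SEM, reliability, and standard deviation can be sketched as follows. The formula used is the standard classical test theory one, SEM = SD x sqrt(1 - reliability); the text does not say which formula the Alaska report used, so treat that as an assumption. The reliability of 0.90 and the use of an 83-point standard deviation are hypothetical inputs chosen only to show the direction of the effect.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """Classical test theory estimate: SEM = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)


def score_band(score, sem):
    """Range expected to hold about 2/3rds of repeated scores (score +/- 1 SEM)."""
    return score - sem, score + sem


# Figures quoted in the text for Reading grade 3: 415 +/- 27.
low, high = score_band(415, 27)
print(low, high)  # 388 442

# Hypothetical inputs (not from the report): SD = 83, reliability = 0.90.
baseline = standard_error_of_measurement(83, 0.90)
higher_reliability = standard_error_of_measurement(83, 0.95)
lower_sd = standard_error_of_measurement(60, 0.90)

print(round(baseline, 1))
print(higher_reliability < baseline)  # higher reliability shrinks the SEM
print(lower_sd < baseline)            # lower standard deviation shrinks it too
```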
The standard deviation of the raw scores and of the scale scores (page 18) provides a more direct view of the variation in student test scores. To find it, take the deviation of each student score from the test mean, square each deviation, add up the squares, and divide by the number of scores (the variance); then take the square root to return to the original score units. (Squaring makes all the deviations positive; otherwise they would add up to zero.)
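The calculation just described can be followed step by step in a short Python sketch. The five scores are hypothetical and chosen only to keep the arithmetic easy to check by hand.

```python
import math


def population_standard_deviation(scores):
    """Standard deviation computed by dividing by the number of scores."""
    mean = sum(scores) / len(scores)
    squared_deviations = [(score - mean) ** 2 for score in scores]
    variance = sum(squared_deviations) / len(scores)
    return math.sqrt(variance)


# Hypothetical raw scores for five students (mean = 30).
scores = [22, 26, 30, 34, 38]
mean = sum(scores) / len(scores)

# Unsquared deviations cancel out, which is why they are squared first.
raw_deviation_sum = sum(score - mean for score in scores)
sd = population_standard_deviation(scores)

print(raw_deviation_sum)  # 0.0
print(round(sd, 2))       # 5.66 (square root of a variance of 32)
```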
The average standard deviation for the nine grade 3, 6, and 8 test raw scores was 8.8/30.1 or 29% of the test means; that is, 2/3rds of the time a student with an average score of 30.1 would be expected to have repeated test scores fall between 30.1 ± 8.8, or 21.3 to 38.9, on a test with 42 points total. Converting all of this into the log ratio (logit) units used by psychometricians produces slightly different results.
The average standard deviation for the nine grade 3, 6, and 8 test scale scores was 83/349 or 24% of the test means; that is, 2/3rds of the time a student with an average scale score of 349 would be expected to have repeated scale scores fall between 349 ± 83, or 266 to 432, on a scale score range of 500 points (100 to 600).
Both SEM and standard deviations show a large amount of uncertainty in test scores. The documentation of this churning is worth a third star. The variation inherent in any attempt to capture student performance in a single number accounts for much of the churning observed from year to year. Scoring these tests for quantity and quality instead of just counting right marks would yield much more useful information in line with the philosophy of the Common Core State Standards.
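The raw score and scale score figures above can be re-derived with a small Python sketch. The means and standard deviations are the ones quoted in the text; the function names are made up for illustration.

```python
def percent_of_mean(sd, mean):
    """Standard deviation expressed as a whole-number percent of the mean."""
    return round(100 * sd / mean)


def one_sd_band(mean, sd):
    """Mean plus and minus one standard deviation, rounded to one decimal."""
    return round(mean - sd, 1), round(mean + sd, 1)


# Raw scores: mean 30.1, standard deviation 8.8, on a 42-point test.
raw_percent = percent_of_mean(8.8, 30.1)
raw_band = one_sd_band(30.1, 8.8)

# Scale scores: mean 349, standard deviation 83, on a 100 to 600 scale.
scale_percent = percent_of_mean(83, 349)
scale_band = one_sd_band(349, 83)

print(raw_percent, raw_band)      # 29 (21.3, 38.9)
print(scale_percent, scale_band)  # 24 (266, 432)
```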
THREE OTHER STARS: Alaska places emphasis on cut scores on a single score distribution (norm-referenced). Nebraska (see previous post) places emphasis on two other score distributions (two stars): It groups scores both by asking the questions needed to assess specific knowledge and skills (criterion-referenced) and by teacher judgment of which group each student they know well fits into. Cut scores fall where a student score has an equal probability of falling into either group.
Both Alaska and Nebraska have yet to include student judgment in their assessments (one star). When that is done, Alaska will have an accurate, honest, and fair test that better matches the requirements of the Common Core State Standards.
Most right marks will then represent right answers instead of luck on test day, and there will be less churning of student performance rankings. The level of thinking used by students on the test and in the classroom can also be obtained. All that is needed is to give students the option to continue guessing or to report what they trust they know.
* Mark every question even if you must guess. Your judgment of what you know and can do (what is meaningful, useful, and empowering) has no value.
** Only mark to report what you trust you know or can do. Your judgment and what you know have equal value (an accurate, honest, and fair assessment).
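The two options can be compared with a short Python sketch. The point weights are an assumption made for illustration, not something stated in the text: a right mark earns 2 points, an omitted question earns 1 point (credit for judgment), and a wrong mark earns 0 points, which is one way to give knowledge and judgment equal value. The two answer sheets are hypothetical.

```python
# Each answer is recorded as "R" (right mark), "W" (wrong mark), or "O" (omitted).

def right_count_score(marks):
    """Traditional scoring: fraction of questions marked right."""
    return marks.count("R") / len(marks)


def knowledge_and_judgment_score(marks):
    """Assumed weighting: right = 2, omit = 1, wrong = 0, as a fraction of the maximum."""
    points = {"R": 2, "O": 1, "W": 0}
    return sum(points[mark] for mark in marks) / (2 * len(marks))


def quality_score(marks):
    """Fraction of marked answers that are right; None if nothing was marked."""
    marked = marks.count("R") + marks.count("W")
    return marks.count("R") / marked if marked else None


# Hypothetical ten-question answer sheets with the same six right marks.
guesser = ["R"] * 6 + ["W"] * 4    # marked every question, guessed on four
reporter = ["R"] * 6 + ["O"] * 4   # marked only what was trusted

for name, sheet in (("guesser", guesser), ("reporter", reporter)):
    print(name,
          right_count_score(sheet),
          knowledge_and_judgment_score(sheet),
          quality_score(sheet))
```

Under right count scoring both sheets earn the same score. Under the assumed weighting, the student who reported only trusted answers scores higher and shows a higher quality score.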
Including student judgment will add student development (the ability to use all levels of thinking) to the Alaska test. The Common Core State Standards need students who know and can do, but who also have experienced judgment in applying knowledge and skills.
Routine use of quantity and quality scoring in the classroom promotes student development. It promotes the sense of responsibility and reward needed to learn at all levels of thinking, a requirement of the Common Core State Standards.
Software to do quantity and quality scoring has been available for over two decades. Alaska is already using Winsteps. Winsteps contains the partial credit Rasch model routine that scores quantity and quality.
Power Up Plus (PUP) scores multiple-choice tests by both methods: traditional right count scoring and Knowledge and Judgment Scoring. Students can elect which method they are most comfortable with in the classroom and in preparation for Alaska and Common Core State Standards standardized tests.
Since 2005, Knowledge Factor has had a patented learning system that guarantees student development. High quality students generally pass standardized tests. All three programs promote the sense of responsibility and reward needed to learn at all levels of thinking, a stated requirement of the Common Core State Standards movement.