Wednesday, October 30, 2013

Growth Mindset

The article by Sarah D. Sparks starts with a powerful concept: “It’s one thing to say all students can learn, but making them believe it – and do it – can require a 180-degree shift in students’ and teachers’ sense of themselves and of one another.”

The General Studies Remedial Biology course I taught faced this challenge. The course was scheduled at night for three consecutive hours in a 120-seat lecture room. I refused to teach the course until the following arrangements were made:
  • The entire text was made available as online reading assignments, delivered by cable to each dormitory room and by phone service off campus.
  • One hour was scheduled for my lecture, after any student presentations related to the scheduled topic. 
  • One hour was scheduled for written assessment every other week.
  • One hour was scheduled for 10-minute student oral reports based on library research, actual research, or projects.

After the first few semesters, students requested that the assessment period be moved from the second hour to the first hour. This turned the course into a seminar for which students needed to prepare on their own before class.

Only Knowledge and Judgment Scoring (KJS) was used the first few semesters, with ready acceptance by the class. The policy of busing in students from outside the northwest Missouri region brought in protestors: “Why do we have to know what we know, when everywhere else on campus we just mark, and the teacher tells us how many right marks we made?”

Offering both methods of scoring, traditional multiple-choice (TMC) and KJS, on the same test solved that problem. Students could select the method they felt most comfortable with, the one that best matched their preparation.

The student presentations and reports were excellent models for the rest of the class. They showed the interest in the subject and the quality of work these students were doing to the entire class.

KJS provided the information needed to guide passive pupils along the path to becoming self-correcting scholars. As a generality, that path took the shape of a backward J. First they made fewer wrong marks, next they studied more, and finally they switched from memorizing nonsense to making sense of each assignment.

Over time they found they were spending less time studying (reviewing everything) and getting better grades by making sense as they learned; they could build new learning on what they could trust they had already learned. They could monitor their progress by checking their quality score and their quantity score. Get quality up, and interest and motivation increase; quantity follows.

The tradition of students comparing their score with the rest of the class, to see if they were safe, needed to study more, or had a higher grade than expected when enrolling in the course (and could take a vacation), was strong in the fall semester with the distractions of social groups, football, and homecoming. The results of fall and spring semesters were always different.

There was one dismal failure. With the excellent monitoring of progress in the course, the idea was advanced to recognize class scholars. These students had, in one combination or another of test scores and presentations, earned a class score that could not be changed by any further assessment. They had demonstrated their ability to make sense of biological literature (the main goal of the course), which, hopefully, would serve them well the rest of their lives, along with the habit of making sense of assignments in their other courses. The next semester all went as planned. Most continued in the class and some conducted study sessions for other students.

The following semester witnessed an outbreak of cheating. Power Up Plus (PUP) gets its name today from the original cheat checker added to Power Up. Cheating became manageable by the simple rule that any answer sheet that failed to pass the cheat checker would receive a score of zero. I offered to help any student who wished to protest the rule to the student disciplinary committee. No student ever protested.
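The post does not describe how the original cheat checker worked, so the following is only a hypothetical sketch of one common approach: flag pairs of answer sheets that share an improbable number of identical wrong marks. The function names and the threshold are illustrative assumptions, not PUP's actual algorithm.

```python
from itertools import combinations

def identical_wrong_answers(a, b, key):
    """Count questions where two students made the same wrong mark."""
    return sum(1 for x, y, k in zip(a, b, key) if x == y and x != k)

def flag_pairs(sheets, key, threshold=5):
    """Flag student pairs whose shared wrong marks meet the threshold."""
    return [(i, j) for (i, a), (j, b) in combinations(sheets.items(), 2)
            if identical_wrong_answers(a, b, key) >= threshold]

key = "ABCDABCDAB"
sheets = {
    "s1": "ABCDABCDAB",   # all correct
    "s2": "BBCDACCDAA",   # wrong on questions 1, 6, and 10
    "s3": "BBCDACCDAA",   # identical sheet, including the same wrong marks
}
print(flag_pairs(sheets, key, threshold=3))  # [('s2', 's3')]
```

Shared right answers prove little (well-prepared students agree); shared wrong answers are what make a pair improbable, which is why only those are counted.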

[Cheating was handled in class because use of the university rules was not honored by the administration: you must catch individual students in the act. Computer cheat checkers had the same status as red-light cameras do now. If more than one student is caught, the problem is with the instructor, not with the student. We cancelled the class scholar idea.]

We need effective tools to manage student “growth mindset”. The tools must be easy for students and faculty to use. Students need to see how other students succeed, to be comfortable taking part, and to be able to easily follow their progress when starting at the low end of academic preparation in knowledge, skills, and judgment (quality, the use of all levels of thinking).

A common thread runs through successful student empowerment programs: effective instruction is based on what students actually know, can do, and want to do or take part in. This requires frequent, appropriate assessment at each academic level.

Welcome to the KJS Group: Please register at Include something about yourself and your interest in student empowerment (your name, school, classroom environment, LinkedIn, Facebook, email, phone, etc.).

Free anonymous download, Power Up Plus (PUP), version 5.22 containing both TMC and KJS:, 606 KB or, 1,099 KB.

- - - - - - - - - - - - - - - - - - - - - 

Other free software to help you and your students experience and understand how to break out of traditional multiple-choice (TMC) and into Knowledge and Judgment Scoring (KJS) (tricycle to bicycle):


Wednesday, October 23, 2013

Alternative Multiple-Choice Origins

Two alternative forms of multiple-choice (AMC) to traditional multiple-choice (TMC) developed from independent sources. Geoff Masters of Melbourne, Australia, is credited as the developer, in 1982, of the partial credit Rasch model, a form of Item Response Theory (IRT) analysis (Bond and Fox). It allows students to report what they know (2 points), what they do not know (1 point), and a wrong answer (0 points). It never became popular on classroom or standardized tests.

The second form of AMC was developed at NWMSU. It started as net yield scoring (NYS) on both essay and multiple-choice. I needed a way to reduce the amount of reading required in scoring “blue book” essays. A 20-point essay started with 10 points. A point was added for acceptable, related, information bits. A point was subtracted for unacceptable, incorrect, unrelated information bits. An information bit was basically a short sentence with correct grammar and spelling. It could also be a relationship expressed as a diagram, sketch, or drawing.

This reduced the amount of reading by more than a third and improved student performance. Snow, filler, and fluff had no value but distracted a student from doing good work. Students needed to exercise good judgment in selecting what they wrote. It was no longer a case of students writing, and the teacher searching for, anything that could earn sufficient credit to pass the course, a lower level of thinking operation that is very common in high schools and colleges. NYS required students to use good judgment as well as be knowledgeable and skilled.
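The essay rule above fits in a few lines. A minimal sketch, assuming the score is clamped to the 0–20 range (the post gives the starting value and the point rules but does not state the bounds):

```python
def net_yield_score(acceptable, unacceptable, base=10, max_score=20):
    """Net yield scoring for a 20-point essay: start at base,
    +1 per acceptable information bit, -1 per unacceptable one.
    Clamping to [0, max_score] is an assumption; the original
    rules state only the starting value and the point changes."""
    return max(0, min(max_score, base + acceptable - unacceptable))

print(net_yield_score(8, 2))   # 16: solid work, little fluff
print(net_yield_score(3, 6))   # 7: snow and filler cost points
```

The subtraction is what removes the incentive to pad an answer: an information bit is only worth writing when the student judges it is acceptable.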

This same idea was applied to computer scored multiple-choice tests with interesting results. When both TMC and NYS were offered on the same test, most students selected TMC on their first test. This is what they were familiar with. Over 90% of students elected NYS on their third test. Students also agreed that knowledge and judgment should have equal value.

By 1981 NYS was renamed Knowledge and Judgment Scoring (KJS) to reflect what was being assessed: good judgment and a right answer (2 points), the good judgment to report, with no mark, what has yet to be learned (1 point), and poor judgment, a wrong mark (0 points).
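The two scoring rules, and the quality score discussed below, follow directly from these point values. This is an illustrative implementation, not PUP's code; the quality score is assumed here to be the percent right among the questions a student actually marked:

```python
def tmc_score(marks, key):
    """Traditional multiple-choice: 1 point per right mark;
    blanks and wrong marks both count as zero."""
    return sum(1 for m, k in zip(marks, key) if m == k)

def kjs_score(marks, key):
    """Knowledge and Judgment Scoring: 2 points for a right mark,
    1 point for no mark (good judgment about what has yet to be
    learned), 0 points for a wrong mark."""
    points = 0
    for m, k in zip(marks, key):
        if m is None:
            points += 1          # omitted: judgment credit
        elif m == k:
            points += 2          # right answer with good judgment
    return points

def quality_score(marks, key):
    """Quality: percent right among the questions actually marked."""
    marked = [(m, k) for m, k in zip(marks, key) if m is not None]
    return 100.0 * sum(m == k for m, k in marked) / len(marked) if marked else 0.0

key   = ["A", "C", "B", "D", "A"]
marks = ["A", "C", None, None, "B"]   # 2 right, 2 omitted, 1 wrong

print(tmc_score(marks, key))                 # 2 of 5 (40%)
print(kjs_score(marks, key))                 # 2*2 + 2*1 + 0 = 6 of 10 (60%)
print(round(quality_score(marks, key), 1))   # 66.7 (2 right of 3 marked)
```

Note how omitting beats guessing under KJS: a student who knows what they do not know earns more than one who gambles and marks wrong.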

KJS requires and rewards students for using higher levels of thinking. The quality score is independent of the right-count score. A struggling student with a test score of 60% may also have earned a quality score of 90%.

With TMC there is no way of knowing what a student with a score of 60% actually knows (when a right mark is a right answer or just luck on test day). With KJS we can know what this student knows with the same degree of accuracy as a student earning a 90% score on a TMC test.

More importantly, this reinforces the student’s sense of self-judgment and encourages effort to do better. It is the equivalent to the note a teacher marks on a special paragraph in an essay, “Good work!”

KJS provides the information needed to tell student and teacher what has been learned and what has yet to be learned in an easy-to-use report. Often a trail of biweekly test scores would follow a backward J. Reducing guessing by itself did not increase the test score but moved the score to a higher quality. Low quality students needed to change their study habits. Low scoring, high quality students needed to study more.

Learning by questioning and establishing relationships gave students the basis for correctly answering questions they had never seen before. They then stumbled onto what I meant by “Make things meaningful (full of relationships) if your learning is to be really useful, empowering, and easy to remember.” They did not have to review everything for each cumulative test.

The most interesting finding was that when students mastered meaning-making, they found themselves doing better in all of their courses. This is what inspired me to continue to promote Knowledge and Judgment Scoring. Students learn best when they are in charge. The quality score was the “feel good” score for struggling students until their improving development produced the high scores earned by successful self-correcting students.


Wednesday, October 16, 2013

Knowledge and Judgment Scoring - Operational to Instructional

This post (and the next three) introduces why we need a KJS Group. The software, Power Up Plus (PUP), which contains both Knowledge and Judgment Scoring (KJS) and traditional multiple-choice (TMC), is now free to registered KJS Group members. Version 5.22 is free to teachers and administrators. Please see instructions below.

This reflects a change in the use of the software: from an operational program for scoring individual classroom tests to an instructional program for promoting student and teacher development in preparation for the CCSS movement assessments. Students and teachers can readily see the difference between lower and higher levels of thinking when students are offered the opportunity to report, in a non-threatening environment, what they actually trust they know and can do, which serves as the basis for further learning and instruction. Practicing on a tricycle is poor preparation for a riding test on a bicycle.

Last week I finished a series of 22 posts on this Multiple-Choice Reborn blog. The series makes clear that no amount of “statistical work” can extract from TMC-marked answer sheets some of the claims now being marketed about them. These tests can, at best, only do a good job of ranking students.

They so imperfectly and incompletely tell us what students know and can do that North Carolina is now spending six months figuring out how and where to place the cut scores on their new CCSS traditionally scored end-of-grade, multiple-choice math test results. 

[They must guess where to put the cut score on the results from uncommitted, low scoring, improperly prepared students who were guessing at the right answers to questions the test maker guessed would produce a satisfactory score distribution with high statistical reliability and precision. The more nonsensical the student mark data are, the more subjective the process.]

Accurate, honest, and fair testing can be done with Knowledge and Judgment Scoring and the partial credit Rasch model analysis. These methods allow students to report what they actually know and can do that is meaningful, useful, and empowering. Student development (the judgment to appropriately use all levels of thinking) is as important as knowledge and skills for successful students and employees (Knowledge Factor). 

The NCLB decade has laid the foundation for real change by making schools designed for failure (schools that promote students beyond their abilities rather than developing the abilities necessary for their success) so bad and so visible that something had to be done. The CCSS movement has rekindled the old alternative (to TMC) testing and authentic testing methods, with the addition of CAT and elaborate assessment methods.

My concern now is that, after a large amount of time and money has been spent promoting the CCSS movement ideals, a major part of the assessments will once again be reduced to traditional guess testing at the lowest levels of thinking.

Both KJS and TMC scoring can use the same test questions. In fact both methods are used on the same test to accommodate students working at all levels of thinking and with all degrees of preparation (PUP).

IMHO, KJS is a practical method of achieving the CCSS movement goals. It prepares students for standardized tests presented at all levels of thinking. [I still cannot predict when KJS or the partial credit Rasch model will be used on standardized tests, as current standardized tests are not designed to assess what students know or can do. They are designed, using the fewest questions, to produce an acceptable spread of student scores.]

Rather than only a rank of 60 on a test, a student using KJS may also get a quality score of 90% on the questions used to report what the student actually knows and can do, as well as a rank based on right marks. We then know what a “just passing” student knows with the same accuracy as a student earning a 90% score on a traditional test. This can be valuable formative assessment information.

Letting students tell us what they know or can do makes more sense than the guessing game now in use during preparation and assessment. Over 90% of my students preferred Knowledge and Judgment Scoring after just two experiences with it. Even students prefer an honest and fair test to gambling for a grade.

Past performance in my classroom is no guarantee of performance in your classroom unless you are a likeminded teacher, administrator, or test maker.

[The Educational Software Cooperative, Inc. (non-profit) closed this year (2013) after 20 years of operation during which I was the volunteer treasurer. It was founded to maximize the benefits of an individual computer: infinite patience, non-judgmental, and best of all, instant formative feedback. That level of instruction and record keeping has now been surpassed by the necessity for district wide record keeping systems operating online assessments keyed to CCSS learning objectives.]


Wednesday, October 9, 2013

Multiple-Choice Test Analysis - Summary

The past 21 posts have explored how classroom and standardized tests are traditionally analyzed. The six most commonly used statistics are made fully transparent in Post 10, Table 15, the Visual Education Statistics Engine (VESE) [free VESEngine.xlsm or VESEngine.xls]. One more statistic was added for current standardized tests. Numbers must be meaningful and understood to have valid, practical value.

  • Count: The count is so obvious that it should not be a problem. But it is a problem in education. Counting right marks is not the same as counting what a student knows or can do. Also, a cut score is often set by selecting a point in a range from 0% to 100%. A cut score of 50 means 50%. But a test administered as traditional multiple-choice starts each student at about 25% with 4-option questions, the expected score from guessing alone. [There is no way to know what low scoring students know, only their rank.]

  • Average: Add up all of the individual student scores and divide by the number of students to get the class or test average score. [There is no average student.] Classes or tests can be compared by their averages just as students can be compared by their counts or scores.

  • Standard Deviation (SD): Theoretically, two-thirds of the scores in a distribution are expected to fall within one SD of the average. A very well prepared (or very underprepared) class will yield a small SD. A mixed class, with both very high and very low scores, will yield a large SD (many A-B and D-F grades, with few C grades).

  • Item Discrimination: A discriminating question separates those who know (high scoring students) from those who do not (low scoring students). Every classroom test needs about ten of these to produce a grade distribution in which one SD is ten percentage points (a ten-point range for each grade).

  • Test Reliability: A test has high reliability when the results are highly reproducible. Standardized tests therefore use only discriminating questions. They rarely ask a question that almost all students can answer correctly. Traditional multiple-choice therefore does not assess what students actually know and value. Traditional standardized tests can only rank students.

  • Standard Error of Measurement (SEM): Theoretically, two-thirds of the time a student retakes the same test, the score is expected to fall within one SEM of the average. The SEM value fits inside the range of the SD. “Jimmy, you failed the test, but based on your test score and your luck on test day, each time you retake the test you have a 20% expectation of passing without doing any more studying.” The SEM precision is based on the reliability of the entire test.

  • Conditional Standard Error of Measurement (CSEM): The CSEM is based (conditioned) on each test score. This refinement in precision is a recent addition to traditional multiple-choice analysis. It has been part of Rasch model IRT analysis for decades.

Even the CSEM cannot clean up the damage done by forcing students to mark every question even when they cannot read or do not understand the question. Knowledge and Judgment Scoring and the partial credit Rasch model do not have this flaw. Both accommodate students functioning at all levels of thinking and all levels of preparation.  These two scoring methods are in tune with the objectives of the CCSS movement.


Wednesday, October 2, 2013

Visual Education Statistics - Conditional Standard Error of Measurement


[[Second Pass, 8 July 2014.  Equation 6.3 (cited below) in Statistical Test Theory for the Behavioral Sciences by Dato N.M. de Gruijter and Leo J. Th. van der Kamp, 2008, is the same as the calculation used in Table 29, in my 9 July 2014 post. On the following page they mention that the error variance is higher in the center and lower at the extremes. That distribution is the green curve on Chart 73. I did not see this relationship in the equation when this post was first posted, but do now in the visualized mathematical model (Chart 73).

Also the discussion of Table 24 has been updated to match the terms and values in Table 24.]]

Working on the conditional standard error of measurement (CSEM) is new territory for me. I always associated the CSEM with the Rasch model IRT analysis commonly used by state departments of education when scoring NCLB tests. I first had to Google for basic information.

If you are interested in the details, please check out these sources for the sample (n-1) equations. (Equation 6.14, which corrects the relative variance, was not included in the 2005 edition; it appears in the current 2008 edition. This represents significant progress in applying test precision.)

  •        Absolute Error Variance                 Equation 5.39 p. 73
  •        Relative Error Variance                  Equation 6.3 p. 83
  •        Corrected Relative Variance           Equation 6.14 p. 91 or GED Equation 3 p. 9

My first surprise was to find that I had already calculated the CSEM for the Nursing124 data when I put up Post 5 of this series (Table 8, Interactions with Columns [Items] Variance, MEAN SS = 3.33), where I discovered five ways to harvest the variance [mean sum of squares (MSS)]. Equation 6.3 n (Table 22) produces the same result (test SEM = 1.75) when it divides by n [unknown population] rather than n-1 [observed sample].

[n = the item count. Test SEM = AVERAGE(CSEM).]

I then used what I learned in the last post to table the data and obtain the conditional error variance for student scores (Table 23a). The 21 items in Table 22 became the number of right marks at each of 11 item difficulties in Table 23a. The values in this tabulation were then converted into frequencies conditional on the student scores, the sum of which added to one for each score (Table 23b).

The absolute error variance for each score was computed by Excel (=Var.P). Multiplying the absolute error variance (0.14382) by the square of the item count (21^2) yields the relative error variance (63.42). [Equation 5.39 (0.14382) * n^2 = Equation 6.3 (63.42)] The square root of the relative error variance of each score yields the CSEM for that score. [An alternate calculation of the absolute error variance is shaded in Table 23b. Here the variance was calculated first and that value divided by the squared score to obtain the absolute error variance. This helps explain multiplying the absolute error variance by the squared item count to obtain the relative error variance for each score.]

The conditional frequency estimated test SEM was 1.68 (Table 23b). The conditional frequency CSEM values were different for students with the same score. The CSEM values had to be averaged to get results comparable with the other analyses. These values generated an irregular curve, unlike the smooth curve for the other analyses (Chart 61). The conditional frequency CSEM analysis is sensitive to the number of items with the same difficulty (yellow bars alternate for each change in value, Table 23b). The other analyses are not sensitive to item difficulty (yellow bars, in Table 22, include all students with the same score).

Complete curves were generated from Equation 6.3 for n-1 and for GED n-1 (Table 24). The GED n-1 analysis includes a correction factor (cf) for the range of item difficulties on the test [cf = (1 - KR20)/(1 - KR21)]. This factor equals one if all items are of equal difficulty. For the Nursing124 data it was 1.59; the difficulties ranged from 45% to 95%, from the middle of the total possible distribution to one extreme.
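The post does not print Equation 6.3 itself, but its reported values match Lord's binomial error formula, sqrt(x(n - x)/(n - 1)): on a 21-item test its maximum is 2.35, the maximum n-1 CSEM reported in this post. A sketch under that assumption, with the GED correction factor applied as a direct multiplier (which reproduces the reported maximum of 3.73); both the formula identification and the way cf is applied are my reading, not something the post states:

```python
import math

def csem_lord(x, n):
    """Binomial-error CSEM for raw score x on an n-item test.
    Assumed here to be the 'Equation 6.3 n-1' analysis."""
    return math.sqrt(x * (n - x) / (n - 1))

def csem_ged(x, n, cf=1.59):
    """GED n-1 analysis: CSEM scaled by the correction factor
    cf = (1 - KR20)/(1 - KR21); 1.59 is the value reported for
    the Nursing124 data."""
    return cf * csem_lord(x, n)

n = 21
x = 11  # score nearest the middle, where conditional error peaks
print(round(csem_lord(x, n), 2))  # 2.35
print(round(csem_ged(x, n), 2))   # 3.73
```

The formula also shows why the error variance is highest in the center and lowest at the extremes: x(n - x) peaks at x = n/2 and falls to zero at 0 and n.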

The CSEM values from the six analyses are listed in Table 24. Five are fairly close to one another. The GED n-1, with a correction for the range of item difficulties, is far different from the other five (Chart 61). Values could not be created for the full conditional frequency curve, as you must actually have student marks to calculate conditional frequency CSEM values. The gray area shows values calculated from an equation for which there were no actual data. Equations produce nice looking reports that “look right.”

The CSEM improves the reportable precision on this test over using the test SEM. Good judgment (best practice) is to correct the CSEM values as done on the GED n-1 analysis.

[I did not transform the raw test score mean of 16.8 or 79.8% to a scale score of 50% as was done by Setzer, 2009, GED, p. 6 and Tables 2 and 3. The GED n-1 raw score cut point was 60% which is comparable to most classroom tests. If 25% of the score is from luck on test day that leaves 35% for what a student marked right as something known or could be done, as a worst case. If half of the lucky marks were also something the student knew or could do, the split would be about 10% for luck on test day and 50% for student ability.]

In Table 24, the GED n-1 analysis test SEM of 2.98 for the Nursing124 data is, as a range, 2.98/21 or 14.19%. For the uncorrected Equation 6.3 n-1 analysis, 1.79, the range is 1.79/21 or 8.52%. The n SEM was 1.75 or 7.95%. The n SEM range, 1.75, fits within the uncorrected n - 1 test SEM value, 1.79. The corrected GED n-1 test SEM value, 2.98, exceeds it.

Student score CSEM values are even more sensitive than the test SEM values. The maximum range for the GED n-1 analysis is 3.73, or 3.73/21 = 17.76%; for the Equation 6.3 n-1 analysis it is 2.35 or 11.19%. Both are beyond the maximum n CSEM value of 2.29 or 10.41%. This low quality set of data fails to qualify as a means of setting classroom grades or a standardized test cut score.

[However, the classroom rule of 75% for passing the course and the rule of grades set at 10 percentage points overrule these statistics. Here is a good example that test statistics have meaning only in relation to how they are used. If the process of data reduction and reporting is not transparent, the resulting statistics are suspect and can produce extended debates over a passing score in the classroom.]

The CSEM for each student score does improve test precision. It can be calculated in several ways with close agreement. But it cannot improve the quality of the student marks on the answer sheets made under traditional, forced-choice, multiple-choice rules. These tests only rank students by the number of right marks. They do not ask students, or allow students to report, what they really know or can do; their judgment in using what they know or can do.

The CCSS movement is now promoting learning at higher levels of thinking (problem solving) with, from what I have learned, some de-emphasis on the lower levels of thinking that are the foundation for higher levels of thinking. A successful student cycles through all levels of thinking as needed. Yet half of the CCSS testing will be at the lowest levels of thinking: traditional multiple-choice scoring. The other half will be as much of an overkill as traditional multiple-choice is an underkill in assessing student knowledge, skills, and student development to learn and apply their abilities. Others share the concern that centralized politics (and dollars) will continue to overshadow the reality of the classroom.

There is a middle ground that makes every question function at higher levels of thinking, allows students to report what is meaningful, of value, and empowering, and has the speed, low cost, and precision of traditional multiple-choice. Knowledge and Judgment Scoring and partial credit Rasch model IRT are two examples. They both accommodate students functioning at all levels of thinking. Lower ability students do not have to guess their way through a test. With routine use, both can turn passive pupils into self-correcting highly successful achievers in the classroom. If you are really into mastery learning, you can also try something like Knowledge Factor.
