Wednesday, May 29, 2013

Visual Education Statistics - Lower Limits

I discussed the relationships between statistics in Post 11 based on changing individual answer marks. This post will start the exploration of the limits on these statistics based on student scores and item difficulties. How much change is produced by a specific strategy?

These statistics can all be used to describe what has happened in meaningful, useful ways. Here I begin to look at their ability to predict what may happen. This concern has practical value in being able to safely package descriptive and predictive statistics for maximum marketing effect. Or reworded, are testing companies really delivering what they claim?

Chart 32 relates observations from seven exploratory strategies:
  1. A perfect Guttman scaled table in a 21 by 21 cell field.
  2. A 21 by 20 Guttman table missing the item with a difficulty of zero.
  3. A 20 by 20 Guttman table also missing the student score of zero.
  4. A 20 by 20 normal curve distribution based on the SD of the above table.
  5. A 20 by 20 distribution based on a typical classroom, SD = 10.
  6. A 20 by 20 bimodal distribution at 35% and 65% student scores.
  7. A 20 by 20 perfect world distribution for the above table.

The three Guttman scaled tables produced similar results. Removing one or both zero values had little effect. The SD remained around 30% (Chart 32).

The normal curve table was also similar to the 20 by 20 Guttman scaled table but with a reduced SD (25.26%).  I was impressed that a normal curve distribution based on the 20 by 20 table SD had so little effect on the SEM and KR20. It did indicate that a shortened score range lowers the SD.

I then configured a table for a typical classroom SD. The SD = 10 table produced the smallest SEM (4.87%). SD = 9.25%. Clearly, to have a small SEM, you must have a small SD. But you also run the risk of low test reliability (0.72).

I next configured the table for a typical classroom distribution generally seen when using Knowledge and Judgment Scoring (KJS). Routine use of KJS produces and sorts out those students, who have learned to be comfortable using all levels of thinking, from the remaining passive pupils in the class who continue to select traditional multiple-choice (TMC). The bimodal table produced an SEM of 5.98%. SD = 16.38%.

I then compared a bimodal table and a perfect world table (see Post 7 in this series). In a perfect world all students receive the same low score or the same high score. These modes were set at the same 35% and 65% modes on both tables. The perfect world table (Table 18) produced similar results, SEM (5.39%). SD=15.39%.

(Free download of Table 18: or .xls)

I explored the perfect world table further after seeing that the results for the perfect world table and the bimodal table were close at the 35%-65% mode locations. Chart 33 shows the effects of changing the range of the right and wrong modes on a perfect world table. Increasing the range increased the SEM (red) and test reliability (purple). The first effect is bad, the second effect is good. The price for a high test reliability is a reduced ability to tell if two student scores are significantly different.

Post 11 shows linear relationships when changing individual marks, except for item discrimination. This post, working with student scores and item difficulties, shows a linear relationship for item discrimination: AVG PBR of 0.1, 0.2, 0.3, 0.4 and 0.5 (Chart 33). As the two answer modes are moved farther apart, the average PBR and SD increase linearly, but appear curved on the log base 10 scale.
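The perfect world SD can be checked with a few lines of Python. This is a minimal sketch, not the spreadsheet itself: it assumes the 20 students are split evenly between the two modes (10 and 10), which the post does not state, and it uses the N – 1 standard deviation. The function name is hypothetical.

```python
import statistics


def perfect_world_sd(low_mode, high_mode, n_students=20):
    """N - 1 SD of scores (in %) when half the class sits at each mode.

    Assumption: an even split of students between the two modes.
    """
    half = n_students // 2
    scores = [low_mode] * half + [high_mode] * (n_students - half)
    return statistics.stdev(scores)


# Modes at 35% and 65%, as on the perfect world table.
print(round(perfect_world_sd(35, 65), 2))  # 15.39

# Moving the modes farther apart raises the SD in equal steps.
for low, high in [(45, 55), (40, 60), (35, 65), (30, 70)]:
    print(low, high, round(perfect_world_sd(low, high), 2))
```

Under the even-split assumption the SD is just half the distance between the modes times a constant, which is why it grows linearly as the modes move apart.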

Standardized tests tend to have high SDs. The score distributions tend to be flat, multi-modal. This situation is related to high test reliability and high SEM (Chart 33).

A standardized test must do better than this: under the best possible conditions a SEM of about 5% is related to a test reliability of about 0.90 with the modes set at 35%-65% and a SD of 15%. That would take a difference in score from one year to the next of 3 x 5 = 15%, or one and one-half letter grades, to support an acceptable improvement (difference) in student performance, and even that assumes an unattainable test performance.
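The link between SD, test reliability, and SEM in the paragraph above can be sketched with the usual classical formula, SEM = SD x sqrt(1 – reliability). That the spreadsheet uses exactly this formula is an assumption here; the function name is hypothetical.

```python
import math


def sem(sd, reliability):
    """Standard error of measurement from SD and test reliability.

    Assumption: the textbook formula SEM = SD * sqrt(1 - reliability).
    """
    return sd * math.sqrt(1.0 - reliability)


# Best case described above: SD of 15% and a reliability of 0.90.
best_case = sem(15.0, 0.90)
print(round(best_case, 2))  # 4.74, i.e. "about 5%"

# Three SEMs (rounded to 5%) as the difference needed between two scores.
print(3 * 5)  # 15
```

The formula also shows the trade-off directly: for a fixed reliability, the only way to shrink the SEM is to shrink the SD.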

The above tables were populated with items drawn from a Guttman scaled table with all items set at their maximum item discrimination. The results then represent the best obtainable, the maximum limit for a 20 by 20 table. We need more students and more test items.

- - - - - - - - - - - - - - - - - - - - - 

Free software to help you and your students experience and understand how to break out of traditional-multiple choice (TMC) and into Knowledge and Judgment Scoring (KJS) (tricycle to bicycle):

Wednesday, May 22, 2013

Visual Education Statistics - Basic Relationships

The first ten posts in this series developed a visual education statistics (VES) engine that relates six statistics on one Excel spreadsheet. This post explores their relationships by switching right and wrong marks (1 and 0) in matched pairs and in unmatched single switches at increasing distances from the diagonal equator.

A Guttman table is an extreme distribution with each student receiving a different score. Each item also has a different difficulty. Item discrimination is set at the maximum. There is only one possible distribution for this 21 student by 20 item test (Table 17). (The Excel .xlsm or .xls version is available from

The squared student score deviations are at zero at the test score mean and at a maximum (100) at the extremes. The opposite is the case for item sums of squares (SS) with a maximum of 5.24 at the mean of 10.5 and a minimum of 0.95 at the extremes. This makes sense as there is greater variation between student score extremes and less within item difficulty extremes (Table 17).
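The figures quoted for the perfect Guttman table can be reproduced outside Excel. The sketch below builds a 21 student by 20 item table and computes the sums of squares, the N – 1 SD, KR20, and the SEM. The layout of the table (student s gets the s easiest items right) and all function names are assumptions made for illustration, not a copy of the spreadsheet.

```python
import math


def guttman_table(n_students=21, n_items=20):
    # Row s holds s right marks (1) followed by wrong marks (0),
    # so student scores run 0, 1, ..., 20 for the default size.
    return [[1 if item < s else 0 for item in range(n_items)]
            for s in range(n_students)]


def squared_deviations(values):
    mean = sum(values) / len(values)
    return [(v - mean) ** 2 for v in values]


def summarize(table):
    n_students, n_items = len(table), len(table[0])
    scores = [sum(row) for row in table]
    score_sq_dev = squared_deviations(scores)
    # One sum of squares per item column.
    item_ss = [sum(squared_deviations([row[j] for row in table]))
               for j in range(n_items)]
    score_ss = sum(score_sq_dev)
    sd = math.sqrt(score_ss / (n_students - 1))           # N - 1 SD
    # KR20 built from N-based variances of scores and items.
    ratio = (sum(item_ss) / n_students) / (score_ss / n_students)
    kr20 = n_items / (n_items - 1) * (1 - ratio)
    sem = sd * math.sqrt(1 - kr20)
    return {
        "max_score_sq_dev": max(score_sq_dev),
        "max_item_ss": max(item_ss),
        "min_item_ss": min(item_ss),
        "sd": sd,
        "kr20": kr20,
        "sem": sem,
    }


result = summarize(guttman_table())
for name, value in result.items():
    print(name, round(value, 3))
# max_score_sq_dev 100.0, max_item_ss 5.238, min_item_ss 0.952,
# sd 6.205, kr20 0.952, sem 1.354
```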

The standard deviation (SD) of student scores decreased (6.205 to 6.050) as matched pair switching progressed from the mean to the extreme in a linear manner (Chart 28). This makes sense as the student score deviations normally increase at the extremes. Switching marks reduced these extremes.

Test reliability also fell as matched pair switching progressed from the mean to the extreme in a linear manner (Chart 29). This makes sense as the student score N MEAN SS decreased as the switching progressed from the mean to the extreme (36.381 to 34.857 or 1.524) and as the item N MEAN SS only decreased (-3.492 to -3.574 or 0.082).

The standard error of measurement (SEM) increased linearly (1.354 to 1.423) as the switching progressed from the mean to the extreme (Chart 30). This too makes sense as a decrease in test reliability is related to an increase in the SEM.

Item discrimination (KR20 and Pearson r) decreased in a non-linear manner (Chart 31) as the switching progressed from the mean to the extreme (from 0.676 to 0.637). This also makes sense as the greater the change from a perfect Guttman table, the lower the item discrimination. Switched marks that are the farthest from the diagonal equator are the most unexpected marks.

A second scan of the Guttman table with an unbalanced single switch of right and wrong produced the same relationships as the balanced switch scan. The spreadsheet (Table 16) needed to be set to three decimal places to capture the detail with a minimum of rounding errors (Table 17).

The VES engine is showing three linear relationships (SD, test reliability, and SEM) and one nonlinear relationship (item discrimination). Just one switch of 1 to 0 or 0 to 1 can be detected in all four statistics. I find it interesting that such detail can be captured from a 21 x 20 table.
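A single switch can be tried in a few lines. This sketch flips one mark in a perfect Guttman table and recomputes the SD, KR20, and SEM. Which cell to flip is a choice made here for illustration (the top student missing the easiest item), not the scan sequence used in the post, and the helper names are hypothetical.

```python
import math


def stats(table):
    """Return (N - 1 SD, KR20, SEM) for a table of 1/0 marks."""
    n, k = len(table), len(table[0])
    scores = [sum(row) for row in table]
    mean = sum(scores) / n
    score_ss = sum((s - mean) ** 2 for s in scores)
    # Item difficulty p (proportion right) for each column.
    p = [sum(row[j] for row in table) / n for j in range(k)]
    sum_pq = sum(pj * (1 - pj) for pj in p)
    sd = math.sqrt(score_ss / (n - 1))
    kr20 = k / (k - 1) * (1 - sum_pq / (score_ss / n))
    sem = sd * math.sqrt(1 - kr20)
    return sd, kr20, sem


def flip(table, student, item):
    """Copy the table and switch one mark (1 to 0 or 0 to 1)."""
    copy = [row[:] for row in table]
    copy[student][item] = 1 - copy[student][item]
    return copy


# Perfect 21 x 20 Guttman table: student s has the first s items right.
guttman = [[1 if j < s else 0 for j in range(20)] for s in range(21)]

before = stats(guttman)
after = stats(flip(guttman, student=20, item=0))

for name, b, a in zip(("SD", "KR20", "SEM"), before, after):
    print(name, round(b, 3), "->", round(a, 3))
```

For this particular switch the SD and KR20 fall while the SEM rises, the same directions described above.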

- - - - - - - - - - - - - - - - - - - - - 


Wednesday, May 15, 2013

Visual Education Statistics - Visual Education Statistics Engine


The Visual Education Statistics Engine (VESEngine) contains all six of the commonly used education statistics (Table 15).

The relationship between the first five seems clear. Item discrimination, the sixth statistic in the series, needs a bit more work.

The six visual education statistics in the VESEngine (Table 15):
The Visual Education Statistics Engine
1.     Count
The number of right marks for each student is listed under RT; the number of right marks for each item by RIGHT.
2.     Average
The average student score is listed under SCORE MEAN; the average of right marks for each item by MEAN.
3.     Standard Deviation
The standard deviation (SD) for student scores is listed under BETWEEN ROW OR STUDENT as N SD and N – 1 SD for large and small samples.
4.     Test Reliability
The N – 1 test reliability is listed for KR20 and Cronbach’s alpha.  The N sources for the calculation are color coded. Select an ITEM # and then click the TR Toggle button to view the effect of removing an item from the test.
5.     Standard Error of Measurement (SEM)
The SEM calculation is listed with the N – 1 sources color coded. This ends the sequence of calculations dependent upon the previous statistic.
6.     Item Discrimination
Click the Pr Toggle button to view the UNCORRECT and CORRECT N – 1 item discrimination values.

The VESEngine is now ready to explore a number of things and relationships. The goal is to make traditional multiple-choice measurements more meaningful and useful. You can start by changing single marks or pairs of marks. The engine will do the work of recalculating the entire table except for item discrimination; that requires clicking the Pr Toggle button.

I have been concerned with how the calculations were made as much as why they were being made. This series needs to end with consideration of what meaning is assigned to the calculations.  The six statistics present three different views:
  1. Numbers You can Count (Descriptive)
  2. A Combination of Count and Prediction
  3. Predictive Ratios without Dimensions

I loaded a perfect Guttman table into VESEngine and renamed it VESEngineG (Table 16).

Download free from or .xls (Table 15).
Download free from or .xls (Table 16).

I compared the item analysis results from Nursing124 and a perfect Guttman table to get an idea of what the VESEngine could do.
[Table comparing Nursing124 (22x21) and the Guttman Table (21x20): Student Scores, Test Reliability, Item Discrimination (Corrected), Standard Deviation (N – 1), Standard Error of Measurement.]

The data sets represent two different types of classes. The Nursing124 data are from a class preparing for state licensure exams (80% average class score). Mastery is the only level of learning that matters. The Guttman table is both theoretical and near to the design used on standardized tests (50% average score). These average scores are descriptive statistics.

The two predictive statistics, test reliability and item discrimination, values are markedly different for the two tests. The Guttman table yielded a test reliability of 0.95 that puts it into a standardized test ranking. It did this with an average item discrimination ability of only 0.52. The Nursing124 data resulted in an item discrimination ability of only 0.09. Both of these values are corrected values. The value of 0.09 is just below the limit for detecting item discrimination (0.10) and is confirmed by the ANOVA F test as just below the limit for being different from (the many classroom and testing aspects of) chance. This makes sense.

[Power Up Plus (PUP) printed out a value of 0.26 for the average item discrimination. This is the uncorrected value for the Nursing124 data. This is the only error I found in PUP: The average item discrimination was not updated when the routine for correcting the item discrimination was added.]

The Nursing124 data Standard Deviation (2.07 or 9.86%) is much smaller than the SD (6.20 or 31.00%) for the Guttman table. This makes sense. The mastery data have a much smaller range than the Guttman table data. What is most interesting is that in spite of the larger SD range for the Guttman table data, it resulted in a smaller SEM (1.35 or 6.77%) than the Nursing124 mastery data (1.74 or 8.31%).

Even though the Guttman table data have a SD 3 times that of the Nursing124 data, by having an item discrimination over 5 times the Nursing124 data, they produced a Standard Error of Measurement a bit less than the Nursing124 data. This interaction makes more sense when visualized (Chart 26). The similarity of the SEMs indicates that widely differing tests can yield comparable results. 

Item discrimination has been improved over the years. With paper and pencil, the Pearson r was difficult enough. Computers enable calculations that remove the right mark on the item in hand from the related student score before calculating each item’s discrimination ability. No correction is needed. The difference in uncorrected past and corrected current results is striking (Chart 27). Also see the previous post on item discrimination.
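The difference between the uncorrected and corrected values can be sketched as follows. The corrected value removes the item's own mark from each student score before the Pearson r is calculated. The table layout and function names are assumptions for illustration, and the sketch is not expected to reproduce the exact values printed by the VESEngine or PUP.

```python
import math


def pearson_r(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sum_products = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return sum_products / math.sqrt(ss_x * ss_y)


def item_discrimination(table, item, corrected=True):
    """Pearson r between one item column and the student scores.

    corrected=True removes the item's own mark from each score first.
    """
    marks = [row[item] for row in table]
    scores = [sum(row) for row in table]
    if corrected:
        scores = [s - m for s, m in zip(scores, marks)]
    return pearson_r(marks, scores)


# Perfect 21 x 20 Guttman table: student s has the first s items right.
guttman = [[1 if j < s else 0 for j in range(20)] for s in range(21)]

for item in (0, 10, 19):
    uncorrected = item_discrimination(guttman, item, corrected=False)
    corrected = item_discrimination(guttman, item, corrected=True)
    print(item, round(uncorrected, 3), round(corrected, 3))
```

Leaving the item's mark in the score makes the item partly correlate with itself, so the uncorrected value is the larger of the two for every item in this table.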

The literature often mentions that the best standardized test is one with many items near the cut score in difficulty and with a few widely scattered in difficulty. At this time I can see that the widely scattered items are needed to produce the desired range of scores. Many items near the cut score produce a lower SD and a lower SEM. You can use the VESEngine to explore different distributions of item difficulty and student ability.

Is there an optimum relationship in an imperfect world? Or will the safe way to proceed with standardized tests remain: 1. Administer the test; 2. View the preliminary results; and 3. Adjust to the desired final result? IMHO, this method in no way reduces the importance of highly skilled test makers working from predictions based on field tests or trial items included in operational tests.


[The VESEngine has two control buttons that function independently. The Pearson r Button refreshes item discrimination. The test reliability button (TR Toggle) removes a selected item from the test and then restores it on the second click.

Set a smaller matrix by removing excess cells with Remove Contents, as shown on the perfect Guttman table (Table 16) where the rightmost column and lowest row have been cleared of contents. The student score mean and item difficulty mean (blue) were then reset from 22 and 21 to 21 and 20.

Create a larger matrix by inserting rows within the table (not at the top or bottom). Insert columns at column S or 19. Then drag the adjacent active cells to complete the marginal cells. Finally edit the two button TableX and TableY values in Macro1 and Macro2 to match the overall size of your table.

Please check your first results with care as I have found it very easy to confound results with typos and with unexpected changes in selected ranges, especially when copying and enlarging the VESEngine.]

- - - - - - - - - - - - - - - - - - - - - 


Wednesday, May 8, 2013

Visual Education Statistics - Item Discrimination Engine

Statistic Six: Item discrimination, the last statistic in this series of posts, captures the ability of an item to group students by what they know (and by what they have yet to learn, with Knowledge and Judgment Scoring or partial credit Rasch model scoring). Previous posts have indicated that this ability may be primary in selecting items for standardized tests. It is also important in the classroom. Discriminating items produce the spread of scores needed for setting grades in schools designed for failure.

I left this statistic to last as it is a bit different from the others. It is more complex and difficult to calculate. However, the standard error of measurement (SEM) engine, post 8, only needed one more step to have the numbers in hand to calculate the Pearson r estimate of item discrimination.

Pearson worked out his item discrimination in a manner that follows the previous posts. He did this by 1895, long before we had personal computers. As a consequence we now have two versions, called the original uncorrected estimate (Excel Pearson function) and the corrected estimate. There is also a shortcut for traditional multiple-choice (TMC) tests: the point biserial r (PBR) I consider at the end of this post.

A visual presentation of the Pearson item discrimination calculation follows (see Table 11 for the calculations).

First, the marks in the Item 4 column on the Guttman table (Table 12) are counted (10), the average obtained (0.45 out of 22), and the deviations from the mean obtained (Chart 20).  

The same process is carried out on the student score columns (RT of 369 and SCORE MEAN of 16.77 out of 22, see Chart 21).

When each of these two charts is summed, it adds to zero. This time the individual values are not squared to make them all positive as in Charts 22 (scores) and 23 (items). Instead the related item and score deviations are multiplied to produce positive and negative values (Chart 24 and Table 11) that sum to 13.27.

The item discrimination is then a ratio between two sums of squares (SS). This operation is carried out for each item on the test:

Multiplying the two SSs in the denominator (after taking their square roots) changes negative values to positive values and yields a grand SS (2.34 x 9.49 = 22.21). The resulting ratio is the discrimination ability of the item. It can range from a minus one to a positive one. Values above 0.9 are characteristic of standardized tests. Values for classroom tests will be discussed later.
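The ratio described above can be written out as a short function: the sum of the item and score deviation products, divided by the product of the square roots of the two sums of squares. This is a minimal sketch with hypothetical names; the small data set at the end is made up for illustration.

```python
import math


def item_discrimination(item_marks, student_scores):
    """Pearson r between one item's 1/0 marks and the student scores."""
    n = len(student_scores)
    item_mean = sum(item_marks) / n
    score_mean = sum(student_scores) / n
    item_dev = [m - item_mean for m in item_marks]
    score_dev = [s - score_mean for s in student_scores]
    # Numerator: deviations multiplied pair by pair, then summed.
    sum_products = sum(a * b for a, b in zip(item_dev, score_dev))
    # Denominator: square roots of the two sums of squares, multiplied.
    root_ss_item = math.sqrt(sum(a * a for a in item_dev))
    root_ss_score = math.sqrt(sum(b * b for b in score_dev))
    return sum_products / (root_ss_item * root_ss_score)


# The sums reported above for Item 4.
print(round(13.27 / (2.34 * 9.49), 3))  # 0.598

# A made-up four-student example: the two highest scores marked right.
print(round(item_discrimination([1, 1, 0, 0], [4, 3, 2, 1]), 4))  # 0.8944
```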

Table 12 contains an Item Discrimination Engine you can use to explore the discrimination ability of individual items. [Download free from or .xls]

The point biserial r (PBR) provides an additional glimpse into what is taking place (Table 13). The difference between the average right marks and wrong marks (18.1 – 15.67 = 2.43) is standardized by dividing by the standard deviation (2.43/2.07 = 1.176). Multiplying the difference between right and wrong mark means in standard units (1.176) by the square root of the product of the proportions (p and q) of right and wrong marks, Sqrt(0.45 x 0.55) = Sqrt(0.2475) = 0.4975, yields the PBR item discrimination of 0.59.
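The PBR shortcut can be sketched as a function. One choice made here is an assumption: the sketch divides by the N (population) standard deviation, because with that choice the PBR is exactly equal to the Pearson r for 1/0 marks. The function name and the small data set are hypothetical.

```python
import math


def point_biserial(item_marks, student_scores):
    """PBR = (mean right - mean wrong) / SD * sqrt(p * q).

    Assumption: SD is the N (population) standard deviation of scores.
    """
    n = len(student_scores)
    right = [s for m, s in zip(item_marks, student_scores) if m == 1]
    wrong = [s for m, s in zip(item_marks, student_scores) if m == 0]
    mean_all = sum(student_scores) / n
    sd = math.sqrt(sum((s - mean_all) ** 2 for s in student_scores) / n)
    p = len(right) / n          # proportion of right marks
    q = 1 - p                   # proportion of wrong marks
    mean_diff = sum(right) / len(right) - sum(wrong) / len(wrong)
    return mean_diff / sd * math.sqrt(p * q)


# Made-up example: four students, the two highest scores marked right.
print(round(point_biserial([1, 1, 0, 0], [4, 3, 2, 1]), 4))  # 0.8944
```

The function needs at least one right and one wrong mark on the item; an item everyone gets right (or wrong) has no discrimination to measure.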

The real value or meaning of an item discrimination rank seems to be a matter of tradition and advances in computing power. PUP 5.20 prints out corrected item discrimination values that I gave the following rankings for my classroom tests:

[The PBR only works for traditional multiple-choice, which only ranks students. PUP contains the Pearson r that is required for Knowledge and Judgment Scoring, an actual assessment of what students know and can do, that is meaningful and useful in future assignments.]

Item discrimination weights each right and wrong mark with the related student score. Different column mark patterns produce different results. Unlike test reliability, when calculating item discrimination the order, or pattern, of marks is important. Items of the same difficulty can have very different discrimination ability, for example, items 11, 14, 15, 16 and 18 with a difficulty of 91% and a range of item discrimination of -0.02 to 0.58 (Chart 25).
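The point that the pattern of marks matters can be shown with two made-up items of equal difficulty. The scores and mark patterns below are hypothetical, chosen only to make the contrast clear; they are not taken from the Nursing124 data.

```python
import math


def pearson_r(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sum_products = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return sum_products / math.sqrt(ss_x * ss_y)


# Hypothetical student scores, highest to lowest.
scores = [9, 8, 7, 6, 5, 4, 3, 2]

# Both items have four right marks out of eight (difficulty 50%).
item_a = [1, 1, 1, 1, 0, 0, 0, 0]   # right marks go to the top scores
item_b = [1, 0, 0, 1, 0, 1, 1, 0]   # right marks scattered across scores

for name, item in (("A", item_a), ("B", item_b)):
    difficulty = sum(item) / len(item)
    print(name, difficulty, round(pearson_r(item, scores), 3))
```

Item A discriminates strongly (about 0.87) while item B does not discriminate at all, even though both have the same difficulty.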

Selecting difficult items is not sufficient to maximize test reliability. The primary need is to write discriminating items. The Nursing124 data delivered discriminating items at all levels of difficulty from 45% to 91% (Chart 25).

The item discrimination results seemed to me to be as unpredictable as test reliability results. IMHO only a visual education statistics engine that combines all six statistics can readily display the interactions.

The standard error of student score measurement (SEM), the test reliability (KR20, and alpha), and the item discrimination (Pearson and PBR) have unpredictable interactions. The Test Performance Profile from PUP 5.20 brings these together in one table for easy use in the classroom by students and teachers (and other interested persons) but lacks the flexibility of a single sheet spreadsheet engine.

[PUP 5.20 only prints the PBR ranks as an efficient aid for teachers. An additional aid is provided by sorting the discriminating items on PUP 5.20, sheet 3a. Student Counseling Mark Matrix with Mastery/Easy, Unfinished, and Discriminating (MUD) Analysis.]

- - - - - - - - - - - - - - - - - - - - - 


Wednesday, May 1, 2013

Visual Education Statistics - Teacher Effectiveness

The information that needed to be related in post 7 became too long for one post. Post 7 contains the SEMEngine: all five of the related statistics on one spreadsheet. This post relates a collection of material that gives those statistics additional meaning, a bit of the understanding needed to use them properly.

The SEMEngine, in the previous post, can produce the unpredictable statistics relevant to classroom tests and standardized tests. But a full understanding of these statistics requires a discussion of a second standard error and the two methods of scoring multiple-choice (traditional multiple-choice, TMC, and Knowledge and Judgment Scoring, KJS); partial and full disclosure of what a student knows and can do.

The standard deviation (SD) of the group test score and the standard error of measurement (SEM) of the average student test score provide guidance in constructing standardized tests as predictive inputs. These statistics are also helpful in describing classroom test results. The first refers to test results from the class or group taking the test, the average group score; the second, to the average student score in the class. They are two different perspectives of the same average score. They have different uses.

There is a second standard error, the standard error of the mean (SE), that permits comparison between group test scores. [I am belaboring this topic as the two standard errors (of the mean and of measurement), the abbreviation (SEM), and even the SD can get confused (Standard error and Standard error vs. Standard error of measurement).]

Chart 18 shows how the SEM of the average student score is reduced as more equivalent items are added to the Cantrell data of 14 items. A 50 item test is expected to yield a SEM of 5.15%. This is less than 1/3 the range of the SD. But even this would require an improvement of 3 x 5.15 = 15.45% for a significant increase in performance from one year or one test to the next. That is 1.5 times a traditional letter grade. To my knowledge, very few standardized tests use 50 items in any topic or skill area.
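How the SEM shrinks as equivalent items are added can be sketched with the Spearman-Brown formula from classical test theory. Whether Chart 18 was built this way is an assumption, and the starting SD and reliability below are hypothetical placeholders, not the Cantrell values.

```python
import math


def lengthened_sem_percent(sd_pct, reliability, n_items, new_n_items):
    """SEM (in % of the test) after adding equivalent items.

    Assumption: items are parallel, so Spearman-Brown applies.
    """
    k = new_n_items / n_items
    # Spearman-Brown: reliability of a test k times as long.
    new_reliability = k * reliability / (1 + (k - 1) * reliability)
    # SD of the longer test, expressed as a percent of its length.
    new_sd_pct = sd_pct * math.sqrt((1 + (k - 1) * reliability) / k)
    return new_sd_pct * math.sqrt(1 - new_reliability)


# Hypothetical 14 item test: SD of 20% and reliability of 0.50.
for n in (14, 28, 56):
    print(n, round(lengthened_sem_percent(20.0, 0.50, 14, n), 2))
```

Under these assumptions the percent SEM falls with the square root of the test length, so a test four times as long halves the SEM.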

Chart 19 shows how the SE of the classroom or group test score is reduced as more equivalent items are added to the Cantrell data.  The SE has a finer resolution than the SEM. An improvement in class performance on a 50 item test, 3 x 2.57 = 7.71% would require only about a 3/4 letter grade to show a significant difference in the two test scores from two different classes or one class at two different times. This shows that it is easier to show a significant difference between the average scores from two tests than it is between two scores from the same student.
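The two "3 x standard error" thresholds can be compared side by side. This sketch uses the 50 item values quoted above and the traditional 10% per letter grade; the usual definition of the standard error of the mean, SD divided by the square root of the number of scores, is included for reference. Function names are hypothetical.

```python
import math

LETTER_GRADE_WIDTH = 10.0  # percent per traditional letter grade


def standard_error_of_mean(sd, n_scores):
    """SE of a group mean: SD divided by the square root of the count."""
    return sd / math.sqrt(n_scores)


def required_difference(standard_error, multiplier=3):
    """Score difference needed to call two results different."""
    return multiplier * standard_error


# 50 item values quoted above: SEM of 5.15% and SE of 2.57%.
for label, se in (("SEM, one student", 5.15), ("SE, class average", 2.57)):
    diff = required_difference(se)
    print(label, round(diff, 2), "letter grades:", diff / LETTER_GRADE_WIDTH)
```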

[The above can be generalized to support the traditional score range of 10% per letter grade.]

I retitled this post as “Teacher Effectiveness” after looking at the above two charts (18 and 19). These statistics provide a means of measuring teacher effectiveness; or at least ranking teacher effectiveness. To measure teacher effectiveness, the portion of students electing TMC or KJS on the test would also have to be included. 

[A class selecting mostly TMC is in a lower level of thinking classroom environment populated with passive pupils conditioned to mark an answer to every item. A class selecting mostly KJS is in a higher (all) level of thinking classroom environment populated with self-motivated, self-correcting high quality achievers who are mature enough to distinguish between what they have yet to learn and what they know and can do that can serve as the basis for further learning and instruction.]

Student development is as important as knowledge or skill. The CCSS movement promotes this idea too but without the simplicity of multiple-choice (in time and money).

These visualized statistical models of the real world have been found to have practical value in making predictions (a most expected mid-point on a range of possibilities). However, what we feed into these statistics determines the validity and usefulness of the results. The concrete reality that you got a score of 50% on a classroom test becomes transformed into an abstract prediction that, +- 1 SD, that score (and your next score on an equivalent test) just might have been anywhere between 30% and 70% on an equivalent standardized test. And further, using the SEM, the range may be reduced to between 45% and 55% (generalized from Table 18).

Test scores (and these first five reviewed statistics) are easily manipulated by the selection of questions on the test and how the test is scored. The traditional multiple-choice test (forced-choice test) is a game with a built-in handicap of over 20%. This manipulation of scores is so traditional (so hardened to change) that little thought is given to it with the exception of when elementary school students take their first multiple-choice tests.

Learning to lie is difficult for serious students; they know a best guess is not a reflection of their abilities. It is just sugar coating and a distraction from the ugly truth. Students with equal abilities, but receiving lower test scores, rightly feel cheated by their poor luck on test day. In time, these students just mark, finish the test, and then get back to their world where they do have some control.  Since there is no way of knowing if a right mark is a right answer or a lucky answer, there is no need to take the test seriously except for where their score falls in the class distribution (their rank).

[This practice is institutionalized when their class rank is provided in college admission documents.]

The traditional multiple-choice test (TMC) is fast, cheap, and marketed way beyond its valid ability to rank students IMHO. It is, as my students put it, Dumb testing. The statistics are not an accurate, honest and fair reflection of their individual abilities.

TMC IMHO drives students away from developing into self-motivated, self-correcting, high quality achievers.  Statistics will not change the outcome. There is a better (alternative) method of multiple-choice assessment, KJS, at no additional cost that will guide their development. An effective teacher motivates students to be ready to learn and to want to learn. 

A multiple-choice test can be used to permit students to report what they actually know, understand, and find useful as the basis for further learning and instruction. All that is required is an extraction of student judgment (something that is considered an essential part of almost all alternative and authentic assessments and soon the elaborate CCSS assessments). Please check out Smart testing: Knowledge and Judgment Scoring, partial credit Rasch model, and Confidence Based Assessment, for example. All three promote student development that yields high test scores, long term, and with a minimum of review.

- - - - - - - - - - - - - - - - - - - - - 
