Wednesday, April 24, 2013

Visual Education Statistics - Standard Error of Measurement Engine

We have now gone from the real world of counting and test scores through three stages (average, standard deviation, and test reliability) of calculating and relating averages. The next step is to return, as well as these abstract statistics allow, to the real world. The problem is that they see, or represent, the real world only as portions of the normal curve of error (Standard error and Standard error vs. Standard error of measurement). They no longer see individual test scores.

If a student could take the same test several times, the scores would form a distribution with a mean and standard deviation (SD). That mean would be a best estimate of the student’s true score on that test. The SD would indicate the range of expected measurements. Some 2/3 of the time the next test score is expected to fall within one SD of the mean.
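This idea can be sketched with a small simulation (Python, not part of the original engines) of one student retaking the same test many times. The 21 items and the constant 0.8 chance of marking each item right are assumed values, not the Nurse124 data:

```python
import random

# One student retakes the same 21-item test 10,000 times; the constant
# 0.8 chance of a right mark is an assumed value, not the Nurse124 data.
random.seed(0)
n_items, p_true, n_retakes = 21, 0.8, 10000
scores = [sum(random.random() < p_true for _ in range(n_items))
          for _ in range(n_retakes)]

mean = sum(scores) / n_retakes
sd = (sum((s - mean) ** 2 for s in scores) / n_retakes) ** 0.5
within_one_sd = sum(mean - sd <= s <= mean + sd for s in scores) / n_retakes

# The mean estimates the true score; roughly 2/3 of retakes land within one SD.
print(round(mean, 1), round(sd, 2), round(within_one_sd, 2))
```

The simulated mean plays the role of the true score, and the share of retakes falling within one SD comes out near the expected 2/3.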

[This makes the same sense as a person going to a baseball game to watch a batter who has averaged a hit 50% of the time in his last 20 games. That is a descriptive use of the statistic (0.500). If the person bets the batter will do the same in the current game; that is a predictive use of the same statistic (0.500). “Past performance is no guarantee of future performance.” The SD gives us a “ballpark” idea of what may happen. The following statistic promises a better idea.]

The standard error of measurement (SEM), statistic five in this series, estimates an average student score range centered on the class mean. It is not the specific location of the next expected test score, and it is not a range tailored to each student. It is tailored to the average student in the class.

The accuracy and precision of this estimate are important. The range of the SEM sets the limit on how large an increase in test score is needed, from one year to the next, to be significant. The smaller the range, the finer the resolution.

Students, of course, do not retake the same test many times to generate the needed scores for averaging. The best the psychometricians can do is to estimate the SEM using the SD of student scores (2.07) and the test reliability (0.29) in Table 10.

SEM = SQRT(MSrow) * SQRT(1 – KR20)
SEM = SQRT(4.28)*SQRT(1 – 0.29)
SEM = 2.07 * SQRT(0.71)
SEM = 2.07 * 0.84 = 1.75 or 8.31%

A portion of the average (N – 1) student score Variance (MSrow, 4.28, in the far right column of Table 10; SD 2.07) is used to estimate the SEM. The portion is determined by the test reliability, KR20 (0.29). The SEM for the Nurse124 data (1.75) can also be expressed as 8.31% (Table 10).
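The arithmetic above can be checked in a few lines; this Python sketch uses the Table 10 values quoted in the text:

```python
import math

# Table 10 values quoted in the text for the Nurse124 data.
ms_row = 4.28    # student score variance (N - 1)
kr20 = 0.29      # test reliability (KR20)
n_items = 21

sd_row = math.sqrt(ms_row)            # about 2.07
sem = sd_row * math.sqrt(1 - kr20)    # SEM in item counts
sem_percent = 100 * sem / n_items     # SEM as a percent of test length

print(round(sd_row, 2), round(sem, 2), round(sem_percent, 1))
```

The result lands within rounding of the text's 1.75 (8.31%).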

The SD for the student score mean (0.799 * 100 = 79.87%) was 9.85% (2.07/21). With a test reliability of only 0.29, the SEM (8.31%) is only a little smaller than the score SD (9.85%).

Charts 15 and 16 show what the above actually looks like in the normal world.  Chart 15 is for a set of data (Nursing124) that is just below the boundary of the 5% level of significance by the ANOVA F test. The F test was 1.31 and the critical value was 1.62. The SEM curve (1.75 or 8.31%) is close to the normal SD of the average test score (2.07 or 9.85%). The SEM is only a 15.63% reduction from the SD.

Chart 16 is for a set of data (Cantrell) that is well above the boundary of the 5% level of significance by the ANOVA F test. The F test was 3.28 with a critical value of 2.04. The SEM curve (1.21 or 8.68%) is much narrower than the normal SD of the average test score (2.55 or 18.19%). The SEM is a 52.55% reduction from the SD.

This makes sense. It follows that the higher the test reliability, the lower (the shorter the range of) the SEM on a normal scale. Do these statistics really mean this? Most psychometricians believe they do.

 [Descriptive statistics are used in the classroom on each test. Rarely are specific predictions made. Standardized tests are marketed by their test reliability and SEM. This is, IMHO, the same change as going from amateur to professional in sports. It is no longer about how you play the game and having fun, but about winning. Every possible observation is subject to examination.]

I again made use of the process of deleting and restoring one item at a time to take a peek at how these statistics interact. [SEMEngine, Table 10, is hosted free at and .xls.]

The SEM (red) is a much more stable statistic than the test reliability (blue dot) across a range of student scores from 55% to 95% (Chart 17). Two scales are involved: a ratio scale of 0 to 1 and a normal scale of right counts. The lowest trace (a ratio) on Chart 17 is inverted (second trace) and then multiplied by the top trace (SDrow in counts) to yield the SEM in counts.

Even more striking than the stability of the SEM (red) are the parallel traces of the student score standard deviation (SDrow, green dot) and the test reliability (KR20, blue dot). This makes sense. When the student scores spread out, the Variance (MSrow) also increases, which increases the test reliability (KR20). I was surprised to see the two so tightly related.

Chart 17 also includes SQRT(1 - KR20) (blue triangle). This turns the KR20 (blue dot) upside down: the higher the test reliability, the smaller this factor. The stable SEM (red) then results from multiplying this factor by the student score SDrow (green dot). This makes sense. Because the KR20 and the SDrow rise and fall together, multiplying the SDrow by a factor that shrinks as the KR20 grows largely cancels the movement and leaves a stable SEM.
[In designing the forerunners of PUP, I discarded stable statistics as IMHO they seemed to be of little descriptive value in the classroom. That is not true for standardized tests where the goal is to use the shortest test possible composed of discriminating items (no mastery or unfinished items).]

The SEM engine now contains the first five of the six statistics commonly used in education. In the next post I will explore the relationship between the SEM of individual student scores and the SE of the mean of the class score, the average test score. These have little meaning in the classroom but are IMHO very important in understanding standardized testing.

[To use the Test Reliability and Standard Error of Measurement Engine for other combinations than a 22 by 21 table requires adjusting the central cell field and the values of N for student and item. Then drag active cells over any new similar cells when you enlarge the cell field. You may need to do additional tweaking. The percent Student Score Mean and Item Difficulty Mean must be identical.

To reduce the cell field, use “Clear Contents” on the excess columns and rows on the right and lower sides of the cell field. Include the six cells that calculate SS that are below items and to the right of student scores. Then manually reset the number of students and items. You may need additional tweaking. The percent Student Score Mean and Item Difficulty Mean must be identical.]

A password can be used to prevent unwanted changes to the SEMEngine.

- - - - - - - - - - - - - - - - - - - - - 

Free software to help you and your students experience and understand how to break out of traditional multiple-choice (TMC) and into Knowledge and Judgment Scoring (KJS) (tricycle to bicycle):

Wednesday, April 17, 2013

Visual Education Statistics - Test Reliability Engine


I used Table 8 (test reliability) as the foundation for the test reliability engine (Table 9).  The whole point of doing so was to provide a means of seeing the interactions when marks (Item scores of 1 and 0) are changed in a row or a column.

I removed the six leftmost columns from Table 8, as they are not needed after verifying the ANOVA table data in the previous post. The ANOVA Between Row and Count values (yellow) are converted from the normal Between Row and Count values.

The first thing I noticed was that rounding errors are no longer a problem with everything on one Excel worksheet. The results on Table 9 have been edited into prior posts.

Table 9 consists of the mark scores (1’s and 0’s) in a central cell field (22 students by 21 items). With the exception of the conversion from normal values to ANOVA values based on the Grand Mean (0.799), all other values are the same as on Table 8.

Test reliability is calculated with the KR20 and Cronbach’s alpha (0.29) as shown on Table 6. Table 9 contains an explained ANOVA table for between rows (student scores).

The second thing I learned was that sorting 1’s and 0’s in item columns so that all 1’s were at the top of the column and all 0’s were at the bottom produced a marked change in test reliability. This did not change item difficulty.

Any item with all 1’s in one group and all 0’s in another is set for maximum discrimination. Increasing discrimination increases test reliability because increasing discrimination increases the variation within student scores.

This makes sense. A test that accurately groups those who know and those who do not know is more reliable than one in which the marks scored 1 and 0 are mixed in a Guttman table.
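As a rough check on this observation, the following Python sketch builds a random 22 by 21 mark matrix (a stand-in, not the Nurse124 marks), sorts each item column so all 1's are at the top, and recomputes the KR20. The difficulty of each item is unchanged, but the reliability rises:

```python
import random

def kr20(marks):
    """KR20 / Cronbach's alpha for rows of 1/0 marks (N-weighted variances)."""
    n, k = len(marks), len(marks[0])
    scores = [sum(row) for row in marks]
    mean = sum(scores) / n
    ms_row = sum((s - mean) ** 2 for s in scores) / n
    ms_wic = sum((sum(col) / n) * (1 - sum(col) / n) for col in zip(*marks))
    return (k / (k - 1)) * (1 - ms_wic / ms_row)

# A random 22 x 21 stand-in mark matrix (not the Nurse124 marks).
random.seed(1)
marks = [[1 if random.random() < 0.7 else 0 for _ in range(21)] for _ in range(22)]
before = kr20(marks)

# Sort each item column so all 1's are at the top: item difficulty is
# unchanged, but student scores spread out, so MSrow grows and KR20 rises.
cols = [sorted(col, reverse=True) for col in zip(*marks)]
sorted_marks = [list(row) for row in zip(*cols)]
after = kr20(sorted_marks)

print(round(before, 2), "->", round(after, 2))
```

Sorting maximizes the spread of student scores (MSrow) while the within-column variation (MSwic) stays fixed, which is exactly why discrimination drives reliability.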

Download TREngine for Mac and PC: TREngine.xls or TREngine.xlsm and save, or run it in your browser. (If it does not work, the operating system frequently offers some helpful information.)

Deleting an item and replacing it to find which items contribute the most, or the least, to test reliability has been automated. Select the item number (ITEM #) in the bottom row of Table 9. Then click the Toggle button for your results. Click the Toggle button again to restore the item before selecting another item.

A scatter chart from all 21 single item deletions indicates that difficulty is not the primary factor in test reliability. Deleting the two most negative discriminating items increased test reliability the most. Deleting the most discriminating item decreased test reliability the most. The Spearman-Brown prediction formula estimated that a test reliability of 0.28 would be expected, after decreasing the number of items from 21 to 20, when doing the deletions.  The test reliability for all 21 items was 0.29.
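The deletion sweep can be sketched outside of Excel as well. The mark matrix below is a random stand-in for the real data, but the Spearman-Brown expectation for dropping from 21 to 20 items (0.29 to about 0.28) is reproduced exactly:

```python
import random

def kr20(marks):
    """KR20 for rows of 1/0 marks (N-weighted variances)."""
    n, k = len(marks), len(marks[0])
    scores = [sum(row) for row in marks]
    mean = sum(scores) / n
    ms_row = sum((s - mean) ** 2 for s in scores) / n
    ms_wic = sum((sum(col) / n) * (1 - sum(col) / n) for col in zip(*marks))
    return (k / (k - 1)) * (1 - ms_wic / ms_row)

def spearman_brown(r, new_len, old_len):
    """Predicted reliability when a test is lengthened (or shortened)."""
    factor = new_len / old_len
    return factor * r / (1 + (factor - 1) * r)

# A random 22 x 21 stand-in mark matrix (not the Nurse124 marks).
random.seed(2)
marks = [[1 if random.random() < 0.8 else 0 for _ in range(21)] for _ in range(22)]

full = kr20(marks)
# Reliability with each single item deleted, as the Toggle button does.
deleted = [kr20([row[:j] + row[j + 1:] for row in marks]) for j in range(21)]

# Spearman-Brown expectation for shortening the test from 21 to 20 items;
# with the text's 0.29 this gives about 0.28.
expected = spearman_brown(full, 20, 21)
print(round(full, 2), round(expected, 2),
      round(min(deleted), 2), round(max(deleted), 2))
```

Items whose deletion pushes the reliability above the Spearman-Brown expectation are the negative discriminators; those that pull it below are the best discriminators.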

The third thing I learned was that a 22 by 21 matrix is very unstable. I could only detect this with all four of the discussed statistics on one active Excel sheet. Changing a single mark from right to wrong, or wrong to right, in over 25 of the cells changed the test reliability from 0.29 to as low as 0.21 or as high as 0.36. Cells around the edge of the cell field seemed to be the most sensitive. This range in sensitivity suggests there is more information in this matrix than just harvesting variation with the Mean SS or Variance. Winsteps harvests unexpectedness from the matrix.

Table 9 combines four education statistics (count, average, standard deviation, and test reliability). It clearly shows that the more items on the test (the more Variance summed) and the more discriminating the items, the higher the test reliability. Table 9 also provides an easy way to explore ALL of the effects of changing an item or even a single mark. I could not have finished the last post without using it. Understanding is having relationships in mind. Table 9 dynamically relates facts, which in the traditional case, are usually presented in isolation.

[To use the Test Reliability Engine for other combinations than a 22 by 21 table requires adjusting the central cell field and the values of N for student and item. Then drag active cells over any new similar cells when you enlarge the cell field. You may need to do additional tweaking. The percent Student Score Mean and Item Difficulty Mean must be identical.

To reduce the cell field, use “Clear Contents” on the excess columns and rows on the right and lower sides of the cell field. Include the six cells that calculate SS that are below items and to the right of student scores. Then manually reset the number of students and items. You may need additional tweaking. The percent Student Score Mean and Item Difficulty Mean must be identical.]

A password is used to prevent unwanted changes. The password is “PUP522”.

- - - - - - - - - - - - - - - - - - - - - 

Wednesday, April 10, 2013

Visual Education Statistics - Test Reliability


Statistic Four: An estimate of test reliability or reproducibility helps tune a test for a desired standard by replacing, adding, or removing items. This is helpful in the classroom. It is critical in the marketing of standardized tests. No one wants to buy an unreliable test. Time and money require the shortest test possible to meet desired standards.

There is no true standard available on which to base test reliability. The best that we can do is to use the current test score and its standard deviation (SD). The test score is as real as any athletic event score, or weather or stock market report. “Past performance is no guarantee of future performance.” The SD captures the score distribution in the form of the normal curve of error, as described in previous posts.

A Guttman table (Table 6) shows two ways to calculate the Mean Sum of Squares (MS) or Variance within item columns (2.96). The first uses the Mean SS as discussed in prior posts. [Mean Sum of Squares = Mean SS = Mean Square = MS = Variance] The second uses probabilities based on the difficulty of each item. The results are identical, 2.96 for large data sets (N) and 3.10 for classroom sized data sets (N – 1).

The KR20 and Cronbach’s alpha are then calculated using the ratio of the within item columns MS (2.96) to the student score row MS (4.08). [(21/20)*(1-(2.96/4.08)) = 0.29] A test reliability of only 0.29 is very low.
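In code, plugging the Table 6 values quoted above into the KR20 formula:

```python
# Plugging the Table 6 values quoted above into the KR20 formula.
k = 21          # items
ms_wic = 2.96   # within item columns MS (sum of item p*q values)
ms_row = 4.08   # student score row MS (N weighting)

kr20 = (k / (k - 1)) * (1 - ms_wic / ms_row)
print(round(kr20, 2))   # 0.29
```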

The mean square within item columns (MSwic) must be low relative to the student score row MS (MSrow) to obtain a high test reliability estimate.

But the more difficult an item is, the larger the contribution to the MSwic. The easiest item at 95% yields a Variance of 0.05. The most difficult item at 45% yields 0.25. To increase test reliability, the MSrow must increase, and the MSwic must decrease, in relation to one another.

The Unfinished and Discriminating items (Table 7) have similar difficulties: 73% and 71%. The test reliability increased from 0.29 to 0.47 when I deleted the eight (yellow) Unfinished items: 3, 7, 8, 9, 13, 17, 19, and 20 on Table 6. The MSwic fell 50% but MSrow fell only 36% to produce the increase in test reliability from 0.29 to 0.47. Getting rid of non-discriminating items helped.

A number of factors affect test reliability. Easy items (10, 12 and 21 in Table 6) yielded little to the Variance. We need easy items in the classroom to survey what students have mastered. Easy items are a waste of time and money on standardized tests designed only to rank students. Easy items do not spread out student scores. Easy items do little to support the student score MSrows.

This test only has 21 questions (Table 7). If the test had been 50 items long the estimated reliability would be 0.49, and with 100 items it would be 0.66. The test was too short using the current items. Doubling the length of this test (21 items to 42 items) by including a duplicate set of mark data increased the estimated test reliability from 0.29 to 0.65. MSwic doubled (twice as many items) but MSrow increased four times (the doubling of the score deviation was squared).

[There seems to be a discrepancy between the Spearman-Brown prediction formula in PUP 5.22 and the actual doubling of the length of this test with identical mark data on an Excel spreadsheet (lengthening from 21 to 50 items predicts an increase from 0.29 to 0.49, while actually doubling from 21 to 42 items increased it from 0.29 to 0.65). That is, a smaller increase in items produced a larger change in results (0.49 and 0.65).]

This test had five discriminating items (Table 7) yielding an estimated test reliability of 0.50, almost twice that for the entire test of 21 items. If a test of 50 such items were used, the estimated test reliability would be expected to be 0.91. This qualifies for a standardized test! (A dash is shown where calculations yield meaningless results in Table 7.)
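These predictions all come from the Spearman-Brown formula. A short Python sketch reproduces the 0.49, 0.66, and 0.91 figures from the 0.29 and 0.50 reliabilities quoted in the text:

```python
def spearman_brown(r, new_len, old_len):
    """Predicted reliability when a test of old_len items becomes new_len items."""
    k = new_len / old_len
    return k * r / (1 + (k - 1) * r)

# Whole 21-item test (reliability 0.29) lengthened to 50 and 100 items.
print(round(spearman_brown(0.29, 50, 21), 2))   # 0.49
print(round(spearman_brown(0.29, 100, 21), 2))  # 0.66

# Five discriminating items (reliability 0.50) lengthened to 50 items.
print(round(spearman_brown(0.50, 50, 5), 2))    # 0.91
```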

Test reliability then increases with test length and with difficult items that are also discriminating. Marking a difficult item correctly has the same weight as marking an easy item correctly in determining test reliability (same MSrow, 4.08). An item has the same difficulty whether marked right by an able student or by a less able student (same MScolumn, 9.58).

The forerunner of Power Up Plus (PUP) was originally compared to other test scoring software to verify that it was producing correct results. PUP also produces the same test reliability estimate as Winsteps: 0.29.
- - - - - - - - - - - - - - - - - - - - - 
- - - - - - - - - - - - - - - - - - - - -

I have included the following discussion of the analysis of variance (ANOVA) while I have test reliability in mind again. You can skip to the next post unless you have an interest in the details of test reliability that show some basic relationships between sums of squares (SS). Or, put another way, if I can solve the same problem in more than one way, I just might be right in interpreting the paper by Li and Wainer, 1998, Toward a Coherent View of Reliability in Test Theory.

The ANOVA (Hoyt, 1941) and Cronbach’s alpha (1951) produce identical test reliability results. The ANOVA, however, makes clear that an assumption must be made for this to happen (Li and Wainer, 1998). This assumption provides a view into the depths of psychometrics that I have little intention of exploring. It seems that the KR20 (Kuder & Richardson, 1937) and alpha test reliability is not a point but a region. They underestimate test reliability; their estimates fall at the lower boundary of the region. The MSwic of 2.96 may be an overestimate of error, resulting in a lower test reliability estimate (0.29).

How much difference this really makes will have to wait until I get further into this study or until a more informed person can help out. If the difference is similar to that produced by the correction for small samples in the MSwic (2.96 to 3.10, 1/22, or about 5%) on Table 6, then it may have a practical effect and should not be ignored. This may become very important when we get to the next statistic, statistic five: Standard Error of Measurement. The SSwic is also labeled interaction, error, unexplained, rows within columns, scores by difficulties, and scores within difficulties.

The MSwic (Interactions) is assumed to be the error term in the ANOVA. This uses a customary means of solving difficult statistical, engineering, and political problems: simplifying the problem by ignoring a variable that may have little effect. The ANOVA tables in Table 8 reflect my understanding of Li and Wainer, 1998. Some help would be appreciated here too.

I used the “ANOVA Calculation Using a Correction Factor” on the right side of Table 8 to verify the total SS, score SS, and error SS (74.28 = 4.28 + 70.00). The required SS error term for the KR20 (SSwic of 65.14) is then found at the bottom of Table 4 and at the bottom of Table 8 (Scores by Difficulties: 74.28 – 9.14 = 65.14).  The item column SScolumns is 9.14. The value 65.14 is then the common factor in the two methods that results in the same test reliability estimate.

The SSs and MSs in yellow are based on a scale of 0 to 1 with the Grand Mean (0.799) as the mean. The SSs and MSs in white are based on a normal item count scale. The note indicates how to convert from one scale to the other. This makes a handy check on the correctness of setting up the Excel spreadsheet if you resize the central data field from 22 students by 21 items (also see the next post, Test Reliability Engine).

The F test is improved from 1.28 in the “Unexplained Student Score ANOVA Table” to 1.31 in the “Explained Student Score ANOVA Table.” Neither exceeds the critical value of 1.62. These answer-mark data may result from luck on test day from many sources (student preparedness, selection of test items, testing environment, attitude, error in marking, chance, etc.). The ANOVA table confirms that a test reliability of 0.29 is low. The descriptive statistics are valid for this test, but no predictions can be made.

The SSwic Interactions (65.14) sums the variation in marks within each column [(=VAR.P(B5:B26), copied from column B to column V) x 22 students]. The SSwir Interactions (70.00) sums the variation in marks within each row [(=VAR.P(B5:V5), copied from row 5 to row 26) x 21 items]. The cell Interactions, the total SS (74.28), sums the variation in the item scores (0 and 1) within the full Guttman table [=VAR.P(B5:V26) x 462 marks].
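These sums-of-squares identities hold for any mark matrix. The sketch below uses a random stand-in matrix (not the Nurse124 marks) and checks that the within and between sums of squares add up to the total both by rows and by columns:

```python
import random

def var_p(values):
    """Population variance, like Excel's VAR.P."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

# A random 22 x 21 stand-in mark matrix (the Nurse124 marks are not reproduced).
random.seed(3)
marks = [[1 if random.random() < 0.8 else 0 for _ in range(21)] for _ in range(22)]
n, k = len(marks), len(marks[0])
cols = list(zip(*marks))

ss_total = var_p([m for row in marks for m in row]) * (n * k)   # VAR.P x 462
ss_within_rows = sum(var_p(row) * k for row in marks)           # SSwir
ss_within_cols = sum(var_p(col) * n for col in cols)            # SSwic
ss_between_rows = var_p([sum(row) for row in marks]) * n / k    # score SS
ss_between_cols = var_p([sum(col) for col in cols]) * k / n     # item SS

# The total SS decomposes both ways, as in Table 8.
assert abs(ss_total - (ss_between_rows + ss_within_rows)) < 1e-6
assert abs(ss_total - (ss_between_cols + ss_within_cols)) < 1e-6
print(round(ss_total, 2))
```

The same check can be made on the Excel sheet: the 74.28 total must equal 4.28 + 70.00 by rows and 9.14 + 65.14 by columns.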

- - - - - - - - - - - - - - - - - - - - - 

Wednesday, April 3, 2013

Visual Education Statistics - Standard Deviation

Statistic Three:  The standard deviation (SD) is an attempt to capture the distribution of scores in a standardized way (visually, the normal curve in the prior post). The 22 scores in the Nursing124 data (Tables 2 and 3) have been sorted only to make the charts easier to follow (Chart 10).

 [Check Table 4 below, (PUP 5.20, Table 3c. Guttman Mark Matrix with Scores and Item Difficulty), for the values being plotted.]

[A Guttman table has the student scores sorted from high to low, vertically, and item difficulties sorted high to low, horizontally. The most difficult item and the lowest scoring student end up at the lower right corner of the table.]

The variation in the student scores is readily visible when the average score is added to the chart (Chart 11). It is this region of variation that is captured by the standard deviation (SD). You can add or subtract a constant number from every student score without changing the variation.

The deviation of each score from the average test score (the mean) is extracted from the scores and plotted next (Chart 12). These deviations add to zero. The solution to this problem (back when this was all done with paper and pencil) was to square the deviations. Now the numbers capturing the variation in the scores are all positive and can be added (Chart 13). The sum of squares (SS) is 89.86.

About ½ of the SS is produced by just 3 of the most extreme scores out of the 22 total. It is a matter of personal judgment when to call an extreme score an outlier and remove it from further statistical analysis.

The calculation of the SD involves three steps shown on the right side of the Guttman table (Table 4): the sum of squared deviations (SS) [89.86], the mean of the sum of squares (Mean SS, MSS or Variance) [4.28], and the SD (Square Root of the Mean SS or the Variance) [2.07]. Each step has been given a name as in many calculations the SS or the Mean SS is used rather than developing the SD and then reversing the process as needed.
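The three steps can be sketched in a few lines of Python; the 22 scores below are hypothetical stand-ins, not the actual Table 4 scores:

```python
import math

# Hypothetical set of 22 student scores (not the actual Table 4 scores).
scores = [20, 19, 19, 18, 18, 18, 17, 17, 17, 17, 17, 17,
          17, 16, 16, 16, 16, 15, 15, 14, 13, 12]
n = len(scores)
mean = sum(scores) / n

ss = sum((s - mean) ** 2 for s in scores)   # step 1: sum of squared deviations
mean_ss = ss / (n - 1)                      # step 2: Variance (N - 1 for a class)
sd = math.sqrt(mean_ss)                     # step 3: standard deviation

# The process is fully reversible: SD squared is the Mean SS,
# and Mean SS times the number summed is the SS again.
assert abs(sd ** 2 - mean_ss) < 1e-9
assert abs(mean_ss * (n - 1) - ss) < 1e-9
print(round(ss, 2), round(mean_ss, 2), round(sd, 2))
```

The two assertions demonstrate why each step was given its own name: a calculation can stop at the SS or Mean SS and recover the SD (or the reverse) whenever needed.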

Standard here means done in a standardized way. The standard is that 2/3 (68.3%) of the student scores and item difficulties are expected to fall within the range of (+-) one SD of their average values. And 95.4% are expected to fall within the range of (+-) two SDs of their average values.

This standard generates the normal curve of error that in education is shortened to the normal curve or bell curve with both the meaning and use reversed. In the sciences and engineering, error is to be avoided. In education, error is used to produce the spread of scores needed to assign letter grades in schools designed for failure and to assign the pass/fail point on NCLB standardized assessments.

[This need for a wide distribution of scores for setting test grades may result in problems in establishing the precision of student scores. This needs to be checked when I get to the standard error of measurement, statistic #5, in this series.]

The above calculations are shown for N and for N – 1 when calculating the Mean SS. N – 1 is a correction for classroom sized test data, when the number of students and questions is below 100.

Table 4 shows the SD for student scores and for item difficulties. Item difficulties generally have a wider spread (3.17), a larger SD, than student scores (2.07) in classroom tests.

I have fully developed Table 4, Standard Deviation (SD) Calculations, as it shows the foundation for all of the remaining statistics in this series. The sum of squared deviations (SS) is 89.86 for student scores and 201.40 for item difficulties. Dividing the SS by the number summed produces the Mean SS or Variance. The square root of the Mean SS is the SD. This is totally reversible: squaring the SD yields the Mean SS, and multiplying the Mean SS by the number summed gets back to the SS.

The above paragraph reports honest number manipulations. There is no room for bias in calculating the SS. However, the deviations squared (DEVSQR) are spread over a wider range than the actual deviations. A large SD can be due to an evenly distributed set of scores or to a narrow distribution with one or more outliers far from the average score. Two identical SDs may result from two very different distributions.

This can be a problem in classroom tests. Standardized tests reduce the problem by sampling a large enough number of students or items to get a stable distribution.

[The SS are also used in the analysis of variance (ANOVA) commonly used in the sciences and engineering. I never saw it used in education when I was teaching. The ANOVA permits one to determine if the distribution of marks in rows and columns is just a matter of luck or if there is something else at play. If the null hypothesis, that the distribution is no different than a matter of luck, holds, there is then no need to do any other statistical tests.

The calculations (in yellow) for the ANOVA have been added to the left and bottom edges of Table 4. The grand mean is 0.799 (based on the values of right and wrong marks, 1 and 0). The deviations squared (DEVSQR) for each score and difficulty are listed with respect to the grand mean. The total degrees of freedom is the count of cells in the table minus 1 (462 – 1).

The ANOVA (Table 5) yields the ratio of the Mean SS between student score rows to the unexplained (or error) Mean SS within rows (0.20/0.16), for an F test value of 1.27. This value does not exceed the critical value of 1.59 from the 5% level of significance F table for 21/440 degrees of freedom. The variation found in this table of student marks may be a matter of luck (student preparation and attitude, item authoring, test creator item selection, testing environment, and pure chance).] 
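The F test itself is only a ratio of mean squares. The sketch below repeats the calculation on a random stand-in matrix, with the 21/440 degrees of freedom from the text; the 1.59 critical value is quoted from the 5% F table, not computed here:

```python
import random

def var_p(values):
    """Population variance, like Excel's VAR.P."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

# A random 22 x 21 stand-in mark matrix; 1.59 is the quoted 5% critical
# value for 21/440 degrees of freedom, not computed here.
random.seed(4)
marks = [[1 if random.random() < 0.8 else 0 for _ in range(21)] for _ in range(22)]
n, k = len(marks), len(marks[0])

ss_total = var_p([m for row in marks for m in row]) * (n * k)
ss_rows = var_p([sum(row) for row in marks]) * n / k   # between student scores
ss_error = ss_total - ss_rows                          # within rows, unexplained

df_rows, df_error = n - 1, (n * k - 1) - (n - 1)       # 21 and 440
f = (ss_rows / df_rows) / (ss_error / df_error)
print(round(f, 2), "significant at 5%:", f > 1.59)
```

If F stays below the critical value, the null hypothesis holds and, as noted above, there is no need for further statistical tests.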

- - - - - - - - - - - - - - - - - - - - - 