Wednesday, April 10, 2013

Visual Education Statistics - Test Reliability



Statistic Four: An estimate of test reliability or reproducibility helps tune a test for a desired standard by replacing, adding, or removing items. This is helpful in the classroom. It is critical in the marketing of standardized tests. No one wants to buy an unreliable test. Time and money require the shortest test possible to meet desired standards.

There is no true standard available on which to base test reliability. The best that we can do is to use the current test score and its standard deviation (SD). The test score is as real as any athletic event score, weather report, or stock market report. “Past performance is no guarantee of future performance.” The SD captures the score distribution in the form of the normal curve of error, as described in previous posts.

A Guttman table (Table 6) shows two ways to calculate the Mean Sum of Squares (MS) or Variance within item columns (2.96). The first uses the Mean SS as discussed in prior posts. [Mean Sum of Squares = Mean SS = Mean Square = MS = Variance] The second uses probabilities based on the difficulty of each item. The two methods give identical results: 2.96 for large data sets (divisor N) and 3.10 for classroom-sized data sets (divisor N - 1).
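For readers who want to see the arithmetic, here is a minimal Python sketch of the two calculations. The mark matrix below is made up for illustration; it is not the Table 6 data.

# Two ways to get the MS (Variance) within item columns from a 0/1 mark table.
# Rows are students, columns are items (1 = right, 0 = wrong); made-up data.
marks = [
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 1, 1],
    [1, 1, 0, 0],
]
n = len(marks)       # students
k = len(marks[0])    # items

def var_p(values):   # population variance, the Excel VAR.P
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

# Method 1: sum the variance of each item column (the Mean SS route).
ms_wic = sum(var_p([row[j] for row in marks]) for j in range(k))

# Method 2: sum p * (1 - p), where p is each item's difficulty.
difficulties = [sum(row[j] for row in marks) / n for j in range(k)]
ms_wic_from_p = sum(p * (1 - p) for p in difficulties)

print(round(ms_wic, 6), round(ms_wic_from_p, 6))   # identical (large-sample, N form)
print(round(ms_wic * n / (n - 1), 6))               # classroom-sized, N - 1 form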

The KR20 and Cronbach’s alpha are then calculated using the ratio of the within item columns MS (2.96) to the student score row MS (4.08). [(21/20) * (1 - (2.96/4.08)) = 0.29] A test reliability of only 0.29 is very low.
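Plugging the quoted Table 6 values into a small Python function reproduces the 0.29 (the 21, 2.96, and 4.08 are taken from the text above):

# KR20 / Cronbach's alpha from the summary values quoted above.
def alpha(k_items, sum_item_variance, score_variance):
    return (k_items / (k_items - 1)) * (1 - sum_item_variance / score_variance)

print(round(alpha(21, 2.96, 4.08), 2))   # 0.29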

The mean square within item columns (MSwic) must be low relative to the student score row MS (MSrow) to obtain a high test reliability estimate.

But the more difficult an item is, the larger its contribution to the MSwic. The easiest item, at 95%, yields a Variance of 0.05. The most difficult item, at 45%, yields 0.25. To increase test reliability, MSrow must increase and MSwic must decrease relative to one another.
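For a single right/wrong item the Variance is simply p times (1 - p), so the two difficulties quoted above work out as:

# Variance contributed by one 0/1 item of difficulty p.
for p in (0.95, 0.45):
    print(p, round(p * (1 - p), 2))   # 0.95 -> 0.05, 0.45 -> 0.25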

The Unfinished and Discriminating items (Table 7) have similar difficulties: 73% and 71%. The test reliability increased from 0.29 to 0.47 when I deleted the eight (yellow) Unfinished items: 3, 7, 8, 9, 13, 17, 19, and 20 on Table 6. The MSwic fell 50% but MSrow fell only 36%, which produced the increase in test reliability. Getting rid of non-discriminating items helped.

A number of factors affect test reliability. Easy items (10, 12, and 21 in Table 6) contributed little to the Variance. We need easy items in the classroom to survey what students have mastered, but easy items are a waste of time and money on standardized tests designed only to rank students. Easy items do not spread out student scores, so they do little to support the student score MSrow.

This test has only 21 questions (Table 7). If the test had been 50 items long, the estimated reliability would be 0.49; with 100 items it would be 0.66. The test was too short using the current items. Doubling the length of this test (21 items to 42 items) by including a duplicate set of mark data increased the estimated test reliability from 0.29 to 0.65. MSwic doubled (twice as many items) but MSrow increased fourfold (the doubling of the score deviation was squared).
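The 50-item and 100-item projections follow the Spearman-Brown prediction formula. Here is a minimal Python sketch using the values above (0.29 observed on 21 items):

# Spearman-Brown prediction for a test lengthened by a factor f.
def spearman_brown(reliability, f):
    return f * reliability / (1 + (f - 1) * reliability)

r_21 = 0.29                                      # observed, 21 items
print(round(spearman_brown(r_21, 50 / 21), 2))   # about 0.49 for 50 items
print(round(spearman_brown(r_21, 100 / 21), 2))  # about 0.66 for 100 items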

[There seems to be a discrepancy between the Spearman-Brown prediction formula in PUP 5.22 and the actual doubling of the length of this test with identical mark data on an Excel spreadsheet: lengthening 21 items to 50 items predicts 0.29 rising to 0.49, while doubling 21 items to 42 items actually raised 0.29 to 0.65. That is, a smaller increase in items (21 added versus 29) produced a larger change in results (0.65 versus 0.49).]

This test had five discriminating items (Table 7) yielding an estimated test reliability of 0.50, almost twice that for the entire test of 21 items. If a test of 50 such items were used, the estimated test reliability would be expected to be 0.91. This qualifies for a standardized test! (A dash is shown where calculations yield meaningless results in Table 7.)
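The 0.91 projection is again a Spearman-Brown calculation (shown here as a check, not as PUP's actual computation):

# Five discriminating items at 0.50 reliability, projected to a 50-item test.
def spearman_brown(reliability, f):
    return f * reliability / (1 + (f - 1) * reliability)

print(round(spearman_brown(0.50, 50 / 5), 2))    # about 0.91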

Test reliability then increases with test length and with difficult items that are also discriminating. Marking a difficult item correctly has the same weight as marking an easy item correctly in determining test reliability (same MSrow, 4.08). An item has the same difficulty whether marked right by an able student or by a less able student (same MScolumn, 9.58).

The forerunner of Power Up Plus (PUP) was originally compared to other test scoring software to verify that it was producing correct results. PUP also produces the same test reliability estimate as Winsteps: 0.29.
- - - - - - - - - - - - - - - - - - - - - 

Free software to help you and your students experience and understand the change from TMC to KJS (tricycle to bicycle):
- - - - - - - - - - - - - - - - - - - - -

I have included the following discussion of the analysis of variance (ANOVA) while I have test reliability in mind again. You can skip to the next post unless you are interested in the details of test reliability that show some basic relationships between sums of squares (SS). Or, put another way: if I can solve the same problem in more than one way, I just might be right in interpreting the paper by Li and Wainer, 1998, Toward a Coherent View of Reliability in Test Theory.

The ANOVA (Hoyt, 1941) and Cronbach’s alpha (1951) produce identical test reliability results. The ANOVA, however, makes clear that an assumption must be made for this to happen (Li and Wainer, 1998). This assumption provides a view into the depths of psychometrics that I have little intention to explore. It seems that the KR20 (Kuder & Richardson, 1937) and alpha test reliability estimates are not a point but a region. They underestimate test reliability; their estimates fall at the lower boundary of the region. The MSwic of 2.96 may be an overestimate of error, resulting in a lower test reliability estimate (0.29).
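The identity can be demonstrated with a short Python (numpy) sketch. The mark matrix is made up for illustration; Hoyt's reliability, 1 - MSresidual/MSstudents from the two-way ANOVA, matches Cronbach's alpha exactly when population (VAR.P-style) variances are used throughout:

# Hoyt's ANOVA reliability and Cronbach's alpha give the same number (made-up 0/1 marks).
import numpy as np

marks = np.array([
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0],
], dtype=float)
n, k = marks.shape                    # students, items

# Cronbach's alpha with population variances.
alpha = (k / (k - 1)) * (1 - marks.var(axis=0).sum() / marks.sum(axis=1).var())

# Two-way ANOVA: students x items, with the interaction as the error term.
grand = marks.mean()
ss_total = ((marks - grand) ** 2).sum()
ss_students = k * ((marks.mean(axis=1) - grand) ** 2).sum()
ss_items = n * ((marks.mean(axis=0) - grand) ** 2).sum()
ss_residual = ss_total - ss_students - ss_items
ms_students = ss_students / (n - 1)
ms_residual = ss_residual / ((n - 1) * (k - 1))
hoyt = 1 - ms_residual / ms_students

print(round(alpha, 4), round(hoyt, 4))   # identical (about 0.708 for this made-up data)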

How much difference this really makes will have to wait until I get further into this study or until a more informed person can help out. If the difference is similar to that produced by the correction for small samples in the MSwic (2.96 to 3.10, 1/22, or about 5%) in Table 6, then it may have a practical effect and should not be ignored. This may become very important when we get to the next statistic, Statistic Five: Standard Error of Measurement. The SSwic is also labeled interaction, error, unexplained, rows within columns, scores by difficulties, and scores within difficulties.

The MSwic (Interactions) is assumed to be the error term in the ANOVA. This uses a customary means of solving difficult statistical, engineering, and political problems: simplifying the problem by ignoring a variable that may have little effect. The ANOVA tables in Table 8 reflect my understanding of Li and Wainer, 1998. Some help would be appreciated here too.

I used the “ANOVA Calculation Using a Correction Factor” on the right side of Table 8 to verify the total SS, score SS, and error SS (74.28 = 4.28 + 70.00). The required SS error term for the KR20 (SSwic = 65.14) is then found at the bottom of Table 4 and at the bottom of Table 8 (Scores by Difficulties: 74.28 – 9.14 = 65.14). The item column SS (SScolumns) is 9.14. The value 65.14 is the common factor in the two methods that results in the same test reliability estimate.
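To see how the 65.14 carries through, the Table 8 values quoted above can be fed straight back into the KR20 (22 students, 21 items, and the 4.08 score Variance from Table 6):

# The ANOVA error term (SSwic) feeds the same KR20/alpha estimate of 0.29.
n, k = 22, 21                       # students, items
ss_total, ss_columns = 74.28, 9.14  # from Table 8
ss_wic = ss_total - ss_columns      # 65.14, Scores by Difficulties
sum_item_variance = ss_wic / n      # 2.96, the MS within item columns
score_variance = 4.08               # student score MS from Table 6
alpha = (k / (k - 1)) * (1 - sum_item_variance / score_variance)
print(round(ss_wic, 2), round(alpha, 2))   # 65.14 and 0.29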

The SSs and MSs in yellow are based on a scale of 0 to 1 with the Grand Mean, 0.799, as the mean. The SSs and MSs in white are based on a normal item count scale. The note indicates how to convert from one scale to the other. This makes a handy check on the correctness of the Excel spreadsheet setup if you resize the central data field from 22 students by 21 items (also see the next post, Test Reliability Engine).

The F test improves from 1.28 in the “Unexplained Student Score ANOVA Table” to 1.31 in the “Explained Student Score ANOVA Table.” Neither exceeds the critical value of 1.62. These answer mark data may result from luck on test day arising from many sources (student preparedness, selection of test items, testing environment, attitude, error in marking, chance, etc.). The ANOVA table confirms that a test reliability of 0.29 is low. The descriptive statistics are valid for this test, but no predictions can be made.

The SSwic Interactions (65.14) sums the variation in marks within each item column [=VAR.P(B5:B26), copied across columns B to V, times 22 students]. The SSwir Interactions (70.00) sums the variation in marks within each student row [=VAR.P(B5:V5), copied down rows 5 to 26, times 21 items]. The cell Interactions, the total SS (74.28), sums the variation in the item marks (0 and 1) within the full Guttman table [=VAR.P(B5:V26) x 462 marks].
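In Python (numpy), those three spreadsheet sums might be written as below; the file name is hypothetical and stands in for the 22 x 21 block of 0s and 1s in the Guttman table:

# Numpy equivalents of the three VAR.P sums (marks = 22 students x 21 items of 0/1).
import numpy as np

marks = np.loadtxt("guttman_marks.csv", delimiter=",")  # hypothetical file of the marks
n, k = marks.shape

ss_wic = marks.var(axis=0).sum() * n      # within each item column, times 22 students
ss_wir = marks.var(axis=1).sum() * k      # within each student row, times 21 items
ss_total = marks.var() * (n * k)          # over all 462 marks in the table

print(round(ss_wic, 2), round(ss_wir, 2), round(ss_total, 2))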


