Chi-square Goodness of Fit Test

The chi-square goodness-of-fit test is a theory-based test of significance that compares observed data to a prescribed model. We can use goodness-of-fit tests to compare the distribution of a categorical variable to a hypothesized distribution. For example, in Chapter 1, we conducted tests to evaluate questions like “When playing the game rock-paper-scissors (see Example 1.2), do novice players play scissors less than one-third of the time in the long run?” These tests evaluated whether sample data provided evidence that the probabilities associated with a categorical variable with two categories (scissors or not scissors) differed from hypothesized probabilities (π and 1 – π). At that time, however, we could only evaluate categorical variables with two categories. As we’ve been exploring in Chapter 8, categorical variables often have more than two categories. For example, you might want to test whether novice players choose all three options in rock-paper-scissors equally often in the long run (1/3, 1/3, 1/3). Similarly, we could ask: Are birthdays equally distributed across the seven days of the week? Do certain pea plants produce three times as many purple flowers as white flowers? We’ll see how to evaluate these types of questions in this section.

Fair Die Roll?

A statistics student wondered whether rolling six-sided dice by pushing them off a 2-inch ledge was a fair way of rolling dice.

In this case, for a die roll to be fair, it means that all six sides are equally likely to occur. In other words, if we were to repeatedly roll the die by pushing it off a 2-inch high ledge, then we would expect each of the 6 numbered faces of the die to appear on top an equal number of times in the long run. If we rolled the die 120 times, we would expect to see each of the 6 different numbers rolled about 20 times. After rolling the die 120 times, if we observe our data to have deviated substantially from what we expected to see from a fair die (~20 times each), we may have evidence that our rolling method is not “fair.”

\[H_0: \pi_1 = \pi_2 = \pi_3 = \pi_4 = \pi_5 = \pi_6 = \frac{1}{6}, \textrm{ so the method of rolling is fair.}\]

\[H_a: \textrm{ At least one } \pi_i \textrm{ differs from } \frac{1}{6}, \textrm{ so the method of rolling is not fair.}\]

where \(\pi_i\) is the long-run proportion of rolling the number \(i\), for \(i = 1, 2, 3, 4, 5, 6.\)

Notice how these hypotheses look familiar: we are again testing the equality of several probabilities. A key difference is that we now specify a numerical value for each \(\pi_i\) (and those values must sum to 1).

Load packages and data

#add the other package that we will need
library(ggformula)
library(mosaic)

We’ll load the data, Die.csv, available at this URL: https://raw.githubusercontent.com/IJohnson-math/Math138/main/Die.csv. Since this is a CSV file, we’ll use the read.csv() function to read in the data.

DieData <- read.csv("https://raw.githubusercontent.com/IJohnson-math/Math138/main/Die.csv")

We can view the data by creating a bar graph of the frequency each number is rolled.

gf_bar(~DieRoll, data=DieData, xlab="Number", ylab="Frequency of number", title = "Number of times each face appeared in the sample")

We can also view the data by creating a table of counts for the six faces of the die.

tally( ~DieRoll, data=DieData)
## DieRoll
##  1  2  3  4  5  6 
## 19 10 31 17 15 28

The MAD Statistic and Simulation-Based Approach

The Mean Absolute Difference, MAD, is a statistic measuring how far the observed counts are from the hypothesized count (the expected count under the null hypothesis). In this study the hypothesized count is 20 for each face. The Absolute Difference part of the calculation is the absolute value of the difference between each observed count and the expected count. We then take the Mean of those six absolute differences. Here is the calculation for our study.

MAD = (abs(19-20)+abs(10-20)+abs(31-20)+abs(17-20)+abs(15-20)+abs(28-20))/6
MAD
## [1] 6.333333

Interpretation: On average, the observed count for a face is 6.33 rolls away from the expected count of 20.
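The same MAD value can be computed without typing each difference by hand. The following is a minimal sketch, assuming the counts from the table above are entered directly as a vector; the names `observed`, `expected`, and `mad_stat` are illustrative and not part of the original analysis.

```r
# Observed counts for faces 1 through 6, copied from the tally() output
observed <- c(19, 10, 31, 17, 15, 28)

# Expected count for each face under the null: total rolls times 1/6
expected <- sum(observed) * rep(1/6, 6)

# Mean of the absolute differences between observed and expected counts
mad_stat <- mean(abs(observed - expected))
mad_stat
```

The name `mad_stat` is used instead of `mad` because base R already has a function called `mad()` that computes a different quantity (the median absolute deviation).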

Go to the Multiple Proportions simulation-based applet to complete the analysis using the MAD statistic.

p-value from Simulation using MAD statistic: 0.0046 (46/10000)

p-value from Simulation using \(\chi^2\) statistic: 0.0061 (61/10000)
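For readers who want to see what such a simulation does, here is a rough sketch in base R. It is not the applet's code: it assumes the null model can be simulated by drawing 120 rolls with probability 1/6 per face using `rmultinom()`, and the seed, the number of repetitions, and all variable names are arbitrary choices. Because the draws are random, the resulting p-values will only be close to the applet's values, not identical.

```r
set.seed(138)  # arbitrary seed, used only so the sketch is reproducible

observed   <- c(19, 10, 31, 17, 15, 28)
n_rolls    <- sum(observed)            # 120 rolls in the study
null_probs <- rep(1/6, 6)              # fair-roll null hypothesis
expected   <- n_rolls * null_probs     # 20 per face
n_sims     <- 10000

# Each column holds the six face counts from one simulated set of 120 fair rolls
sim_counts <- rmultinom(n_sims, size = n_rolls, prob = null_probs)

# Observed statistics
observed_mad   <- mean(abs(observed - expected))
observed_chisq <- sum((observed - expected)^2 / expected)

# The same statistics for every simulated set of rolls (one value per column)
sim_mad   <- colMeans(abs(sim_counts - expected))
sim_chisq <- colSums((sim_counts - expected)^2 / expected)

# Proportion of simulations at least as extreme as the observed statistic;
# the small tolerance guards against floating-point ties
tol     <- 1e-9
p_mad   <- mean(sim_mad   >= observed_mad   - tol)
p_chisq <- mean(sim_chisq >= observed_chisq - tol)
c(MAD = p_mad, chisq = p_chisq)
```

Both statistics are one-sided here: only large values count as evidence against the null, so the p-value is the proportion of simulated statistics at or above the observed one.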

The Chi-squared distribution is actually a family of distributions whose shape depends on a parameter called the degrees of freedom, denoted by k.

  • For a Chi-squared hypothesis test of multiple proportions (as seen in Chapter 8, Section 2) the degrees of freedom are computed by multiplying the number of categories in the explanatory variable minus 1 by the number of categories in the response variable minus 1.
  • For a Chi-squared Goodness of Fit test (as seen here and in Chapter 8, Section 3) the degrees of freedom are one less than the number of proportions in the model. In our study of whether pushing the die off a ledge is a fair roll we have six possible categories for the roll, so k=6-1=5.

[Figure: Chi-square distributions]

Calculation of the Chi-square statistic

\[ \chi^2 = \sum \frac{(\textrm{observed count} - \textrm{expected count})^2}{\textrm{expected count}}\]

tally(~DieRoll, data=DieData)
## DieRoll
##  1  2  3  4  5  6 
## 19 10 31 17 15 28
# expected count for each face is (1/6) * 120 = 20
chiSquare <- ((19-20)^2)/20 + ((10-20)^2)/20 + ((31-20)^2)/20 +((17-20)^2)/20 + ((15-20)^2)/20 + ((28-20)^2)/20
chiSquare
## [1] 16
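The statistic can also be computed in vectorized form and converted to a theory-based p-value by hand. This is a sketch under the assumption that the counts are typed in as a vector; `pchisq()` is the base R function for the chi-square distribution, and the variable names are illustrative.

```r
observed <- c(19, 10, 31, 17, 15, 28)
expected <- sum(observed) * rep(1/6, 6)   # 20 for each face

# Sum of (observed - expected)^2 / expected over the six faces
chi_square <- sum((observed - expected)^2 / expected)

# Degrees of freedom: number of categories minus 1
k <- length(observed) - 1

# Upper-tail area of the chi-square distribution with k degrees of freedom
p_value <- pchisq(chi_square, df = k, lower.tail = FALSE)

chi_square
p_value
```

Setting `lower.tail = FALSE` returns the area to the right of the statistic, which is the relevant tail because only large chi-square values are evidence against the null hypothesis.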

Validity Conditions for Theory-Based Chi-squared tests

As with our other theory-based tests, this one comes with validity conditions. The validity condition for a chi-square goodness-of-fit test is that every observed count is at least 10. Since the values 19, 10, 31, 17, 15, 28 are all at least 10 (the smallest is exactly 10), the validity condition is met for this study.
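The validity check is quick to do in R. A minimal sketch, again assuming the counts are entered as a vector named `observed` (an illustrative name):

```r
observed <- c(19, 10, 31, 17, 15, 28)

# Smallest observed count across the six faces
min(observed)

# TRUE when every observed count is at least 10
all(observed >= 10)
```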

Theory-Based Chi-square Goodness of Fit test

We use the function chisq.test() to perform the goodness-of-fit test. The inputs to this function are a vector of observed counts and a vector of predicted (or expected) proportions. Note that the expected proportions must add up to 1.

If we have a data file, we can input the observed counts using the tally() function. If we have only a table of counts, we can input the counts as c(n1, n2, n3, n4, n5, n6), using the numbers n1, n2, etc. from our study.

To input the expected proportions, we write p = c(p1, p2, p3, p4, p5, p6), where p1 is the expected proportion for the count n1, p2 is the expected proportion for the count n2, and so on. Note that when the proportions are not all equal, the order in which they are listed matters!
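To see why the order matters, the sketch below tests the die counts against a made-up set of unequal proportions and then against the same proportions in reverse order. The proportions in `p_hypothetical` are purely illustrative and are not part of the study; they are used only to show that the same numbers listed in a different order describe a different null model.

```r
observed <- c(19, 10, 31, 17, 15, 28)

# Hypothetical, unequal null proportions (illustration only; they sum to 1)
p_hypothetical <- c(0.1, 0.1, 0.2, 0.2, 0.2, 0.2)

# Same proportions, matched to the faces in two different orders
test_a <- chisq.test(observed, p = p_hypothetical)
test_b <- chisq.test(observed, p = rev(p_hypothetical))

# The chi-square statistics differ because the expected counts differ
test_a$statistic
test_b$statistic
```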

# GoodnessFit
#chi-square using data file and expected proportion
chisq.test(tally(~DieRoll, data = DieData), p = c(1/6, 1/6, 1/6, 1/6, 1/6, 1/6)) 
## 
##  Chi-squared test for given probabilities
## 
## data:  tally(~DieRoll, data = DieData)
## X-squared = 16, df = 5, p-value = 0.006844
#chi-square from list of counts and expected proportion
chisq.test(c(19, 10, 31, 17, 15, 28), p=c(1/6, 1/6, 1/6, 1/6, 1/6, 1/6))
## 
##  Chi-squared test for given probabilities
## 
## data:  c(19, 10, 31, 17, 15, 28)
## X-squared = 16, df = 5, p-value = 0.006844
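The object returned by chisq.test() can be stored and inspected piece by piece, which helps identify the faces that contribute most to the statistic. This is a sketch that assumes the result is saved under the illustrative name `result`; the components accessed below are part of the object that chisq.test() returns.

```r
observed <- c(19, 10, 31, 17, 15, 28)
result <- chisq.test(observed, p = rep(1/6, 6))

result$expected    # expected count for each face under the null
result$statistic   # chi-square statistic
result$parameter   # degrees of freedom
result$p.value     # theory-based p-value

# Pearson residuals: (observed - expected) / sqrt(expected);
# large positive or negative values flag faces far from their expected count
result$residuals
```

In this study the residuals with the largest magnitude belong to faces 3 and 2, the faces whose counts (31 and 10) are farthest from 20.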

Conclusion: Our theory-based p-value 0.0068 is very similar to both of our simulation-based p-values. From this we conclude that we have strong evidence that pushing dice off a 2-inch ledge is not a fair way of “rolling” dice.

The study design is fairly well controlled because the same ledge, the same lineup of dice, and the same person pushing the dice were used for each trial. Thus, it seems reasonable to believe that the method of dice rolling, pushing them off a 2-inch ledge, is the cause of the faces not being equally likely. We weren’t, however, told whether our five dice were a random sample of all dice. It is plausible that they were taken from a generic board game. If so, we could argue they are representative of all board game dice of the same size and thus say that this method of dice rolling is not fair for all board game dice of this specific size. We need more information, though, to extend our conclusions beyond the five dice and 2-inch height used. Additionally, we would need assurance that these five dice weren’t loaded before we conducted our study!