Setup and packages

We use two packages in this course: mosaic and ggformula. To load a package, use the library() function with the name of the package inside the parentheses. I’ve put the code to load one package into the chunk below. Add the other one you need.

library(mosaic)
library(ggformula)
# put in the other package you need here

Loading in data

We will load the example data, GSS_clean.csv, from the url: https://raw.githubusercontent.com/IJohnson-math/Math138/main/GSS_clean.csv

Recall that this dataset comes from the General Social Survey (GSS), which is collected by NORC at the University of Chicago. It is a random sample of households from the United States, and it has been running since 1972, so it is very useful for studying trends in American life. The data I’ve given you is a subset of the questions asked in the survey, and it has been cleaned to make it easier to use.

We’ll use the read.csv() function to read in the data. Remember that quotes are required around the url address.

GSS <- read.csv("https://raw.githubusercontent.com/IJohnson-math/Math138/main/GSS_clean.csv")

Inference for many means (ANOVA)

We’ve seen how to do inference for a single mean, and for a difference of two means. Now, we’ll study how to do inference on many means. To do this, we’ll use a procedure called ANOVA, short for ANalysis Of VAriance. The idea is to compare the variability within groups to the variability between groups, which is summarized by the F-statistic.
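To make the idea concrete, here is a small toy example with simulated data (not the GSS; the group names and means are made up for illustration). Two groups share the same mean, while the third is shifted; the shifted group inflates the between-group variability relative to the within-group variability, producing a larger F-statistic.

```r
# Toy example with simulated (not GSS) data to illustrate the F-statistic
set.seed(1)
toy <- data.frame(
  group = rep(c("A", "B", "C"), each = 10),
  y     = c(rnorm(10, mean = 5), rnorm(10, mean = 5), rnorm(10, mean = 7))
)
fit <- aov(y ~ group, data = toy)
summary(fit)  # F value = (between-group mean square) / (within-group mean square)
```

With 3 groups and 30 observations, the ANOVA table has 3 - 1 = 2 degrees of freedom for the groups and 30 - 3 = 27 for the residuals.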

When we’re doing ANOVA, we just do hypothesis testing (no confidence intervals until after ANOVA), and the hypotheses are always the same:

\[ H_0: \mu_1 = \mu_2 = \dots = \mu_k \\ H_A: \text{At least one of the means is different} \]

Let’s consider the variables marital_status and number_of_hours_worked_last_week.

Make a set of side-by-side boxplots to show the relationship between those two variables. Note you might have to filter the data to remove NA values. Use the all_the_R_so_far_M138.Rmd file if you don’t remember the code.

#filtering out NA values
GSS <- filter(GSS, !is.na(marital_status))
GSS <- filter(GSS, !is.na(number_of_hours_worked_last_week))
#side-by-side boxplots
gf_boxplot(number_of_hours_worked_last_week ~ marital_status, data= GSS)

Another graph useful for checking for skewness or symmetry is a jitter plot, made with gf_jitter().

#a jitterplot 
gf_jitter(number_of_hours_worked_last_week ~ marital_status, data= GSS, alpha = 0.3, width=0.2)

Specifying the null and alternative hypotheses

How many groups do we have here? Five.

So our hypotheses are technically

\[ H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4 = \mu_5 \\ H_A: \text{At least one of the means is different} \] or, if we wanted to get specific,

\[ H_0: \mu_{\text{divorced}} = \mu_{\text{married}} = \mu_{\text{never married}} = \mu_{\text{separated}} = \mu_{\text{widowed}} \\ H_A: \text{At least one of the means is different} \]

Validity Conditions

To do ANOVA in R, we can use the built-in R function aov(), which will use an F distribution as the distributional approximation for the null distribution. In order to use this distributional approximation, two conditions must be met:

  • Either the sample size is at least 20 in every group and there is no strong skewness or outliers; or, if a sample size is less than 20, the distribution of the response variable is approximately symmetric in every sample (examine the plots for strong skewness or outliers)

  • the variability (standard deviation) of each group should be approximately equal. In practice, we check this by determining whether \(sd_{max}< 2\times sd_{min}\).

How can we check those conditions?

#calculating our favorite statistics for each category of marital status
favstats(number_of_hours_worked_last_week ~ marital_status, data= GSS)
##   marital_status min   Q1 median   Q3 max     mean       sd   n missing
## 1       Divorced   1 35.0     40 50.0  89 42.04184 14.74340 239       0
## 2        Married   1 37.0     40 50.0  89 41.48220 14.63485 618       0
## 3  Never married   1 35.0     40 48.0  89 40.96744 13.71926 430       0
## 4      Separated  15 37.5     40 54.5  72 44.20930 14.24205  43       0
## 5        Widowed   2 25.0     40 43.5  86 35.47059 16.68455  51       0
#another option to check for strong skewness
Married <- filter(GSS, marital_status=='Married')
gf_histogram(~number_of_hours_worked_last_week, data=Married, title="Number of hours worked last week by Married people")

Can you create histograms for the hours worked last week of the other groups?

Checking the Validity Conditions

The group sizes are 239, 618, 430, 43, and 51, all of which are greater than 20. The jitter plot and histograms above show no more than moderate skewness, and the boxplots show outliers beyond the whiskers on both ends.

From the favstats calculations we see the minimum sd is 13.7 and the maximum sd is 16.7, which is well below \(2 \times 13.7 = 27.4\).
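The same comparison can be done directly in R. Here I’ve typed in the five standard deviations from the favstats output above (rounded to one decimal):

```r
# standard deviations of hours worked, from the favstats output above
sds <- c(14.7, 14.6, 13.7, 14.2, 16.7)
max(sds) < 2 * min(sds)  # TRUE: the equal-variability condition is met
```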

Thus, validity conditions are satisfied.

Next we can use the aov() command. It works best if you save the result from aov() into a named R object. In the code below I’ve named the result a1.

Try running the following code. What happens in your Environment?

a1 <- aov(number_of_hours_worked_last_week ~ marital_status, data = GSS)

Now, let’s run summary() on that object,

summary(a1)
##                  Df Sum Sq Mean Sq F value Pr(>F)  
## marital_status    4   2296   574.0   2.752 0.0269 *
## Residuals      1376 287065   208.6                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

What are the degrees of freedom for the groups? The degrees of freedom for the error? The total degrees of freedom?

What is the test statistic? What is the p-value?

What is our general conclusion at the \(\alpha=0.05\) level?

Inference after ANOVA

Because we found a significant p-value and concluded that at least one of the means is different, we can do inference after ANOVA.

The easiest way to do pairwise tests in R is to do them all at once. You may recall from lecture that doing many tests at once can lead to a problem of multiple comparisons.

One solution to the problem of multiple comparisons is called the Bonferroni correction, in which you essentially divide your \(\alpha\) cutoff by the number of tests you are going to run. But there are other, more sophisticated methods. One of these is called Tukey’s Honest Significant Difference, which we will use here.
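Just to make the Bonferroni idea concrete before moving on to Tukey’s method: with five groups there are choose(5, 2) = 10 pairwise comparisons, so a Bonferroni correction at \(\alpha = 0.05\) would use 0.05/10 = 0.005 as the cutoff for each individual test.

```r
# Bonferroni correction for pairwise comparisons among five groups
k <- 5                      # number of groups
n_tests <- choose(k, 2)     # 10 pairwise comparisons
alpha_adj <- 0.05 / n_tests
alpha_adj                   # 0.005
```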

Let’s create a 90% confidence interval for the difference between the number of hours worked by widowed and married people.

TukeyHSD(a1, conf.level = 0.9)
##   Tukey multiple comparisons of means
##     90% family-wise confidence level
## 
## Fit: aov(formula = number_of_hours_worked_last_week ~ marital_status, data = GSS)
## 
## $marital_status
##                               diff        lwr       upr     p adj
## Married-Divorced        -0.5596404  -3.268439  2.149159 0.9865273
## Never married-Divorced  -1.0743991  -3.943590  1.794792 0.8884886
## Separated-Divorced       2.1674613  -3.723293  8.058216 0.8947048
## Widowed-Divorced        -6.5712528 -12.056479 -1.086027 0.0267860
## Never married-Married   -0.5147588  -2.747980  1.718462 0.9797223
## Separated-Married        2.7271017  -2.881464  8.335668 0.7530687
## Widowed-Married         -6.0116124 -11.192609 -0.830616 0.0351633
## Separated-Never married  3.2418605  -2.445905  8.929626 0.6256307
## Widowed-Never married   -5.4968536 -10.763483 -0.230224 0.0766051
## Widowed-Separated       -8.7387141 -16.101195 -1.376233 0.0290448

This gives a 90% confidence interval of (-11.192609, -0.830616) for the difference in means \(\mu_{\text{widowed}}-\mu_{\text{married}}\).

Conclusions

Significance: Our overall test for an association between mean number of hours worked and marital status shows strong evidence (p-value = 0.0269) that the mean number of hours worked differs for at least one of the five groups: married, never married, divorced, separated, and widowed.

Following up with the pairwise tests we find the 90% confidence interval for the difference between the number of hours worked by widowed and married people is

(-11.192609, -0.830616)

This means we are 90% confident that, on average, married people work 0.83 to 11.19 hours more per week than widowed people.

The significant comparisons all involve the widowed group of people. There is a significant difference in mean number of hours worked between the groups widowed vs divorced, widowed vs married, and widowed vs separated.
However, the difference in mean number of hours \(\mu_{\text{widowed}} - \mu_{\text{never married}} \approx -5.5\) is not significant at the 0.05 level.

Generalization: Since the GSS is a random sample of US households, we can generalize our findings to US households. In other words, on average, married, divorced, and separated people work more hours than widowed people. This makes sense, since widowed people are likely to be older and thus retired!

Causation: Since this is an observational study we cannot conclude any type of causal relationship between marital status and number of hours worked last week.

Exercises

A British study (North, Shilcock, and Hargreaves, 2003) examined whether the type of background music playing in a restaurant affected the amount of money that diners spent on their meals. The researchers asked a restaurant to alternate silence, popular music, and classical music on successive nights over 18 days. Each type of music was played for six nights (the order was randomly determined to guard against confounding). The data file RestaurantMusic gives the type of music played and the amount spent on food and drinks (in British pounds) for each customer.

The data can be found at this url: http://www.isi-stats.com/isi/data/chap9/RestaurantMusic.txt

  1. Load the data

  2. Create side-by-side boxplots to show the relationship between the music played and the amount of money diners spent.

  3. Calculate and report the sample means and standard deviations for the amount spent on food and drinks for each music type.

Classical:

Pop:

None:

  4. Construct dotplots, for each of the three different treatments of music, of amounts spent on food and drink. Include appropriate titles for each plot.

  5. State and check validity conditions to see whether you can run a theory-based ANOVA test. Be sure to explain how you are checking.

Validity Conditions:

  6. If appropriate, carry out the theory-based test of significance. Be sure to report the p-value.

p-value:

  7. If a significant result was found, construct additional follow-up confidence intervals to see where differences between the groups are.

  8. Write up your conclusions regarding significance with context. Report and interpret any pairwise 95% confidence intervals. Address applicable causation and generalization conclusions that can be made for this study.

Significance:

Confidence intervals:

Causation:

Generalization: