Setup and packages

We use two packages in this course: mosaic and ggformula. To load a package, you use the library() function, wrapped around the name of a package. I’ve put the code to load one package into the chunk below. Add the other two you need.

library(mosaic)
library(ggformula)
# put in the other package you need here

Loading in data

We will load the example data, GSS22clean.csv, from the url: https://raw.githubusercontent.com/IJohnson-math/Math138/main/GSS22clean.csv

Recall, this dataset comes from the General Social Survey (GSS), which is collected by NORC at the University of Chicago. It is a random sample of households from the United States, and has been running since 1972, so it is very useful for studying trends in American life. The data I’ve given you is a subset of the questions asked in the survey, and the data has been cleaned to make it easier to use.

We’ll use the read.csv() function to read in the data. Remember that quotes are required around the url address.

GSS22 <- read.csv("https://raw.githubusercontent.com/IJohnson-math/Math138/main/GSS22clean.csv")

Inference for many means (ANOVA)

We’ve seen how to do inference for a single mean, and for a difference of means. Now, we’ll study how to do inference on many means. To do this, we’ll use a procedure called ANOVA for ANalysis Of VAriance. We’re comparing the variability within groups to the variability between groups using the F-statistic.

The ANOVA test is a hypothesis testing (no confidence intervals until after ANOVA) and the hypotheses are always of this form:

\[ H_0: \mu_1 = \mu_2 = \dots = \mu_k \\ H_A: \text{At least one of the means is different} \]

Let’s consider the variables marital_status and number_of_hours_worked_last_week. We will investigate whether or not there is an association between marital status and the mean number of hours worked per week.

View the Data

#filtering out NA values
GSS22 <- filter(GSS22, !is.na(marital_status))
GSS22 <- filter(GSS22, !is.na(hours_worked_last_week))

You should notice that the number of observational units decreased from 3544 to 1933, meaning that 1611 observational units have no recorded data for either martial_status or hours_worked_last_week or both.

Make a set of side-by-side boxplots to show the relationship between those two variables. Note you might have to filter the data to remove NA values. Use the all_the_R_so_far_M138.Rmd file if you don’t remember the code.

#side-by-side boxplots
gf_boxplot(hours_worked_last_week ~ marital_status, data= GSS22)

Another graph to look for skewness or symmetric data is gf_jitter()

#a jitterplot 
gf_jitter(hours_worked_last_week ~ marital_status, data= GSS22, alpha = 0.3, width=0.2)

Specifying the null and alternative hypotheses

How many groups do we have here? Five.

So our hypotheses here are

\[ H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4 = \mu_5 \\ H_a: \text{At least one of the means is different} \] or, if we wanted to name the means according to the group names,

\[ H_0: \mu_{\text{divorced}} = \mu_{\text{married}} = \mu_{\text{never married}} = \mu_{\text{separated}} = \mu_{\text{widowed}} \\ H_a: \text{At least one of the means is different} \]

Validity Conditions

To perform ANOVA in R, we can use the built-in R function aov(), which will use an F distribution as the distributional approximation for the null distribution. In order to use this distributional approximation, two conditions must be met:

  • Sample size/shape: Either the sample size is at least 20 for all the groups, without strong skewness or outliers; or if the sample size is less than 20, then the distribution of the response variable is approximately symmetric in all the samples (examine the dotplots for strong skewness or outliers)

  • SD: the variances of each group should be approximately equal. In practice, we check this by determining if \(sd_{max}< 2\times sd_{min}\).

How can we check those conditions?

#calculating our favorite statistics for each category of marital status
favstats(hours_worked_last_week ~ marital_status, data= GSS22)
##   marital_status min Q1 median   Q3 max     mean       sd   n missing
## 1       divorced   0 37     40 48.0  89 41.25890 14.41447 309       0
## 2        married   0 40     40 48.0  89 40.89529 14.19947 850       0
## 3  never married   0 35     40 45.0  89 39.15237 13.42575 676       0
## 4      separated   8 40     40 46.5  89 41.51064 14.38730  47       0
## 5        widowed   0 19     40 40.0  89 34.07843 17.67579  51       0
#another option to check for strong skewness
Married <- filter(GSS22, marital_status=='married')
gf_histogram(~hours_worked_last_week, data=Married, title="Number of hours worked last week by Married people")

Can you create histograms for the hours worked last week of the other groups?

#another option to check for strong skewness
Divorced <- filter(GSS22, marital_status=='divorced')
gf_histogram(~hours_worked_last_week, data=Divorced, title="Number of hours worked last week by Divorced people")

#another option to check for strong skewness
NeverMarried <- filter(GSS22, marital_status=='never married')
gf_histogram(~hours_worked_last_week, data=NeverMarried, title="Number of hours worked last week by Never Married people")

#another option to check for strong skewness
Separated <- filter(GSS22, marital_status=='separated')
gf_histogram(~hours_worked_last_week, data=Separated, title="Number of hours worked last week by Separated people")

#another option to check for strong skewness
Widowed <- filter(GSS22, marital_status=='widowed')
gf_histogram(~hours_worked_last_week, data=Widowed, title="Number of hours worked last week by Widowed people")

Checking the Validity Conditions

The group sizes are 850, 309, 676, 47, and 51, all of which are greater than 20. The jitter and histogram plots below show no more than moderate skewness.

favstats(hours_worked_last_week~ marital_status, data=GSS22)
##   marital_status min Q1 median   Q3 max     mean       sd   n missing
## 1       divorced   0 37     40 48.0  89 41.25890 14.41447 309       0
## 2        married   0 40     40 48.0  89 40.89529 14.19947 850       0
## 3  never married   0 35     40 45.0  89 39.15237 13.42575 676       0
## 4      separated   8 40     40 46.5  89 41.51064 14.38730  47       0
## 5        widowed   0 19     40 40.0  89 34.07843 17.67579  51       0

From the favstats calculations we see the minimum sd is 13.4 and the maximum sd of 17.6 is well below 2(13.4) = 26.8.

Thus, validity conditions are satisfied.

ANOVA test

To perform the ANOVA hypothesis test we use the aov( ) function. The input has two main parts: response_variable ~ explanatory_variable and data=NAME_OF_DATA_FILE. The aov( ) function calculates and stores many pieces of information. We will limit our view to a summary display of aov(). To view the summary we will name the output (in the code below I’ve named the result a1) and then look at the summary of the output.

a1 <- aov(hours_worked_last_week ~ marital_status, data = GSS22)

Now, let’s run summary() on that object,

summary(a1)
##                  Df Sum Sq Mean Sq F value  Pr(>F)   
## marital_status    4   3490   872.6   4.404 0.00151 **
## Residuals      1928 381988   198.1                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

What are the degrees of freedom for the groups? 4

The total degrees of freedom? 1932

The degrees of freedom for the error? 1932-4 = 1928

What is the test statistic? F=4.04

What is the p-value? 0.00151

What is our general conclusion at the \(\alpha=0.05\) level? Very strong evidence against the null. We conclude that the mean number of hours worked last week is different for at least one of the five marital status groups. We follow this analysis with pairwise confidence intervals for the difference of two means to find any significant differences.

Inference after ANOVA

We will create pairwise confidence intervals in R all at once. You may recall from class that doing many tests at once can lead to a problem of multiple comparisons which compound the likelihood of a Type I error. One solution to the problem of multiple comparisons is the Bonferroni correction, which essentially divides the \(\alpha\) cutoff value by the number of tests you are going to run. However, there are other, more sophisticated methods. One of these is called Tukey’s Honest Significant Difference, which we will use here.

Let’s create an 95% confidence interval for the difference between the number of hours worked by widowed and married people.

TukeyHSD(a1, conf.level = 0.95)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = hours_worked_last_week ~ marital_status, data = GSS22)
## 
## $marital_status
##                               diff        lwr        upr     p adj
## married-divorced        -0.3636056  -2.916565  2.1893541 0.9951658
## never married-divorced  -2.1065328  -4.745637  0.5325710 0.1879153
## separated-divorced       0.2517386  -5.765369  6.2688461 0.9999616
## widowed-divorced        -7.1804683 -12.989154 -1.3717828 0.0067398
## never married-married   -1.7429273  -3.723475  0.2376209 0.1149329
## separated-married        0.6153442  -5.143413  6.3741014 0.9984208
## widowed-married         -6.8168627 -12.357488 -1.2762374 0.0071205
## separated-never married  2.3582714  -3.439189  8.1557321 0.8010060
## widowed-never married   -5.0739355 -10.654777  0.5069063 0.0950258
## widowed-separated       -7.4322069 -15.203083  0.3386687 0.0686741

95% confidence intervals: (-12.989154, -1.3717828) for difference in means \(\mu_{widowed}-\mu_{divorced}\).

(-12.357488 -1.2762374) for difference in means \(\mu_{widowed}-\mu_{married}\).

Conclusions

Significance: Our overall ANOVA test for association between mean number of hours worked and marital status shows very strong evidence (p-value = 0.00151) of a significant difference in mean number of hours worked for at least one of the five groups: married, never married, divorced, separated, widowed.

Following up with the pairwise tests we find two significant differences:

  • The 95% confidence interval for the difference between the number of hours worked by widowed and divorced people is (-12.989154, -1.3717828), meaning that we are 95% confidence that the average hours worked by divorced people is 1.37 to 12.9 hours more than the average hours worked by people that are widowed.

  • The 95% confidence interval for the difference between the number of hours worked by widowed and married people is (-12.357488 -1.2762374), meaning that we are 95% confidence that divorced people work on average 1.27 to 12.35 hours more than widowed people.

  • All other pairwise comparisons do not show a significant difference in mean number of hours worked. This is perhaps a bit surprising as the difference between widowed and separated groups resulted in the largest difference in means. However, these two groups also have the smallest sample sizes of 51 and 47 respectively. Recall, that when sample sizes are smaller the width of the confidence interval is larger. The larger width in confidence interval is seen using Tukey’s Honest Significant Difference a hypothesis test results in a p-value of 0.06 (not significant at the 0.05 level) and a 95% confidence interval that just barely includes 0.

Generalization: Since the GSS is a random sample of US households, we can generalize our findings. In other words, the households that consist of widowed people work significantly less than households with married and divorced households. This should make sense since people that are widowed are likely to be older and thus retired.

Causation: Since this is an observational study we cannot conclude any type of causal relationship between marital status and number of hours worked last week.