We use two packages in this course: mosaic and ggformula. To load a package, you wrap the library() function around the name of the package. I’ve put the code to load one package into the chunk below. Add the other one you need.
library(mosaic)
library(ggformula)  # the other package we need
We will load the example data, GSS_clean.csv, from the url: https://raw.githubusercontent.com/IJohnson-math/Math138/main/GSS_clean.csv
Recall, this dataset comes from the General Social Survey (GSS), which is collected by NORC at the University of Chicago. It is a random sample of households from the United States, and has been running since 1972, so it is very useful for studying trends in American life. The data I’ve given you is a subset of the questions asked in the survey, and the data has been cleaned to make it easier to use.
We’ll use the read.csv() function to read in the data. Remember that quotes are required around the url address.
GSS <- read.csv("https://raw.githubusercontent.com/IJohnson-math/Math138/main/GSS_clean.csv")
We’ve seen how to do inference for a single mean, and for a difference of means. Now, we’ll study how to do inference on many means. To do this, we’ll use a procedure called ANOVA, short for ANalysis Of VAriance. The idea is to compare the variability within groups to the variability between groups using the F-statistic.
When we’re doing ANOVA, we just do hypothesis testing (no confidence intervals until after ANOVA), and the hypotheses are always the same:
\[ H_0: \mu_1 = \mu_2 = \dots = \mu_k \\ H_A: \text{At least one of the means is different} \]
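The F-statistic formalizes this comparison of between-group to within-group variability. With \(k\) groups of sizes \(n_i\), group means \(\bar{x}_i\), group standard deviations \(s_i\), grand mean \(\bar{x}\), and total sample size \(n\), the standard ANOVA decomposition gives

\[ F = \frac{MSG}{MSE} = \frac{\sum_{i=1}^{k} n_i(\bar{x}_i - \bar{x})^2 \,/\, (k-1)}{\sum_{i=1}^{k} (n_i - 1)s_i^2 \,/\, (n-k)} \]

Under \(H_0\), this statistic follows an F distribution with \(k-1\) and \(n-k\) degrees of freedom.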
Let’s consider the variables marital_status and number_of_hours_worked_last_week.
Make a set of side-by-side boxplots to show the relationship between those two variables. Note you might have to filter the data to remove NA values. Use the all_the_R_so_far_M138.Rmd file if you don’t remember the code.
#filtering out NA values
GSS <- filter(GSS, !is.na(marital_status))
GSS <- filter(GSS, !is.na(number_of_hours_worked_last_week))
#side-by-side boxplots
gf_boxplot(number_of_hours_worked_last_week ~ marital_status, data= GSS)
Another graph for checking skewness or symmetry is gf_jitter().
#a jitterplot
gf_jitter(number_of_hours_worked_last_week ~ marital_status, data= GSS, alpha = 0.3, width=0.2)
How many groups do we have here? Five.
So our hypotheses are technically
\[ H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4 = \mu_5 \\ H_A: \text{At least one of the means is different} \] or, if we wanted to get specific,
\[ H_0: \mu_{\text{divorced}} = \mu_{\text{married}} = \mu_{\text{never married}} = \mu_{\text{separated}} = \mu_{\text{widowed}} \\ H_A: \text{At least one of the means is different} \]
To do ANOVA in R, we can use the built-in R function aov(), which will use an F distribution as the distributional approximation for the null distribution. In order to use this distributional approximation, two conditions must be met:
Either the sample size is at least 20 in every group, without strong skewness or outliers; or, if a group’s sample size is less than 20, then the distribution of the response variable is approximately symmetric in every sample (examine the dotplots for strong skewness or outliers).
The variances of the groups should be approximately equal. In practice, we check this by determining whether \(sd_{max} < 2\times sd_{min}\).
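As a sketch of how both conditions can be checked directly from the favstats() output (assuming GSS has already been loaded and filtered as above):

```r
library(mosaic)

# summary statistics for each category of marital status
stats <- favstats(number_of_hours_worked_last_week ~ marital_status, data = GSS)

# condition 1: is every group size at least 20?
all(stats$n >= 20)

# condition 2: is the largest sd less than twice the smallest sd?
max(stats$sd) < 2 * min(stats$sd)
```

Both expressions should return TRUE for these data; we still examine the plots by eye for strong skewness and outliers.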
How can we check those conditions?
#calculating our favorite statistics for each category of marital status
favstats(number_of_hours_worked_last_week ~ marital_status, data= GSS)
## marital_status min Q1 median Q3 max mean sd n missing
## 1 Divorced 1 35.0 40 50.0 89 42.04184 14.74340 239 0
## 2 Married 1 37.0 40 50.0 89 41.48220 14.63485 618 0
## 3 Never married 1 35.0 40 48.0 89 40.96744 13.71926 430 0
## 4 Separated 15 37.5 40 54.5 72 44.20930 14.24205 43 0
## 5 Widowed 2 25.0 40 43.5 86 35.47059 16.68455 51 0
#another option to check for strong skewness
Married <- filter(GSS, marital_status=='Married')
gf_histogram(~number_of_hours_worked_last_week, data=Married, title="Number of hours worked last week by Married people")
Can you create histograms for the hours worked last week of the other groups?
Checking the Validity Conditions
The group sizes are 239, 618, 430, 43, and 51, all of which are greater than 20. The jitter plot and histograms show no more than moderate skewness, and the boxplots show outliers both above and below the IQR.
From the favstats calculations we see that the minimum sd is 13.7, and the maximum sd of 16.7 is well below 2(13.7) = 27.4.
Thus, validity conditions are satisfied.
Next we can use the aov() command. It works best if you save the result from aov() into a named R object. In the code below I’ve named the result a1.
Try running the following code. What happens in your Environment?
a1 <- aov(number_of_hours_worked_last_week ~ marital_status, data = GSS)
Now, let’s run summary() on that object:
summary(a1)
## Df Sum Sq Mean Sq F value Pr(>F)
## marital_status 4 2296 574.0 2.752 0.0269 *
## Residuals 1376 287065 208.6
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
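As a check on where that p-value comes from, we can compute the upper tail of the F distribution ourselves with pf(), using the F value and the two degrees of freedom from the table:

```r
# P(F >= 2.752) with df1 = 4 (marital_status) and df2 = 1376 (Residuals)
pf(2.752, df1 = 4, df2 = 1376, lower.tail = FALSE)
```

This should match the Pr(>F) column from the summary, about 0.027.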
What are the degrees of freedom for the groups? The degrees of freedom for the error? The total degrees of freedom?
What is the test statistic? What is the p-value?
What is our general conclusion at the \(\alpha=0.05\) level?
Because we found a significant p-value and concluded that at least one of the means is different, we can do inference after ANOVA.
The easiest way to do pairwise tests in R is to do them all at once. You may recall from lecture that doing many tests at once can lead to a problem of multiple comparisons.
One solution to the problem of multiple comparisons is called the Bonferroni correction, in which you divide your \(\alpha\) cutoff by the number of tests you are going to run. But there are other, more sophisticated methods. One of these is called Tukey’s Honest Significant Difference, which we will use here.
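For instance, with five groups there are choose(5, 2) = 10 pairwise comparisons, so a Bonferroni-adjusted cutoff would look like this (a sketch of the idea, not the method we use below):

```r
k <- 5                # number of groups
m <- choose(k, 2)     # number of pairwise tests: 10
alpha <- 0.05 / m     # Bonferroni-adjusted cutoff: 0.005

# R can also apply the correction to the p-values directly:
# pairwise.t.test(GSS$number_of_hours_worked_last_week,
#                 GSS$marital_status, p.adjust.method = "bonferroni")
```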
Let’s create a 90% confidence interval for the difference between the number of hours worked by widowed and married people.
TukeyHSD(a1, conf.level = 0.9)
## Tukey multiple comparisons of means
## 90% family-wise confidence level
##
## Fit: aov(formula = number_of_hours_worked_last_week ~ marital_status, data = GSS)
##
## $marital_status
## diff lwr upr p adj
## Married-Divorced -0.5596404 -3.268439 2.149159 0.9865273
## Never married-Divorced -1.0743991 -3.943590 1.794792 0.8884886
## Separated-Divorced 2.1674613 -3.723293 8.058216 0.8947048
## Widowed-Divorced -6.5712528 -12.056479 -1.086027 0.0267860
## Never married-Married -0.5147588 -2.747980 1.718462 0.9797223
## Separated-Married 2.7271017 -2.881464 8.335668 0.7530687
## Widowed-Married -6.0116124 -11.192609 -0.830616 0.0351633
## Separated-Never married 3.2418605 -2.445905 8.929626 0.6256307
## Widowed-Never married -5.4968536 -10.763483 -0.230224 0.0766051
## Widowed-Separated -8.7387141 -16.101195 -1.376233 0.0290448
This gives the 90% confidence interval (-11.192609, -0.830616) for the difference in means \(\mu_{\text{widowed}}-\mu_{\text{married}}\).
Significance: Our overall test for association between mean number of hours worked and marital status shows strong evidence, a p-value = 0.0269, that there is a significant difference in mean number of hours worked for at least one of the five groups: married, never married, divorced, separated, widowed.
Following up with the pairwise tests, we find the 90% confidence interval for the difference between the number of hours worked by widowed and married people is
(-11.192609, -0.830616)
This means we are 90% confident that the average hours worked by married people is 0.83 to 11.19 hours longer than the average hours worked by people who are widowed.
The significant comparisons all involve the widowed group of people.
There is a significant difference in mean number of hours worked between
the groups widowed vs divorced, widowed vs married, and widowed vs
separated.
However, the difference in mean number of hours \(\mu_{\text{widowed}} - \mu_{\text{never married}} = -5.5\) is not significant at the 0.05 level.
Generalization: Since the GSS is a random sample of US households, we can generalize our findings to the US population. In other words, married, divorced, and separated people work more hours on average than widowed people. This should make sense, since people who are widowed are likely to be older and thus retired!
Causation: Since this is an observational study we cannot conclude any type of causal relationship between marital status and number of hours worked last week.
A British study (North, Shilcock, and Hargreaves, 2003) examined whether the type of background music playing in a restaurant affected the amount of money that diners spent on their meals. The researchers asked a restaurant to alternate silence, popular music, and classical music on successive nights over 18 days. Each type of music was played for six nights (the order was randomly determined to guard against confounding). The data file RestaurantMusic gives the type of music played and the amount spent on food and drinks (in British pounds) for each customer.
The data can be found at this url:
http://www.isi-stats.com/isi/data/chap9/RestaurantMusic.txt
Load the data
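A sketch of one way to load it, assuming the file is a plain-text, tab-delimited table with a header row (typical of the ISI data files):

```r
# read the tab-delimited data file from the ISI website
RestaurantMusic <- read.table(
  "http://www.isi-stats.com/isi/data/chap9/RestaurantMusic.txt",
  header = TRUE, sep = "\t")

# peek at the first few rows to check the column names
head(RestaurantMusic)
```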
Create side-by-side boxplots to show the relationship between the music played and the amount of money diners spent.
Calculate and report the sample means and standard deviations for the amount spent on food and drinks for each music type.
Classical:
Pop:
None:
Construct dotplots, for each of the three different treatments of music, of amounts spent on food and drink. Include appropriate titles for each plot.
State and check validity conditions to see whether you can run a theory-based ANOVA test. Be sure to explain how you are checking.
Validity Conditions:
p-value:
If a significant result was found, construct additional follow-up confidence intervals to see where differences between the groups are.
Write up your conclusions regarding significance with context. Report and interpret any pairwise 95% confidence intervals. Address applicable causation and generalization conclusions that can be made for this study.
Significance:
Confidence intervals:
Causation:
Generalization: