For every lab we will begin by loading the four packages: rmarkdown, openintro, mosaic, and ggformula
#load packages
library(rmarkdown)
library(openintro)
library(mosaic)
library(ggformula)
We will start by loading the example data, GSS_clean.csv. We’ll use the read.csv() function to read in the data.
#load data
GSS <- read.csv("https://raw.githubusercontent.com/IJohnson-math/Math138/main/GSS_clean.csv")
We consider the research question: are self-employed people more or less likely than people who work for someone else to believe that marijuana should be made legal? We will use data from the GSS (General Social Survey) of a random sample of US adults.
Before we begin, we do some additional data cleaning. The commands below remove observational units that have no entry for the variables we are studying. Look at the observational unit count in the GSS data in the Environment pane. After running the code below the number of observational units will decrease to 1393 people.
GSS <- filter(GSS, should_marijuana_be_made_legal != "")
GSS <- filter(GSS, self_emp_or_works_for_somebody != "")
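To see what this kind of cleaning does, here is a minimal sketch on a made-up data frame. The data frame toy and its columns opinion and employment are hypothetical and are not part of the GSS data. The sketch uses base R's subset(), which, like filter(), keeps only the rows where the condition is TRUE.

```r
# Toy data (hypothetical): empty strings stand in for missing survey answers
toy <- data.frame(
  opinion    = c("Legal", "", "Not legal", "Legal", "Legal"),
  employment = c("Self-employed", "Someone else", "", "Someone else", "Self-employed")
)

# Number of observational units before cleaning
nrow(toy)

# Keep only rows with a non-empty entry in both variables
toy_clean <- subset(toy, opinion != "" & employment != "")

# Number of observational units after cleaning
nrow(toy_clean)
```

Comparing the two row counts is the same check as watching the observational unit count change in the Environment pane.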
Take a look at the GSS data. There are 45 variables, such as occupation code, marital status, and whether the person supports the death penalty for murder. Let’s look at the variables relevant to our research question, namely should_marijuana_be_made_legal and self_emp_or_works_for_somebody. The code below will create a segmented bar graph. The two bars are for the binary explanatory variable and the fill color represents the binary response variable. The option position = "fill" changes the display from counts to proportions. Each bar shows, for one employment type, the proportion of observational units that believe marijuana should be legalized.
gf_bar(~self_emp_or_works_for_somebody, fill = ~should_marijuana_be_made_legal , data=GSS, position="fill")
In addition to the segmented bar graph, let’s look at the two-way table of counts and a two-way table of proportions for our data.
#table of counts
tally(should_marijuana_be_made_legal ~ self_emp_or_works_for_somebody, data = GSS)
## self_emp_or_works_for_somebody
## should_marijuana_be_made_legal Self-employed Someone else
## Legal 98 809
## Not legal 44 442
Are the validity conditions met to perform a two-sample z-test? Yes, each of the cells in the two-way table is at least 10.
Below, we add the option format = "proportion" to change the table entries from counts to conditional proportions. Note that the order in the function formatting tally( RESPONSEVariable ~ EXPLANATORYVariable ) is important so that the correct conditional proportions are calculated in the table!
#table of conditional proportions. Be careful with the order of the variables!
tally(should_marijuana_be_made_legal ~ self_emp_or_works_for_somebody, data = GSS, format = "proportion")
## self_emp_or_works_for_somebody
## should_marijuana_be_made_legal Self-employed Someone else
## Legal 0.6901408 0.6466827
## Not legal 0.3098592 0.3533173
98/(98+44)
## [1] 0.6901408
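The validity check and the conditional proportions can also be reproduced from the counts alone. The sketch below is a base R alternative to tally(), with the counts typed in by hand from the table above; the object names counts and cond_props are illustrative.

```r
# Two-way table of counts, typed in from the tally() output above
counts <- matrix(
  c(98, 44, 809, 442),
  nrow = 2,
  dimnames = list(
    response    = c("Legal", "Not legal"),
    explanatory = c("Self-employed", "Someone else")
  )
)

# Validity condition: every cell count is at least 10
all(counts >= 10)

# Conditional proportions within each explanatory group (each column sums to 1)
cond_props <- prop.table(counts, margin = 2)
round(cond_props, 4)
```

Using margin = 2 conditions on the columns (the explanatory variable), which plays the same role as putting the explanatory variable on the right of the ~ in tally().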
Our statistic is the difference between the proportion of self-employed people who believe marijuana should be legal and the proportion of people who work for someone else who believe marijuana should be legal. We use the notation \(\hat{p}_1 = 0.690\) and \(\hat{p}_2 = 0.647\) for these proportions and calculate the observed difference in proportions below.
phat_1 <- 0.690
phat_2 <- 0.647
obs_diff <- phat_1 - phat_2
obs_diff
## [1] 0.043
Our observed statistic, the difference in proportions, is 0.043. In our sample, the proportion of self-employed people who believe marijuana should be made legal is 4.3 percentage points higher than the proportion among those who are employed by someone else.
Next, let’s set up our hypothesis test with \(\pi_1\) representing the population proportion of self-employed people who believe marijuana should be legal and \(\pi_2\) representing the population proportion of people who work for someone else who believe marijuana should be legal.
\[ H_0: \pi_1 - \pi_2 = 0 \\ H_a: \pi_1 - \pi_2 \neq 0 \]
In order to do the test, we use the prop.test function shown below. The basic format is
prop.test( RESPONSEVariable ~ EXPLANATORYVariable, data=NAME_OF_DATA )
We add a few more arguments to our function to indicate that we are performing a two-sided test and that we are designating success as the belief that marijuana should be legal. Note: to perform a one-sided test of two proportions, instead of “two.sided” we use “greater” or “less”.
prop.test(should_marijuana_be_made_legal ~ self_emp_or_works_for_somebody, data = GSS, conf.level = 0.95, alternative = "two.sided", success = "Legal")
##
## 2-sample test for equality of proportions with continuity correction
##
## data: tally(should_marijuana_be_made_legal ~ self_emp_or_works_for_somebody)
## X-squared = 0.87754, df = 1, p-value = 0.3489
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.04100264 0.12791902
## sample estimates:
## prop 1 prop 2
## 0.6901408 0.6466827
Our p-value is 0.3489, which is much larger than the usual 0.05 significance level, so we fail to reject our null hypothesis. We do not have evidence that the true proportion of people who support the legalization of marijuana differs between self-employed people and those who work for someone else.
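The same test can also be run directly from the counts, which is handy when only a two-way table is available. The sketch below calls base R's stats::prop.test() with the counts typed in by hand from the table of counts above; the object names successes, group_sizes, and res are illustrative.

```r
# Successes ("Legal") and group sizes, taken from the table of counts above
successes   <- c(98, 809)              # self-employed, someone else
group_sizes <- c(98 + 44, 809 + 442)

# Base R prop.test(); the continuity correction is on by default (correct = TRUE)
res <- stats::prop.test(x = successes, n = group_sizes,
                        alternative = "two.sided", conf.level = 0.95)

res$p.value    # two-sided p-value
res$conf.int   # 95% confidence interval for pi_1 - pi_2
res$estimate   # the two sample proportions
```

Storing the result in an object makes it possible to pull out individual pieces, such as the p-value, instead of reading them from the printed output.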
Another method to use for inference for two proportions is to find the standardized statistic and approximate the null distribution using a normal distribution. For two proportions, in a hypothesis test the standard error is given by \[ SE=\sqrt{\frac{\hat{p}(1-\hat{p})}{n_1}+\frac{\hat{p}(1-\hat{p})}{n_2}} \] where \(\hat{p}\) is the pooled success proportion. Using the 2-way table of counts above, we calculate sample sizes \(n_1\), \(n_2\), and the pooled proportion of success \(\hat{p}\). With those values we can calculate the standard error and the standardized statistic \[ z = \frac{\textrm{statistic} - \textrm{mean of the null distribution}}{\textrm{standard error of the null distribution}}.\]
#sample size for each group, numbers from the table of counts above
n_1 <- 98+44
n_2 <- 809+442

#define success as support for legalizing marijuana
total_success <- 98+809
total_sample_size <- 98+44+809+442

#calculating pooled proportion of success
pool_prop <- total_success/total_sample_size
pool_prop
## [1] 0.6511127
#mean of the null distribution
null_pi <- 0

#calculation of standard error and standardized z-statistic
SE <- sqrt((pool_prop*(1-pool_prop)/n_1)+(pool_prop*(1-pool_prop)/n_2))
z_stat <- (obs_diff - null_pi)/SE
z_stat
## [1] 1.018814
The value of the z-statistic alone tells us that the observed difference in proportions is not in the tail of the null distribution. To complete the calculation of a p-value we use the function pnorm( ) from Lab 2. By default, pnorm( ) calculates the area under the normal distribution to the left of the observed statistic. Since our observed statistic is positive, we want the area in the right tail, so we use the option lower.tail=FALSE. Since our test is two-sided, we multiply this area by 2.
SSp_value <- 2*pnorm(z_stat, lower.tail = FALSE)
SSp_value
## [1] 0.3082911
Our p-value from the normal distribution gives insufficient evidence to reject the null hypothesis.
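The normal-approximation p-value is not identical to the prop.test p-value for two reasons: prop.test applies a continuity correction by default, and the calculation above uses the rounded difference 0.043. As a rough sketch, the helper below (two_prop_z is a hypothetical name, not a function from any package) recomputes the z-statistic from the unrounded counts and compares it with stats::prop.test() run with correct = FALSE.

```r
# Hypothetical helper: pooled two-proportion z-test from raw counts
two_prop_z <- function(x1, n1, x2, n2) {
  phat_1 <- x1 / n1
  phat_2 <- x2 / n2
  pooled <- (x1 + x2) / (n1 + n2)                    # pooled proportion of success
  se     <- sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
  z      <- (phat_1 - phat_2) / se
  p      <- 2 * pnorm(abs(z), lower.tail = FALSE)    # two-sided p-value
  list(z = z, p_value = p)
}

out <- two_prop_z(98, 142, 809, 1251)
out

# Without the continuity correction, prop.test() reports z^2 as X-squared
uncorrected <- stats::prop.test(x = c(98, 809), n = c(142, 1251), correct = FALSE)
sqrt(unname(uncorrected$statistic))
uncorrected$p.value
```

With unrounded proportions the z-statistic comes out slightly larger than the value computed from 0.043, and it matches the square root of the uncorrected X-squared statistic. Using abs(z) in the helper means the right-tail area is doubled whether the observed statistic is positive or negative.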
For two proportions, the standard error used in a confidence interval is given by \[ SE = \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}. \] A 95% 2SD confidence interval can be obtained as shown below.
p_1 <- 0.690
p_2 <- 0.647

SE_conf <- sqrt(p_1*(1-p_1)/n_1 + p_2*(1-p_2)/n_2)
left_endpoint <- obs_diff-2*SE_conf
right_endpoint <- obs_diff+2*SE_conf

#endpoints of the 2SD confidence interval
left_endpoint
## [1] -0.03919257
right_endpoint
## [1] 0.1251926
Conclusions. Interpreting this calculation, we are 95% confident that the proportion of people who are self-employed and believe marijuana should be made legal is between 3.9 percentage points less and 12.5 percentage points more than the proportion of non-self-employed people who believe marijuana should be made legal. Since this confidence interval contains 0, we conclude that 0 is a plausible value for the difference in proportions. This conclusion aligns with the lack of evidence seen above for rejecting the null hypothesis.
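The 2SD interval can be wrapped in a small function so that it works directly from counts. This is only a sketch: two_prop_ci and its multiplier argument are hypothetical names, and because it uses unrounded sample proportions the endpoints differ slightly in the third decimal place from those computed with 0.690 and 0.647.

```r
# Hypothetical helper: 2SD interval for a difference in proportions from counts
two_prop_ci <- function(x1, n1, x2, n2, multiplier = 2) {
  phat_1 <- x1 / n1
  phat_2 <- x2 / n2
  # Unpooled standard error, as in the confidence interval formula
  se   <- sqrt(phat_1 * (1 - phat_1) / n1 + phat_2 * (1 - phat_2) / n2)
  diff <- phat_1 - phat_2
  c(lower = diff - multiplier * se, upper = diff + multiplier * se)
}

ci <- two_prop_ci(98, 142, 809, 1251)
ci

# TRUE when 0 lies inside the interval, i.e. 0 is a plausible difference
ci[["lower"]] < 0 & ci[["upper"]] > 0
```

Passing multiplier = qnorm(0.975) instead of 2 would give the usual 1.96-based 95% interval.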
Had our results been significant, we might have considered generalizing to the population of US adults since we started with a random sample of adults. This study is observational, so we cannot conclude cause and effect relationships.
We consider the research question: is the mean number of years of school completed different between people born in the US and those born outside the US? Again we will use data from the GSS (General Social Survey) of a random sample of US adults.
The explanatory variable is birth location (categorical & binary: born in/outside the US), and
the response variable is number of years of school completed (quantitative).
Let’s start by looking at the number of years of school completed variable grouped by whether or not the observational unit was born in the US.
gf_boxplot(highest_year_of_school_completed ~ born_in_us, data=GSS, na.rm=TRUE)
We see the validity conditions for a two-sample t-test are met because we have at least 20 observations in each group and the box plots show the skewness is not strong.
We’ll start by computing the observed statistic \(\bar{x}_{\textrm{diff}} = \bar{x}_{1} - \bar{x}_{2}\). We calculate the mean according to the group value of born_in_us, which is a categorical variable with values of Yes/No.
mean(highest_year_of_school_completed ~ born_in_us, data=GSS, na.rm=TRUE)
## No Yes
## 12.84615 13.96975
xbar_diff <- 12.85-13.97
xbar_diff
## [1] -1.12
Our observed difference in means is -1.12, which means that, in our sample, people born outside the US completed on average 1.12 fewer years of school than people born in the US.
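Missing values matter when computing group means: a single NA in a group makes that group's mean NA unless missing values are removed first. The sketch below illustrates this on a made-up data frame (toy and its values are hypothetical, not GSS data), using base R's tapply() in place of the formula interface.

```r
# Made-up years of schooling; one value in the "No" group is missing
toy <- data.frame(
  born_in_us = c("No", "No", "No", "Yes", "Yes", "Yes"),
  years      = c(12, NA, 14, 16, 12, 14)
)

# A single NA makes the mean of that group NA
tapply(toy$years, toy$born_in_us, mean)

# na.rm = TRUE drops missing values before averaging
group_means <- tapply(toy$years, toy$born_in_us, mean, na.rm = TRUE)
group_means

# Difference in means, group "No" minus group "Yes"
xbar_diff_toy <- group_means[["No"]] - group_means[["Yes"]]
xbar_diff_toy
```

Computing the difference from the stored group means, rather than retyping rounded values, avoids rounding error in later steps.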
Now, let’s perform a hypothesis test. We want to know if the means are significantly different from one another. We write out our hypotheses as follows.
\[
H_0: \mu_1 - \mu_2 = 0 \\
H_a: \mu_1 - \mu_2 \neq 0
\]
We use the function t.test(RESPONSEVariable ~ EXPLANATORYVariable, data = NAME_OF_DATA ) to find both a p-value and a confidence interval. The options alternative="two.sided", mu = 0 indicate that our hypothesis test is two-sided and the mean of our null distribution is 0. We are also using the option conf.level = 0.99 to calculate a 99% confidence interval for the difference in means.
t.test(highest_year_of_school_completed ~ born_in_us, data=GSS, alternative="two.sided", mu = 0, conf.level = 0.99)
##
## Welch Two Sample t-test
##
## data: highest_year_of_school_completed by born_in_us
## t = -3.2989, df = 187.39, p-value = 0.001162
## alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
## 99 percent confidence interval:
## -2.0099450 -0.2372403
## sample estimates:
## mean in group No mean in group Yes
## 12.84615 13.96975
Conclusions: Our p-value is 0.001162 and our standardized statistic is \(t=-3.3\), so at both the 5% level and the 1% level we have sufficient evidence to reject our null hypothesis. Our evidence suggests that there is a difference in mean number of years of school completed between people born inside and outside the US.
We are 99% confident that the true mean number of years of school completed by people born outside the US is between 2.01 years less and 0.24 years less than the true mean for people born in the US.
Since our data is from a random sample of US adults we may generalize this result to the larger population of US adults.
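To connect the t.test output to the standardized statistic, the sketch below computes the Welch t statistic by hand and compares it with the value reported by stats::t.test(). The data are simulated, not GSS data, and the group sizes, means, and standard deviations are arbitrary choices made only for illustration.

```r
set.seed(138)  # arbitrary seed so the simulated data are reproducible

# Simulated (not GSS) years of schooling for two groups
sim <- data.frame(
  born_in_us = rep(c("No", "Yes"), times = c(40, 200)),
  years      = c(rnorm(40, mean = 13, sd = 4), rnorm(200, mean = 14, sd = 3))
)

# Welch two-sample t-test, same formula interface as in the lab
fit <- stats::t.test(years ~ born_in_us, data = sim,
                     alternative = "two.sided", mu = 0, conf.level = 0.99)

# The same t statistic by hand: (xbar_1 - xbar_2) / sqrt(s_1^2/n_1 + s_2^2/n_2)
no  <- sim$years[sim$born_in_us == "No"]
yes <- sim$years[sim$born_in_us == "Yes"]
t_by_hand <- (mean(no) - mean(yes)) /
  sqrt(var(no) / length(no) + var(yes) / length(yes))

c(t_test = unname(fit$statistic), by_hand = t_by_hand)
fit$conf.int  # 99% confidence interval for mu_No - mu_Yes
```

The groups are ordered alphabetically, so the difference is group No minus group Yes, matching the order shown in the t.test output.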
Finally, we want to think about paired data and study the question: is the mean level of education different between mothers and fathers? Because a person’s mother and father are “naturally paired,” we should make a new variable of the differences and do inference on that.
The following code creates a new variable in the GSS data called diff. This will be the 46th variable in the table.
GSS <- transform(GSS, diff = highest_year_school_completed_father - highest_year_school_completed_mother)
We define \(\mu_{\textrm{diff}} = \mu_{\textrm{father}} - \mu_{\textrm{mother}}\)
Our hypotheses are \[H_0: \mu_{\textrm{diff}} =0 \\ H_a: \mu_{\textrm{diff}} \neq 0\]
We can view this new variable with a histogram.
gf_histogram(~diff, data = GSS, na.rm=TRUE)
To use the theory-based test we need at least 20 pairs with no strong skewness in the distribution of differences, or else a symmetric distribution of differences. As seen in the histogram above, the validity conditions for inference are met.
Now, we can do inference just as we would for a single mean, starting with finding observed mean difference.
mean(~diff, data = GSS, na.rm = TRUE)
## [1] -0.05744681
Our observed mean difference is -0.057, which means that, in our sample, mothers completed on average 0.057 more years of school than fathers.
Now, we can do inference similar to what we did in Lab 2.
t.test(~diff, data = GSS, alternative = "two.sided", mu = 0)
##
## One Sample t-test
##
## data: diff
## t = -0.56114, df = 939, p-value = 0.5748
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -0.2583577 0.1434641
## sample estimates:
## mean of x
## -0.05744681
Our p-value is 0.5748, our standardized statistic is \(t=-0.56\), and our confidence interval (-0.258, 0.143) contains 0. All of these calculations show that we cannot reject the null hypothesis. We do not have evidence that the mean level of education differs between mothers and fathers; a mean difference of 0 remains plausible.
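Doing a one-sample t-test on the differences is equivalent to a paired t-test on the two original variables. The sketch below checks this on simulated data; the vectors father and mother are made up for illustration and are not the GSS variables.

```r
set.seed(138)  # arbitrary seed; the data below are simulated, not GSS

# Simulated years of schooling for 50 father/mother pairs
father <- round(rnorm(50, mean = 12, sd = 3))
mother <- round(father + rnorm(50, mean = 0, sd = 2))

# One-sample t-test on the differences
d <- father - mother
one_sample <- stats::t.test(d, mu = 0, alternative = "two.sided")

# Paired t-test on the two original vectors
paired <- stats::t.test(father, mother, paired = TRUE, alternative = "two.sided")

# Both approaches give the same statistic and p-value
c(one_sample = unname(one_sample$statistic), paired = unname(paired$statistic))
c(one_sample = one_sample$p.value, paired = paired$p.value)
```

Creating the diff variable explicitly has the added benefit that the differences can be graphed to check the validity conditions.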
Complete the following exercises in R. Include the problem statement, your code, and write up your results in an Rmarkdown file. Please use the template: Lab Report from {openintro}.
ExMem <- read.table("http://www.isi-stats.com/isi/data/chap7/ExerciseMemorize.txt", header = TRUE)
Explanatory: Response:
Write down the hypotheses in either words or symbols.
Add an additional variable diff in ExMem.
Are validity conditions met to perform theory-based inference? Explain and include a graph.
#include graph
#inference
#load the data
MarryByGen <- read.csv("https://raw.githubusercontent.com/IJohnson-math/Math138/main/MarriageViewGeneration.csv")
Write out the null and alternative hypothesis in symbols or words.
Create a segmented bar graph and a two-way table with counts for the MarryByGen dataset. Be careful with the order of your explanatory and response variables!
#segmented bar graph
#two-way table of counts
Are the validity conditions met for theory-based inference? Explain.
Perform the appropriate test of inference.
#inference