Goals of this lab.

1. to use Theory Based Inference for comparing two or more proportions.

2. to use Theory Based Inference for comparing two means.

3. to use Theory Based Inference for paired data (quantitative response variable).

For every lab we will begin by loading the packages we need. In this lab we load four packages: rmarkdown, openintro, mosaic, and ggformula.

#load packages
library(rmarkdown)
library(openintro)
library(mosaic)    
library(ggformula)

Loading in data

We will start by loading the example data, GSS_clean.csv. We’ll use the read.csv() function to read in the data.

#load data
GSS <- read.csv("https://raw.githubusercontent.com/IJohnson-math/Math138/main/GSS_clean.csv")

Chapters 5 & 8: Theory Based Inference for Two or More Proportions

Example 1.

We consider the research question, are self-employed people more or less likely to believe that marijuana should be made legal than people that work for someone else? We will use data from the GSS (General Social Survey) of a random sample of US adults.

  • explanatory variable is employment type (categorical and binary: self-employed or works for someone else), and
  • response variable is opinion about legalizing marijuana (categorical and binary: legalize, don’t legalize).

Before we begin, we do some additional data cleaning. The commands below remove observational units that have no entry for the variables we are studying. Look at the observational unit count in the GSS data in the Environment pane. After running the code below the number of observational units will decrease to 1393 people.

GSS <- filter(GSS, should_marijuana_be_made_legal != "")
GSS <- filter(GSS, self_emp_or_works_for_somebody != "")

Take a look at the GSS data. There are 45 variables, such as occupation code, marital status, and whether the person supports the death penalty for murder. Let’s look at the variables relevant to our research question, namely should_marijuana_be_made_legal and self_emp_or_works_for_somebody. The code below will create a segmented bar graph. The two bars are for the binary explanatory variable and the fill color represents the binary response variable. The option position = "fill" changes the display from counts to proportions. The proportions represent the proportion of observational units from each employment type that believe marijuana should be legalized.

gf_bar(~self_emp_or_works_for_somebody, fill = ~should_marijuana_be_made_legal , data=GSS, position="fill")

In addition to the segmented bar graph, let’s look at the two-way table of counts and a two-way table of proportions for our data.

#table of counts
tally(should_marijuana_be_made_legal ~ self_emp_or_works_for_somebody, data = GSS)
##                               self_emp_or_works_for_somebody
## should_marijuana_be_made_legal Self-employed Someone else
##                      Legal                98          809
##                      Not legal            44          442

Are the validity conditions met to perform a two-sample z-test? Yes, each of the cells in the two-way table is at least 10.
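
The same check can be done in code. The lines below are a minimal sketch that re-enters the counts from the table above by hand into a base R matrix (the object names counts and conditions_met are our own, not part of the lab) and tests whether every cell is at least 10.

```r
# Counts copied by hand from the two-way table above.
# matrix() fills column by column: first the Self-employed column, then Someone else.
counts <- matrix(c(98, 44, 809, 442), nrow = 2,
                 dimnames = list(opinion = c("Legal", "Not legal"),
                                 employment = c("Self-employed", "Someone else")))
counts

# Validity condition: every cell of the two-way table should be at least 10
conditions_met <- all(counts >= 10)
conditions_met
```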

Below, we add the option format = "proportion" to change the table entries from counts to conditional proportions. Note that the order in the function formatting tally( RESPONSEVariable ~ EXPLANATORYVariable ) is important so that the correct conditional proportions are calculated in the table!

#table of conditional proportions. Be careful with the order of the variables!
tally(should_marijuana_be_made_legal ~ self_emp_or_works_for_somebody, data = GSS, format = "proportion")
##                               self_emp_or_works_for_somebody
## should_marijuana_be_made_legal Self-employed Someone else
##                      Legal         0.6901408    0.6466827
##                      Not legal     0.3098592    0.3533173
98/(98+44)
## [1] 0.6901408

Our statistic is the difference between two sample proportions: the proportion of self-employed people who believe marijuana should be legal and the proportion of people working for someone else who believe marijuana should be legal. We use the notation \(\hat{p}_1 = 0.690\) and \(\hat{p}_2 = 0.647\) for these proportions and calculate the observed difference in proportions below.

phat_1 <- 0.690
phat_2 <- 0.647
obs_diff <- phat_1 - phat_2
obs_diff
## [1] 0.043

Our observed statistic, the difference in sample proportions, is 0.043. In this sample, the proportion of self-employed people who believe marijuana should be made legal is 4.3 percentage points higher than the proportion among people employed by someone else.
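
The difference can also be computed without rounding the proportions first. This is a sketch under the assumption that the counts are typed in by hand from the table of counts; it uses base R's prop.table() rather than tally(), and the names counts, cond_props, and obs_diff_exact are our own.

```r
# Counts copied by hand from the table of counts
counts <- matrix(c(98, 44, 809, 442), nrow = 2,
                 dimnames = list(opinion = c("Legal", "Not legal"),
                                 employment = c("Self-employed", "Someone else")))

# Conditional proportions within each employment group (each column sums to 1)
cond_props <- prop.table(counts, margin = 2)
cond_props

# Unrounded difference: self-employed minus works for someone else
obs_diff_exact <- cond_props["Legal", "Self-employed"] - cond_props["Legal", "Someone else"]
obs_diff_exact
```

The unrounded difference is close to the 0.043 used in this lab, so the rounding has only a small effect on the calculations that follow.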

Next, let’s set up our hypothesis test with \(\pi_1\) representing the proportion of people who are self-employed and believe marijuana should be legal and \(\pi_2\) representing the proportion of people employed by someone else who believe marijuana should be legal.

\[ H_0: \pi_1 - \pi_2 = 0 \\ H_a: \pi_1 - \pi_2 \neq 0 \]

Using the Chi-square test to compare multiple proportions.

In order to do the test, we use the prop.test function shown below. The basic format is

prop.test( RESPONSEVariable ~ EXPLANATORYVariable, data=NAME_OF_DATA )

We add a few more arguments to our function to indicate that we are performing a two-sided test and that we are designating success as the desire to legalize marijuana. Note: to perform a one-sided test of two proportions, instead of "two.sided" we use "greater" or "less".

prop.test(should_marijuana_be_made_legal ~ self_emp_or_works_for_somebody, data = GSS,  conf.level = 0.95, alternative = "two.sided", success = "Legal")
## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  tally(should_marijuana_be_made_legal ~ self_emp_or_works_for_somebody)
## X-squared = 0.87754, df = 1, p-value = 0.3489
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.04100264  0.12791902
## sample estimates:
##    prop 1    prop 2 
## 0.6901408 0.6466827

Our p-value is 0.3489, which is much larger than any commonly used significance level. So, we fail to reject our null hypothesis. We do not have evidence to suggest that the true proportion of people who support the legalization of marijuana differs between self-employed people and those who work for someone else.

Using the Normal distribution for a two-sample z-test.

Another method to use for inference for two proportions is to find the standardized statistic and approximate the null distribution using a normal distribution. For two proportions, in a hypothesis test the standard error is given by \[ SE=\sqrt{\frac{\hat{p}(1-\hat{p})}{n_1}+\frac{\hat{p}(1-\hat{p})}{n_2}} \] where \(\hat{p}\) is the pooled success proportion. Using the 2-way table of counts above, we calculate sample sizes \(n_1\), \(n_2\), and the pooled proportion of success \(\hat{p}\). With those values we can calculate the standard error and the standardized statistic \[ z = \frac{\textrm{statistic} - \textrm{mean of the null distribution}}{\textrm{standard error of the null distribution}}.\]

#sample size for each group, numbers from the table of counts above
n_1 <- 98+44   
n_2 <- 809+442

#define success as support for legalizing marijuana
total_success <- 98+809
total_sample_size <- 98+44+809+442

#calculating pooled proportion of success
pool_prop <- total_success/total_sample_size
pool_prop
## [1] 0.6511127
#mean of the null distribution
null_pi <- 0

#calculation of standard error and standardized z-statistic
SE <- sqrt((pool_prop*(1-pool_prop)/n_1)+(pool_prop*(1-pool_prop)/n_2))
z_stat <- (obs_diff - null_pi)/SE
z_stat
## [1] 1.018814

The value of the z-statistic alone tells us that the observed difference in proportions is not in the tail of the null distribution. To complete the calculation of a p-value we use the function pnorm( ) from Lab 2. By default, pnorm( ) calculates the area under the normal distribution to the left of the observed statistic. Since our observed statistic is positive, we want the area in the right tail, so we use the option lower.tail=FALSE. Since our test is two-sided, we multiply this area by 2.

SSp_value <- 2*pnorm(z_stat, lower.tail = FALSE)
SSp_value
## [1] 0.3082911

Our p-value from the normal distribution gives insufficient evidence to reject the null hypothesis.
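
The two p-values above (0.3489 and 0.3083) differ for two reasons: the prop.test output reports a continuity correction, and the hand calculation used proportions rounded to three decimals. The sketch below, which is our own illustration in base R with hand-entered counts and our own object names, removes both differences. It uses unrounded proportions for z and calls the base R version of prop.test() with correct = FALSE. Under those assumptions the chi-square statistic should equal z squared and the two p-values should agree.

```r
# Counts from the two-way table: successes ("Legal") and group sizes
successes   <- c(98, 809)
group_sizes <- c(98 + 44, 809 + 442)

# Unrounded sample proportions and pooled proportion of success
phat   <- successes / group_sizes
pooled <- sum(successes) / sum(group_sizes)

# Standardized statistic using the unrounded difference
se_null <- sqrt(pooled * (1 - pooled) * sum(1 / group_sizes))
z <- (phat[1] - phat[2]) / se_null

# Two-sided p-value; -abs(z) works whether z is positive or negative
p_z <- 2 * pnorm(-abs(z))

# Chi-square test of two proportions WITHOUT the continuity correction
chi <- prop.test(x = successes, n = group_sizes, correct = FALSE)

# Compare the two approaches
c(z_squared = z^2, X_squared = unname(chi$statistic))
c(p_from_z = p_z, p_from_chisq = chi$p.value)
```

Writing the p-value as 2 * pnorm(-abs(z)) avoids having to decide which tail to use based on the sign of the statistic.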

For two proportions, the standard error used in a confidence interval is given by \[ SE = \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}. \] A 95% 2SD confidence interval can be obtained as shown below.

p_1 <- 0.690
p_2 <- 0.647

SE_conf <- sqrt(p_1*(1-p_1)/n_1 + p_2*(1-p_2)/n_2)
left_endpoint <- obs_diff-2*SE_conf
right_endpoint <- obs_diff+2*SE_conf

#endpoints of the 2SD confidence interval
left_endpoint 
## [1] -0.03919257
right_endpoint
## [1] 0.1251926

Conclusions. Interpreting this calculation, we are 95% confident that the proportion of people who are self-employed and believe marijuana should be made legal is between 3.9 percentage points less and 12.5 percentage points more than the proportion of non-self-employed people who believe marijuana should be made legal. Since this confidence interval contains 0, we conclude that 0 is a plausible value for the difference in proportions. This conclusion aligns with the lack of evidence seen above for rejecting the null hypothesis.
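
The 2SD method uses a multiplier of 2 as a convenient approximation. As a sketch of an alternative, the exact normal multiplier for 95% confidence can be obtained from qnorm(). The rounded proportions and sample sizes below are the ones used in this lab; the names multiplier and ci are our own.

```r
# Values used in this lab
phat_1 <- 0.690
phat_2 <- 0.647
n_1 <- 98 + 44
n_2 <- 809 + 442
obs_diff <- phat_1 - phat_2

# Standard error for a confidence interval (unpooled)
SE_conf <- sqrt(phat_1 * (1 - phat_1) / n_1 + phat_2 * (1 - phat_2) / n_2)

# Exact multiplier for 95% confidence: leaves 2.5% in each tail
multiplier <- qnorm(0.975)
multiplier

# Endpoints of the interval
ci <- obs_diff + c(-1, 1) * multiplier * SE_conf
ci
```

Because the exact multiplier is slightly smaller than 2, this interval is slightly narrower than the 2SD interval, but it still contains 0 and leads to the same conclusion.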

Had our results been significant, we might have considered generalizing to the population of US adults since we started with a random sample of adults. This study is observational, so we cannot conclude cause and effect relationships.

Chapter 6: Theory Based Inference for Two Means

Example 2.

We consider the research question: is the mean number of years of school completed different between people born in the US and those born outside the US? Again we will use data from the GSS (General Social Survey) of a random sample of US adults.

  • explanatory variable is birth location (categorical & binary: born in/outside the US), and

  • response variable is number of years of school completed (quantitative).

Let’s start by looking at the number of years of school completed variable grouped by whether or not the observational unit was born in the US.

gf_boxplot(highest_year_of_school_completed ~ born_in_us, data=GSS, na.rm=TRUE)

The validity conditions for a two-sample t-test are met because we have at least 20 observations in each group and the box plots do not show strong skewness.

We’ll start by computing the observed statistic \(\bar{x}_{\textrm{diff}} = \bar{x}_{1} - \bar{x}_{2}\). We calculate the mean according to the group value of born_in_us, which is a categorical variable with values of Yes/No.

mean(highest_year_of_school_completed ~ born_in_us, data=GSS, na.rm=TRUE)
##       No      Yes 
## 12.84615 13.96975
xbar_diff = 12.85-13.97
xbar_diff
## [1] -1.12

Our observed difference in means is -1.12, which means that in our sample, people born outside the US completed on average 1.12 fewer years of school than people born in the US.
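
A group mean is reported as NA whenever that group contains a missing value, unless the missing values are removed first. The sketch below illustrates this with a small made-up data frame (these are not GSS values) and base R's tapply() in place of the formula interface, so it runs without loading mosaic. The object names are our own.

```r
# Made-up data for illustration only (NOT the GSS values)
toy <- data.frame(
  years      = c(12, 16, NA, 14, 18, 12),
  born_in_us = c("No", "No", "No", "Yes", "Yes", "Yes")
)

# Without na.rm, any group containing an NA gets an NA mean
means_with_na <- tapply(toy$years, toy$born_in_us, mean)
means_with_na

# With na.rm = TRUE, missing values are dropped before averaging
group_means <- tapply(toy$years, toy$born_in_us, mean, na.rm = TRUE)
group_means

# Difference in means, "No" minus "Yes"
toy_diff <- unname(group_means["No"] - group_means["Yes"])
toy_diff
```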

Now, let’s perform a hypothesis test. We want to know if the means are significantly different from one another. We write out our hypotheses as follows.

\[ H_0: \mu_1 - \mu_2 = 0 \\ H_a: \mu_1 - \mu_2 \neq 0 \]

We use the function t.test(RESPONSEVariable ~ EXPLANATORYVariable, data = NAME_OF_DATA ) to find both a p-value and a confidence interval. The options alternative="two.sided", mu = 0 indicate that our hypothesis test is two-sided and the mean of our null distribution is 0. We are also using the option conf.level = 0.99 to calculate a 99% confidence interval for the difference in means.

t.test(highest_year_of_school_completed ~ born_in_us, data=GSS, alternative="two.sided", mu = 0, conf.level = 0.99)
## 
##  Welch Two Sample t-test
## 
## data:  highest_year_of_school_completed by born_in_us
## t = -3.2989, df = 187.39, p-value = 0.001162
## alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
## 99 percent confidence interval:
##  -2.0099450 -0.2372403
## sample estimates:
##  mean in group No mean in group Yes 
##          12.84615          13.96975

Conclusions: Our p-value is 0.001162 and our standardized statistic is \(t=-3.3\), so at the 1% level (and therefore also at the 5% and 10% levels) we have sufficient evidence to reject our null hypothesis. The evidence suggests that there is a difference in the mean number of years of school completed between people born inside and outside the US.

We are 99% confident that the true mean number of years of school completed by people born outside the US is between 0.24 and 2.01 years less than the true mean for people born in the US.
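
The output above is labeled "Welch Two Sample t-test" and reports a non-integer df of 187.39. The sketch below shows where such numbers come from, using two small made-up samples (not GSS data) and our own object names. It computes the Welch standard error, t-statistic, and approximate degrees of freedom by hand, then compares them with the base R t.test() default.

```r
# Made-up samples for illustration only (NOT the GSS values)
group_a <- c(12, 14, 16, 11, 13, 15, 12, 10)
group_b <- c(14, 16, 18, 13, 17, 15, 16, 12, 14)

# Each group's contribution to the variance of the difference in means
v_a <- var(group_a) / length(group_a)
v_b <- var(group_b) / length(group_b)

# Standardized statistic: difference in sample means divided by its standard error
t_manual <- (mean(group_a) - mean(group_b)) / sqrt(v_a + v_b)

# Welch-Satterthwaite approximation to the degrees of freedom (usually not an integer)
df_welch <- (v_a + v_b)^2 /
  (v_a^2 / (length(group_a) - 1) + v_b^2 / (length(group_b) - 1))

# Two-sided p-value from the t-distribution
p_manual <- 2 * pt(-abs(t_manual), df = df_welch)

# t.test() uses the Welch version by default (var.equal = FALSE)
res <- t.test(group_a, group_b)
c(t_manual = t_manual, t_from_test = unname(res$statistic))
c(df_manual = df_welch, df_from_test = unname(res$parameter))
c(p_manual = p_manual, p_from_test = res$p.value)
```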

Since our data is from a random sample of US adults we may generalize this result to the larger population of US adults.

Chapter 7: Inference for paired data

Example 3.

Finally, we want to think about paired data and study the question: Is the mean level of education different between mothers and fathers? Because a person’s mother and father are “naturally paired,” we should make a new variable of the differences within each pair and do inference on that variable.

The following code creates a new variable in the GSS data called diff. This will be the 46th variable in the table.

GSS <- transform(GSS, diff = highest_year_school_completed_father- highest_year_school_completed_mother)

We define \(\mu_{\textrm{diff}} = \mu_{\textrm{father}} - \mu_{\textrm{mother}}\)

Our hypotheses are \[H_0: \mu_{\textrm{diff}} =0 \\ H_a: \mu_{\textrm{diff}} \neq 0\]

We can view this new variable with a histogram.

gf_histogram(~diff, data = GSS, na.rm=TRUE)

To use the theory-based test we need either at least 20 pairs with no strong skewness in the distribution of differences, OR a symmetric distribution of differences. As seen in the histogram above, the validity conditions for inference are met.

Now, we can do inference just as we would for a single mean, starting with finding observed mean difference.

mean(~diff, data = GSS, na.rm = TRUE)
## [1] -0.05744681

Our observed mean difference is -0.057. Since we computed father minus mother, this means that in our sample the mothers completed, on average, 0.057 more years of school than the fathers.

Now, we can do inference similar to what we did in Lab 2.

t.test(~diff, data = GSS, alternative = "two.sided", mu = 0)
## 
##  One Sample t-test
## 
## data:  diff
## t = -0.56114, df = 939, p-value = 0.5748
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  -0.2583577  0.1434641
## sample estimates:
##   mean of x 
## -0.05744681

Our p-value is 0.5748, our standardized statistic is \(t=-0.56\), and our confidence interval (-0.258, 0.143) contains 0. All of these calculations show that we cannot reject the null hypothesis. We do not have evidence that the mean level of education differs between mothers and fathers, and a mean difference of 0 is a plausible value. Note that failing to reject the null hypothesis does not prove that the two means are equal.
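
Doing a one-sample t-test on the differences is the same calculation as a paired t-test on the two original variables. The sketch below illustrates this with made-up paired values (not GSS data) and base R's default t.test() interface. The vectors father and mother and the other object names are our own.

```r
# Made-up paired values for illustration only (NOT the GSS values)
father <- c(12, 16, 10, 14, 12, 18, 8, 12)
mother <- c(12, 14, 12, 14, 13, 16, 10, 12)

# Differences within each pair, father minus mother
d <- father - mother

# One-sample t-test on the differences
one_sample <- t.test(d, mu = 0)

# Paired t-test on the two original vectors
paired <- t.test(father, mother, paired = TRUE)

# The statistics and p-values agree
c(one_sample = unname(one_sample$statistic), paired = unname(paired$statistic))
c(one_sample = one_sample$p.value, paired = paired$p.value)
```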

Exercises.

Complete the following exercises in R. Include the problem statement, your code, and write up your results in an Rmarkdown file. Please use the template: Lab Report from {openintro}.

  1. Students wanted to see whether exercising (in particular doing jumping jacks) would help or hinder people’s ability to memorize a list of 10 words that were read to them. This would be compared to the same test of memorizing words while the subjects were sitting down. The number of words memorized for the 31 subjects can be found in the data file ExerciseMemorize. Load the data using the following code.
ExMem <- read.table("http://www.isi-stats.com/isi/data/chap7/ExerciseMemorize.txt", header = TRUE)
  • What is the explanatory and what is the response variable in this study? List the variable type next to each variable.

Explanatory: Response:

  • Write down the hypotheses in either words or symbols.

  • Add an additional variable diff to ExMem.

  • Are the validity conditions met to perform theory-based inference? Explain and include a graph.

#include graph 
  • Perform the appropriate significance test to determine whether exercising while trying to memorize a list of words helps or hinders the process. Write your conclusions out in sentences.
#inference
  2. Do different generations view marriage differently? A 2010 survey of a random sample of adult Americans conducted by the Pew Research Center asked the following question of each participant: “Is marriage becoming obsolete?” The results from the survey are found in MarriageByGeneration.
#load the data
MarryByGen <- read.csv("https://raw.githubusercontent.com/IJohnson-math/Math138/main/MarriageViewGeneration.csv")
  • Write out the null and alternative hypothesis in symbols or words.

  • Create a segmented bar graph and a two-way table with counts for the MarryByGen dataset. Be careful with the order of your explanatory and response variables!

#segmented bar graph
#two-way table of counts
  • Are the validity conditions met for theory-based inference? Explain.

  • Perform the appropriate test of inference.

#inference
  • Is there significant evidence to reject the null hypothesis? Explain how you know.