Lab 2: Inference for One Proportion

Goals for this lab.

To check Validity Conditions for Theory-Based methods for Inference with One Proportion
To apply Theory-Based methods for Inference with One Proportion when Validity conditions are satisfied and draw appropriate conclusions.
To apply calculation techniques and create graphics using tools from Lab 1.

Setup and packages

We use two packages in this course: mosaic and ggformula. To load a package, you use the library() function, wrapped around the name of a package. I’ve put the code to load one package into the chunk below. Add the other package you need.

library(mosaic)
# the other package that you need
library(ggformula)

This step of loading our two main packages will be a necessary first step in all of our labs this semester.

Loading in data

We’ll load the example data, GSS_clean.csv. It should be inside the data folder in your RStudio Cloud, and it is also available at this URL: https://raw.githubusercontent.com/IJohnson-math/Math138/main/GSS_clean.csv. We’ll use the read.csv() function to read in the data.

#load data
GSS <- read.csv("https://raw.githubusercontent.com/IJohnson-math/Math138/main/GSS_clean.csv")

This dataset comes from the General Social Survey (GSS), which is collected by NORC at the University of Chicago. It is a random sample of households from the United States, and has been running since 1972, so it is very useful for studying trends in American life. The data I’ve given you is a subset of the questions asked in the survey, and the data has been cleaned to make it easier to use. But, there are still some messy aspects (which we’ll discover as we analyze it further throughout this class!).

Basic commands to view the data.

Recall from Lab 1, the following commands can be used to view parts of the data: glimpse, head, tail. You can also view the data in another tab by clicking on ‘GSS’ in the Environment pane. This allows you to scroll up and down, and left and right to view the data.

Use these commands on the GSS data to answer the following questions.

What are the observational units?
How many observational units are there?
Is there any missing data? How can you tell?
How many variables are there?
Name a couple of variables that are quantitative.

Number of children, age of respondent, highest year of school completed, number of siblings, & many others.

Pick one quantitative variable and calculate the mean value.

mean(~highest_year_of_school_completed, data=GSS, na.rm=TRUE)

## [1] 13.73177
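The na.rm=TRUE argument matters here because a single missing value otherwise turns the whole mean into NA. A minimal sketch with made-up numbers (the vector years_toy is hypothetical and is not taken from the GSS data):

```r
# Hypothetical toy values, not GSS data
years_toy <- c(12, 16, NA, 14)

# With a missing value present, the default mean is NA
mean_default <- mean(years_toy)

# na.rm = TRUE drops the NA before averaging: (12 + 16 + 14) / 3
mean_no_na <- mean(years_toy, na.rm = TRUE)

print(mean_default)
print(mean_no_na)
```

The formula version used above, mean(~variable, data=..., na.rm=TRUE), passes na.rm along in the same way.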

Name a couple of variables that are categorical.

college major 1, college major 2, diploma/GED/Other, self-employed or works for somebody, Occupation code, & many others.

Pick one categorical variable and create a table of counts for the categories.

tally(~labor_force_status, data=GSS)

## labor_force_status
##    Keeping house            Other          Retired           School 
##              242               48              445               81 
## Temp not working Unempl, laid off Working fulltime Working parttime 
##               53               84             1134              259 
##             <NA> 
##                2

Pick out an observation to write about. What are some characteristics of this observation?

Example: Observational unit number 10 is a 55-year-old father of 2. He works a 40-hour work week in a private (non-government) job. He has a high-school diploma with a total of 12 years of education; his spouse has 11 years, his father has 6 years, and his mother has 20 years of education.

What is a question we could answer using this data?

Examples: Do people who are self-employed work more hours than people who work for someone else? Do religious people support the death penalty similarly, regardless of their religion? Is a person’s happiness level associated with their age? Are people with higher household incomes more or less likely to believe that marijuana should be illegal?

Note: Why is the variable age_of_respondent a character <chr> and not an integer <int>?

glimpse(GSS)

head(GSS)
tail(GSS)
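One possible explanation for a numeric-looking variable being read as <chr> is that at least one entry in the column is text, which forces read.csv() to store the whole column as character. The sketch below illustrates this with a made-up column; the entry "89 or older" is an assumption for illustration and should be confirmed by looking at the actual values in GSS.

```r
# Hypothetical toy file contents: one text entry among the numbers
# (an assumption for illustration, not a confirmed GSS value)
toy_csv <- "age_of_respondent\n55\n23\n89 or older\n41"
toy <- read.csv(text = toy_csv, stringsAsFactors = FALSE)

# The single text entry makes the whole column character
print(class(toy$age_of_respondent))

# Converting to numeric turns the text entry into NA (with a warning,
# suppressed here so the sketch runs quietly)
age_num <- suppressWarnings(as.numeric(toy$age_of_respondent))
print(age_num)
```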

Research Question

Our research question involves looking at the proportion of people who are self-employed. We create a tally to determine the proportion of people in our sample from the GSS who are self-employed.

tally( ~self_emp_or_works_for_somebody, data=GSS, format="proportion")

## self_emp_or_works_for_somebody
## Self-employed  Someone else          <NA> 
##    0.09880750    0.86413969    0.03705281

We have several NA values. Let’s filter those out.

GSS <- filter(GSS, !is.na(self_emp_or_works_for_somebody))

tally( ~self_emp_or_works_for_somebody, data=GSS, format="proportion")

## self_emp_or_works_for_somebody
## Self-employed  Someone else 
##     0.1026095     0.8973905
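The change from 0.0988 to 0.1026 is just a change of denominator: once the NA rows are removed, the self-employed share is computed out of the respondents who answered. A short check of that arithmetic, using only the proportions printed above (the variable names are illustrative):

```r
# Proportions from the first tally, which included the NA category
prop_self_all <- 0.09880750   # self-employed, out of everyone
prop_na       <- 0.03705281   # missing answers, out of everyone

# Re-express the self-employed share among those who answered
prop_self_answered <- prop_self_all / (1 - prop_na)
print(round(prop_self_answered, 7))
```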

We will consider the population of people in the US workforce. We want to investigate whether or not 10% of the population are self-employed. Our null and alternative hypotheses are given below.

Hypotheses:

\[H_0: \pi = 0.10\] \[H_a:\pi \neq 0.10\]

Validity Conditions

When doing inference for a single proportion, our theory-based methods use the standard normal distribution. The normal distribution can be thought of as a prediction of what would occur if a simulation were done. Many times this prediction is valid, but not always. The approximation is considered valid when the data contain at least 10 successes and at least 10 failures.

Whenever we use theory-based methods we first must check the validity conditions. Calculate the number of people that are self-employed and the number of people that work for someone else. There are two ways to do this: use calculations from above and R as a calculator OR create a table of counts.

#Using R as a calculator
#as calculated above, the proportion of people who are self-employed is
prop_self = 0.1026095
#the number of people remaining after removing the NA values is
n=2261

#number self-employed:
n*prop_self

## [1] 232.0001

The proportion of people who work for someone else (0.8973905) is larger than 0.1026095, so that count is larger than 232 as well. Thus both counts are at least 10.

Alternatively, we can use a table of counts to check the validity conditions.

#a table of counts
tally( ~self_emp_or_works_for_somebody, data=GSS)

## self_emp_or_works_for_somebody
## Self-employed  Someone else 
##           232          2029

The validity conditions are met since the number of self-employed workers (232) and the number who work for someone else (2029) are both at least 10.
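The same check can be written as a small reusable function. This is only a sketch: check_validity and its min_count argument are hypothetical names introduced here, not part of mosaic or base R.

```r
# Hypothetical helper: TRUE when there are at least min_count successes
# and at least min_count failures
check_validity <- function(successes, failures, min_count = 10) {
  successes >= min_count && failures >= min_count
}

# Counts from the table above: 232 self-employed, 2029 someone else
conditions_met <- check_validity(232, 2029)
print(conditions_met)
```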

Inference for One Proportion

The command for theory-based inference on one proportion is prop.test.

totalworkers = 232+2029
totalworkers

## [1] 2261

#inference for one proportion
prop.test(~self_emp_or_works_for_somebody, data = GSS, success = "Self-employed", alternative = "two.sided", p = 0.10)

## 
##  1-sample proportions test with continuity correction
## 
## data:  GSS$self_emp_or_works_for_somebody  [with success = Self-employed]
## X-squared = 0.1433, df = 1, p-value = 0.705
## alternative hypothesis: true p is not equal to 0.1
## 95 percent confidence interval:
##  0.09055923 0.11603153
## sample estimates:
##         p 
## 0.1026095

#if you don't have a data file use successes=232, n=2261, null-hyp parameter pi=0.10
prop.test(232, 2261, p=0.10, alternative = "two.sided")

## 
##  1-sample proportions test with continuity correction
## 
## data:  232 out of 2261
## X-squared = 0.1433, df = 1, p-value = 0.705
## alternative hypothesis: true p is not equal to 0.1
## 95 percent confidence interval:
##  0.09055923 0.11603153
## sample estimates:
##         p 
## 0.1026095

#command to use for no continuity correction
#prop.test(~self_emp_or_works_for_somebody, data = GSS, success = "Self-employed", alternative = "two.sided", p = 0.10, correct=FALSE)
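To see where the X-squared value in the output comes from, the continuity-corrected statistic can be reproduced by hand. The sketch below uses the counts from this lab; the variable names are illustrative, and it covers only this one-sample, two-sided case.

```r
# Counts and null value from the lab
x     <- 232     # observed self-employed
n_obs <- 2261    # respondents with a non-missing answer
p0    <- 0.10    # proportion under the null hypothesis

# Expected number of self-employed under the null
expected <- n_obs * p0

# Continuity correction: shrink the observed-minus-expected gap by 0.5
x_squared <- (abs(x - expected) - 0.5)^2 / (n_obs * p0 * (1 - p0))

# Upper-tail area of a chi-squared distribution with 1 degree of freedom
p_value <- pchisq(x_squared, df = 1, lower.tail = FALSE)

print(round(x_squared, 4))
print(round(p_value, 3))
```

The two printed values should agree with the X-squared and p-value lines reported by prop.test.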

phat = 232/(232 +2029)
phat

## [1] 0.1026095

We have a large p-value of 0.705, so a statistic at least as extreme as phat = 0.1026 is very likely to occur if the null hypothesis is true. Therefore we cannot reject the null hypothesis: 10% is a plausible value for the proportion of the population that is self-employed. Note that failing to reject the null hypothesis does not prove that the proportion is exactly 10%.

The standardized \(z\)-statistic

Let’s verify this result by calculating the standardized z-statistic. First we need the standard deviation of the null distribution. For a hypothesis test on one proportion, it is

\[ SD(\pi)=\sqrt{\frac{\pi(1-\pi)}{n}} \]

where \(n\) is the number of observational units and \(\pi\) is the proportion from the null-hypothesis.

SD <- sqrt(0.1*(1-0.1)/n)
SD

## [1] 0.006309152

The standardized \(z\)-statistic is \[z = \frac{\hat{p} - \pi}{SD(\pi)}\]

z = (phat - 0.10)/SD
z

## [1] 0.4135999

The standardized statistic of \(z=0.4136\) leads to the same conclusion as the hypothesis test. The observed statistic \(\hat{p} = 0.1026\) is less than half a standard deviation away from the mean of the null distribution, \(0.10\). We are very likely to see a standardized statistic that small when the null hypothesis is true. Therefore we do not reject the null hypothesis. Our data are consistent with 10% of the population being self-employed, although they do not prove that the proportion is exactly 10%.

#visualization of one tail of the p-value: the shaded area above phat under the null distribution
library(tigerstats)
pnormGC(phat, region="above", mean=0.1, sd=SD, graph=TRUE)

Since our hypothesis test is two-sided, the p-value is approximated by multiplying this area by 2. Compare this value with the p-value from the 1-sample proportions test above; the two differ slightly because prop.test applied a continuity correction.

2*0.3396

## [1] 0.6792
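The doubled tail area can also be computed without reading the area off the graph, and it can be compared with prop.test when the continuity correction is turned off. This is a sketch using the counts from this lab; the variable names are illustrative.

```r
# Counts and null value from the lab
x     <- 232
n_obs <- 2261
p0    <- 0.10

# Standardized statistic, as computed above
phat_obs <- x / n_obs
sd_null  <- sqrt(p0 * (1 - p0) / n_obs)
z_obs    <- (phat_obs - p0) / sd_null

# Two-sided p-value from the standard normal distribution
p_from_z <- 2 * pnorm(abs(z_obs), lower.tail = FALSE)

# Same test through prop.test, without the continuity correction
p_no_correction <- prop.test(x, n_obs, p = p0,
                             alternative = "two.sided",
                             correct = FALSE)$p.value

print(round(p_from_z, 4))
print(round(p_no_correction, 4))
```

Without the correction, the two p-values agree, which suggests the gap between 0.6792 and 0.705 comes from the continuity correction rather than from a different method.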

Lab 2 Exercises

A legendary story on college campuses concerns two students who miss a chemistry exam because of excessive partying but blame their absence on a flat tire. The professor allows them to take a make-up exam, and sends them into separate rooms to take it. The first question, worth 5 points, is quite easy. The second question, worth 95 points, asks: Which tire was it?

Research Question

Do students pick which tire went flat in equal proportions? It has been conjectured that when students are asked this question and forced to give an answer (left front, right front, left rear, or right rear) off the top of their head, they tend to answer “right front” more than would be expected.

Design a study and collect data

To test this conjecture about the right front tire, a recent class of students was asked which tire they would say had gone flat if they were in this situation. The results can be found in the file: https://raw.githubusercontent.com/IJohnson-math/Math138/main/WhichTire.csv.

Load this data into R below. Name the data tires.

#load the data

What are the observational units, variables, and variable types?

Observational units (include the number of them):

Variable(s) (include the type):

Describe the parameter of interest, \(\pi\), in words.
State the appropriate null and alternative hypotheses to be tested.

Explore the data

What percentage of the students picked the right front tire? What is the count of students that picked the right front tire? Is it more than you would expect if students randomly pick one of the four tires? (include code used to explore the data in an R-chunk below)
Is it possible to observe the percentage you found in the previous question if the students were just selecting randomly?

Draw inferences

Calculate the proportion of students that select the right front tire and call this proportion phat. This is our observed statistic.

#phat <-

Use R to calculate the standard deviation of the null-distribution, \(SD(\pi)\).
Use R to calculate the standardized statistic. Interpret the meaning of the standardized statistic in a sentence below your calculation.

z<- 
z

Check the validity conditions for a theory-based hypothesis test. Explain how you know the conditions are met or not met.
Apply the theory-based test to determine a p-value.

Formulate conclusions

Is there strong evidence against the null hypothesis? Summarize the conclusion that you draw from this study, based on your standardized statistic and your p-value. Explain the reasoning behind your conclusion.

Another study.

Statistics students were asked to randomly select a whole number between 1 and 10. Sixty-two out of 101 students picked a number larger than 5. If they truly picked their numbers at random, we would expect about half the students to pick a number greater than 5 in the long run. Do statistics students really pick their numbers at random or not? Complete a hypothesis test by answering the following questions.

State the hypotheses in words.

\(H_0:\)

\(H_a:\)

Write the value of the statistic \(\hat{p}\).
Are the validity conditions met to use a theory-based test? Explain what you are checking in a sentence.
Use R to calculate the theory-based \(p\)-value.

# calculate the p-value.

Write out your conclusion in the context of the research question.