To check validity conditions for theory-based methods for inference with one proportion.
To apply theory-based methods for inference with one proportion when validity conditions are satisfied and draw appropriate conclusions.
To apply calculation techniques and create graphics using tools from Lab 1.
We use two packages in this course: mosaic and ggformula. To load a package, you use the library() function, wrapped around the name of a package.
I’ve put the code to load one package into the chunk below. Add the
other package you need.
library(mosaic)
library(ggformula)
# put in the other package that you need here
This step of loading our two main packages will be a necessary first step in all of our labs this semester.
We’ll load the example data, GSS_clean.csv. It should be inside the data folder in your RStudio Cloud and it is also available at this URL: https://raw.githubusercontent.com/IJohnson-math/Math138/main/GSS_clean.csv.
We’ll use the read.csv() function to read in the data.
#load data
GSS <- read.csv("https://raw.githubusercontent.com/IJohnson-math/Math138/main/GSS_clean.csv")
This dataset comes from the General Social Survey (GSS), which is collected by NORC at the University of Chicago. It is a random sample of households from the United States, and has been running since 1972, so it is very useful for studying trends in American life. The data I’ve given you is a subset of the questions asked in the survey, and the data has been cleaned to make it easier to use. But, there are still some messy aspects (which we’ll discover as we analyze it further throughout this class!).
Recall from Lab 1 that the following commands can be used to view parts of the data: glimpse, head, tail.
You can also view the data in another tab by clicking on ‘GSS’ in the
Environment pane. This allows you to scroll up and
down, and left and right to view the data.
Use these commands on the GSS data to answer the following questions.
Number of children, age of respondent, highest year of school completed, number of siblings, & many others.
mean(~highest_year_of_school_completed, data=GSS, na.rm=TRUE)
## [1] 13.73177
college major 1, college major 2, diploma/GED/Other, self-employed or works for somebody, Occupation code, & many others.
tally(~labor_force_status, data=GSS)
## labor_force_status
## Keeping house Other Retired School
## 242 48 445 81
## Temp not working Unempl, laid off Working fulltime Working parttime
## 53 84 1134 259
## <NA>
## 2
Example: Observational unit number 10 is a 55 year old father of 2. He works a 40 hour work week in a private (non-government) job. He has a high-school diploma with a total 12 years of education, his spouse has 11 years, father has 6 years and mother has 20 years of education.
Examples: Do people who are self-employed work more hours than people who work for someone else? Do religious people support the death penalty similarly, regardless of their religion? Is a person's happiness level associated with their age? Are people with higher household incomes more or less likely to believe that marijuana should be illegal?
Note: Why is the variable age_of_respondent a character <chr> and not an integer <int>?
glimpse(GSS)
head(GSS)
tail(GSS)
Our research question involves looking at the proportion of people that are self-employed. We create a tally to determine the proportion of people in our sample from the GSS that are self-employed.
tally( ~self_emp_or_works_for_somebody, data=GSS, format="proportion")
## self_emp_or_works_for_somebody
## Self-employed Someone else <NA>
## 0.09880750 0.86413969 0.03705281
We have several NA values. Let’s filter those out.
GSS <- filter(GSS, !is.na(self_emp_or_works_for_somebody))
tally( ~self_emp_or_works_for_somebody, data=GSS, format="proportion")
## self_emp_or_works_for_somebody
## Self-employed Someone else
## 0.1026095 0.8973905
We will consider the population of people in the US workforce. We want to investigate whether or not 10% of the population are self-employed. Our null and alternative hypotheses are given below.
\[H_0: \pi = 0.10\] \[H_a:\pi \neq 0.10\]
When doing inference for a single proportion, our theory-based methods use the standard normal distribution. The normal distribution can be thought of as a prediction of what would occur if a simulation were done. Many times this prediction is valid, but not always. The prediction is considered valid when the data contain at least 10 successes and at least 10 failures.
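To see what "a prediction of what would occur if a simulation was done" means, here is a minimal sketch in base R. It simulates many samples under the null hypothesis and compares the spread of the simulated proportions with the theory-based formula. The names n_obs, pi_null, reps, and sim_phat, the seed, and the number of repetitions are illustrative choices, not part of the lab; the sample size 2261 is the number of respondents left after removing NA values.

```r
# Sketch: simulate the null distribution of p-hat and compare its spread
# with the theory-based prediction sqrt(pi * (1 - pi) / n).
set.seed(138)      # arbitrary seed, used only so the run is reproducible
pi_null <- 0.10    # proportion stated in the null hypothesis
n_obs   <- 2261    # sample size after removing NA values
reps    <- 10000   # number of simulated samples (an illustrative choice)

# Each simulated sample is a count of successes out of n_obs,
# converted to a sample proportion.
sim_phat <- rbinom(reps, size = n_obs, prob = pi_null) / n_obs

sd_sim    <- sd(sim_phat)                             # spread seen in the simulation
sd_theory <- sqrt(pi_null * (1 - pi_null) / n_obs)    # spread predicted by theory

c(simulated = sd_sim, theory = sd_theory)
```

With this many successes and failures expected, the two standard deviations should be close, which is what it means for the normal prediction to be valid here.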
Whenever we use theory-based methods we first must check the validity conditions. Calculate the number of people that are self-employed and the number of people that work for someone else. There are two ways to do this: use calculations from above and R as a calculator OR create a table of counts.
#Using R as a calculator
#as calculated above the proportion of people that are self employed is
prop_self = 0.1026095
#the total number of people in the survey is
n = 2261
#number self-employed:
n*prop_self
## [1] 232.0001
Since the proportion of people that work for someone else is larger than 0.1026095, the count for that group is larger as well. Thus both counts are greater than 10.
Alternatively, we can use a table of counts to check the validity conditions.
#a table of counts
tally( ~self_emp_or_works_for_somebody, data=GSS)
## self_emp_or_works_for_somebody
## Self-employed Someone else
## 232 2029
The validity conditions are met since the number of self-employed workers (232) and the number who work for someone else (2029) are both at least 10.
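The same check can be written as a single logical test. The sketch below is one possible way to do it in base R. The counts are typed in from the table above, and the helper name conditions_met is hypothetical, not a function from mosaic or ggformula.

```r
# Sketch: check the validity conditions from a named vector of counts.
# The counts are copied from the tally output above.
counts <- c("Self-employed" = 232, "Someone else" = 2029)

# Hypothetical helper: TRUE when every category has at least `minimum` observations.
conditions_met <- function(counts, minimum = 10) {
  all(counts >= minimum)
}

conditions_met(counts)
```

A result of TRUE means there are at least 10 successes and at least 10 failures, so the theory-based test is appropriate.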
The command to test for inference on one proportion is prop.test.
totalworkers = 232+2029
totalworkers
## [1] 2261
#inference for one proportion
prop.test(~self_emp_or_works_for_somebody, data = GSS, success = "Self-employed", alternative = "two.sided", p = 0.10)
##
## 1-sample proportions test with continuity correction
##
## data: GSS$self_emp_or_works_for_somebody [with success = Self-employed]
## X-squared = 0.1433, df = 1, p-value = 0.705
## alternative hypothesis: true p is not equal to 0.1
## 95 percent confidence interval:
## 0.09055923 0.11603153
## sample estimates:
## p
## 0.1026095
#if you don't have a data file use successes=232, n=2261, null-hyp parameter pi=0.10
prop.test(232, 2261, p=0.10, alternative = "two.sided")
##
## 1-sample proportions test with continuity correction
##
## data: 232 out of 2261
## X-squared = 0.1433, df = 1, p-value = 0.705
## alternative hypothesis: true p is not equal to 0.1
## 95 percent confidence interval:
## 0.09055923 0.11603153
## sample estimates:
## p
## 0.1026095
#command to use for no continuity correction
#prop.test(~self_emp_or_works_for_somebody, data = GSS, success = "Self-employed", alternative = "two.sided", p = 0.10, correct=FALSE)
phat = 232/(232 +2029)
phat
## [1] 0.1026095
We have a large p-value of 0.705, so a sample proportion at least as far from 0.10 as our statistic phat = 0.1026 is very likely to occur if the null hypothesis is true. Therefore we do not reject the null hypothesis. The data are consistent with 10% of the population being self-employed, although failing to reject does not prove that the proportion is exactly 10%.
Let’s verify this calculation by calculating the standardized z-statistic. First we need to calculate the standard deviation. For one proportion, in a hypothesis test
\[ SD(\pi)=\sqrt{\frac{\pi(1-\pi)}{n}} \]
where \(n\) is the number of observational units and \(\pi\) is the proportion from the null-hypothesis.
SD <- sqrt(0.1*(1-0.1)/n)
SD
## [1] 0.006309152
The standardized \(z\)-statistic is \[z = \frac{\hat{p} - \pi}{SD(\pi)}\]
z = (phat - 0.10)/SD
z
## [1] 0.4135999
The standardized statistic of \(z=0.4136\) leads to the same conclusion as the hypothesis test. The observed statistic \(\hat{p} = 0.1026\) is less than half a standard deviation above the mean of the null distribution, \(0.10\). We are very likely to see a standardized statistic that small when the null hypothesis is true. Therefore we do not reject the null hypothesis. Our data are consistent with 10% of the population being self-employed; 10% is a plausible value for \(\pi\).
#visualization: the shaded area is the upper-tail (one-sided) area above phat; the two-sided p-value is twice this area
library(tigerstats)
pnormGC(phat, region="above", mean=0.1, sd=SD, graph=TRUE)
Since our hypothesis test is two-sided, the p-value is approximated by multiplying this area by 2. Compare this value with the p-value calculated from the 1-sample proportions test above.
2*0.3396
## [1] 0.6792
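The value 0.6792 is a little smaller than the 0.705 reported by prop.test because prop.test applied a continuity correction by default. As a sketch, the two-sided p-value can also be computed directly from the z-statistic with pnorm() and compared with prop.test run with correct=FALSE. The names x, n_obs, pi_null, p_from_z, and p_no_correction are illustrative, and stats::prop.test is called explicitly so the chunk runs without loading mosaic.

```r
# Sketch: two-sided p-value computed directly from the z-statistic.
x       <- 232     # number of self-employed respondents
n_obs   <- 2261    # total respondents after removing NA values
pi_null <- 0.10    # proportion stated in the null hypothesis

phat <- x / n_obs
SD   <- sqrt(pi_null * (1 - pi_null) / n_obs)
z    <- (phat - pi_null) / SD

# Area in both tails of the standard normal distribution beyond |z|
p_from_z <- 2 * pnorm(-abs(z))

# The same test through prop.test with the continuity correction turned off
p_no_correction <- stats::prop.test(x, n_obs, p = pi_null,
                                    alternative = "two.sided",
                                    correct = FALSE)$p.value

c(from_z = p_from_z, prop_test_uncorrected = p_no_correction)
```

The two numbers should agree with each other and with 2*0.3396 up to rounding, while the corrected test gives the slightly larger 0.705. Either way the p-value is large and the conclusion does not change.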
A legendary story on college campuses concerns two students who miss a chemistry exam because of excessive partying but blame their absence on a flat tire. The professor allows them to take a make-up exam, and sends them into separate rooms to take it. The first question, worth 5 points, is quite easy. The second question, worth 95 points, asks: Which tire was it?
Do students pick which tire went flat in equal proportions? It has been conjectured that when students are asked this question and forced to give an answer (left front, right front, left rear, or right rear) off the top of their head, they tend to answer “right front” more than would be expected.
To test this conjecture about the right front tire, a recent class of students was asked which tire they would say had gone flat if they were in this situation. The results can be found in the file: https://raw.githubusercontent.com/IJohnson-math/Math138/main/WhichTire.csv.
tires
#load the data
Observational units (include the number of them):
Variable(s) (include the type):
Describe the parameter of interest, \(\pi\), in words.
State the appropriate null and alternative hypotheses to be tested.
What percentage of the students picked the right front tire? What is the count of students that picked the right front tire? Is it more than you would expect if students randomly pick one of the four tires? (include code used to explore the data in an R-chunk below)
Is it possible to observe the percentage found in problem 5 if the students were just selecting randomly?
#phat <-
Use R to calculate the standard deviation of the null-distribution, \(SD(\pi)\).
Use R to calculate the standardized statistic. Interpret the meaning of the standardized statistic in a sentence below your calculation.
z <-
z
Check the validity conditions for a theory-based hypothesis test. Explain how you know the conditions are met or not met.
Apply the theory-based test to determine a p-value.
Statistics students were asked to randomly select a whole number between 1 and 10. Sixty-two out of 101 students picked a number larger than 5. If they truly randomly picked their numbers we would expect about half the students to pick a number greater than 5 in the long run. Do statistics students really pick their numbers randomly or not? Complete a hypothesis test by answering the following questions.
\(H_0:\)
\(H_a:\)
Write the value of the statistic \(\hat{p}\).
Are the validity conditions met to use a theory-based test? Explain what you are checking in a sentence.
Use R to calculate the theory-based \(p\)-value.
# calculate the p-value.