Goals for this lab.

Setup and packages

As usual, we start by loading our two packages: mosaic and ggformula. To load a package, you use the library() function, wrapped around the name of a package. I’ve put the code to load one package into the chunk below. Add the other package you need.

library(mosaic)
library(ggformula)
# put in the other package that you need here

Loading in data

We’ll load the example data, GSS_clean.csv. It should be inside the data folder in your RStudio Cloud, and it is also available at this URL: https://raw.githubusercontent.com/IJohnson-math/Math138/main/GSS_clean.csv. We’ll use the read.csv() function to read in the data.

#load data
GSS <- read.csv("https://raw.githubusercontent.com/IJohnson-math/Math138/main/GSS_clean.csv")

This dataset comes from the General Social Survey (GSS), which is collected by NORC at the University of Chicago. It is a random sample of households from the United States, and has been running since 1972, so it is very useful for studying trends in American life. The data I’ve given you is a subset of the questions asked in the survey, and the data has been cleaned to make it easier to use. But, there are still some messy aspects (which we’ll discover as we analyze it further throughout this class!).

Basic commands to view the data.

In Lab 2, we used the following commands to view parts of the GSS data: glimpse, head, tail. You can also view the data in another tab by clicking on ‘GSS’ in the Environment pane. This allows you to scroll up and down, and left and right to view the data.

You may use these commands on the GSS data to review the data.

head(GSS)
tail(GSS)
glimpse(GSS)

Inference for One Mean

In this lab we will consider the mean of number_of_hours_worked_last_week as our statistic of interest. Looking at the GSS data we see many NA values for the variable number_of_hours_worked_last_week. Let’s start by filtering out the NA values. The command filter is used to keep the observational units that satisfy a given property. In this example the property is !is.na(number_of_hours_worked_last_week); here, the exclamation point, !, is read as “not”, so this command keeps the observational units that do not have NA as an entry for the variable number_of_hours_worked_last_week.

GSS <- filter(GSS, !is.na(number_of_hours_worked_last_week))
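
To see what !is.na() does on its own, here is a minimal sketch on a small made-up data frame. The names toy, hours, and toy_complete are hypothetical and are not part of the GSS data. The sketch uses base R row indexing so that it runs without any packages; it should keep the same rows that filter(toy, !is.na(hours)) would keep.

```r
# Hypothetical toy data frame (made up for illustration, NOT the GSS data)
toy <- data.frame(
  id    = 1:5,
  hours = c(40, NA, 35, NA, 50)
)

is.na(toy$hours)    # TRUE where the entry is missing
!is.na(toy$hours)   # TRUE where the entry is present

# Base R version of filter(toy, !is.na(hours)): keep rows where hours is present
toy_complete <- toy[!is.na(toy$hours), ]

nrow(toy)           # number of rows before filtering
nrow(toy_complete)  # number of rows after filtering
```

Comparing the row counts before and after filtering is a quick way to see how many observational units were dropped.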

Let’s look at the data. Create a histogram of number_of_hours_worked_last_week.

gf_histogram(~number_of_hours_worked_last_week, data=GSS)

Now we can compute the mean, the value of our point estimate. We name the statistic xbar (in place of the symbol \(\bar{x}\)).

xbar <- mean(~number_of_hours_worked_last_week, data=GSS)
xbar
## [1] 41.28168

Research question

Suppose we want to know whether the mean number of hours worked in a week by all U.S. workers is different from 40. We could write our null and alternative hypotheses as

\[ H_0: \mu = 40 \\ H_a: \mu\neq 40 \]

Validity Conditions for a One-sample \(t\)-test

The quantitative variable should have a symmetric distribution, or you should have at least 20 observations and the sample distribution should not be strongly skewed.

When these conditions are met we can use the \(t\)-distribution to approximate the \(p\)-value for our hypothesis test. It’s important to keep in mind that these conditions are rough guidelines and not a guarantee. All theory-based methods are approximations. They will work best when the sample distribution is symmetric, the sample size is large, and there are no large outliers. When in doubt, use a simulation-based method as a cross-check.

In this example we have \(n=1381\) observations, which is much larger than 20, and our sample distribution is symmetric as seen above in the histogram.
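
The simulation-based cross-check mentioned above could look something like the following sketch. It uses a made-up sample (the vector hours is simulated here and is NOT the GSS data), shifts the sample so that its mean equals the null value, and resamples with replacement to build an approximate null distribution of sample means. The names hours, mu0, shifted, null_means, and p_sim are assumptions chosen for illustration, and this is only one of several reasonable ways to simulate under the null hypothesis.

```r
# Hypothetical toy sample (simulated for illustration, NOT the GSS data)
set.seed(138)
hours <- round(rnorm(200, mean = 41, sd = 14))

mu0      <- 40            # null-hypothesis mean
xbar_obs <- mean(hours)   # observed sample mean

# Shift the sample so its mean equals mu0; resampling from the shifted
# values simulates sample means under the null hypothesis
shifted <- hours - xbar_obs + mu0

# Each replicate: resample with replacement and record the mean
null_means <- replicate(5000, mean(sample(shifted, replace = TRUE)))

# Two-sided p-value: proportion of simulated means at least as far
# from mu0 as the observed mean
p_sim <- mean(abs(null_means - mu0) >= abs(xbar_obs - mu0))
p_sim

# Theory-based p-value on the same toy sample, for comparison
t.test(hours, mu = mu0)$p.value
```

When the validity conditions hold, the simulated and theory-based p-values should be close to each other; a large disagreement is a signal to look more carefully at the sample distribution.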

Calculating the standardized statistic, the \(t\)-statistic

The standardized statistic, \(t\), is found using the formula \[ t = \frac{\bar{x} - \mu}{SE(\bar{x})} \]

and the standard error for the null distribution is given by

\[ SE(\bar{x})=\frac{s}{\sqrt{n}}. \]

Calculate the standardized \(t\)-statistic

#calculate the standard deviation of the sample, s
s <- sd( ~number_of_hours_worked_last_week, data=GSS)

# n is the number of observational units (after filtering); here n is 1381
n <- nrow(GSS)

#calculate standard error
SE <- s/sqrt(n)

#mu is the mean under the null hypothesis
mu <- 40

#now we can calculate (and display) the standardized statistic
t <- (xbar - mu)/SE
t
## [1] 3.289241
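
The standardized statistic can also be turned into a \(p\)-value by hand. The sketch below assumes the values printed above (t = 3.289241 and n = 1381) and uses the base R function pt(), which gives the area under a \(t\)-distribution to the left of a value. The names t_obs, deg_free, and p_value are chosen here for illustration. The result should agree with the \(p\)-value that t.test reports.

```r
t_obs    <- 3.289241   # standardized statistic printed above
n        <- 1381       # sample size after filtering
deg_free <- n - 1      # degrees of freedom for a one-sample t-test

# Two-sided p-value: area in both tails beyond |t_obs|
p_value <- 2 * pt(-abs(t_obs), df = deg_free)
p_value
```

For a one-sided alternative, the factor of 2 would be dropped and only the tail in the direction of the alternative hypothesis would be used.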

We can calculate a \(p\)-value for this hypothesis test in R using the command t.test. As we saw in Lab 2, the options for alternative are “two.sided”, “greater”, and “less”, depending on the inequality in the alternative hypothesis. We must also enter the null-hypothesis value mu (in place of the symbol \(\mu\)).

t.test(~number_of_hours_worked_last_week,  data = GSS, alternative = "two.sided", mu=40)
## 
##  One Sample t-test
## 
## data:  number_of_hours_worked_last_week
## t = 3.2892, df = 1380, p-value = 0.00103
## alternative hypothesis: true mean is not equal to 40
## 95 percent confidence interval:
##  40.51729 42.04607
## sample estimates:
## mean of x 
##  41.28168
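
The 95 percent confidence interval in this output can be reproduced by hand. The sketch below assumes the values printed earlier (the sample mean 41.28168 and the standardized statistic 3.289241) and recovers the standard error from the formula \(t = (\bar{x} - \mu)/SE(\bar{x})\). The names t_obs, mu0, se, t_star, and ci are chosen here for illustration. It uses the base R function qt() to find the multiplier for 95% confidence.

```r
xbar  <- 41.28168   # sample mean printed above
t_obs <- 3.289241   # standardized statistic printed above
mu0   <- 40         # null-hypothesis mean
n     <- 1381       # sample size after filtering

# Standard error recovered from t = (xbar - mu0) / SE
se <- (xbar - mu0) / t_obs

# Multiplier for 95% confidence from the t-distribution with n - 1 df
t_star <- qt(0.975, df = n - 1)

# Confidence interval: xbar plus or minus t_star standard errors
ci <- c(xbar - t_star * se, xbar + t_star * se)
ci
```

The endpoints should match the interval in the t.test output up to rounding. Notice that 40 lies outside this interval, which is consistent with the small \(p\)-value.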

Conclusions

Our data is from a random sample of \(n=1381\) US workers collected through the General Social Survey. Since our sample is random, we may generalize our findings to the larger population of US workers. We considered the number of hours worked last week, a quantitative variable, and investigated whether or not the mean number of hours worked last week by US workers is equal to 40.

What can be concluded from the \(t\)-statistic and \(p\)-value?

Our statistic, the sample mean of \(\bar{x}=41.28\) hours worked last week, is 3.29 standard errors above the hypothesized mean of 40 hours worked last week. An observed statistic that is more than 3 standard errors away from the hypothesized mean, as ours is here, is very strong evidence against the null hypothesis. We would be very unlikely to obtain a random sample of \(n=1381\) people with a sample mean at least as far from 40 as \(\bar{x}=41.28\) hours if the population of US workers truly worked an average of 40 hours last week.

Similarly, the \(p\)-value of 0.00103 is very small. When a \(p\)-value is less than 0.01, as ours is here, we have very strong evidence against the null hypothesis. Thus, we reject the null hypothesis in favor of the alternative hypothesis that the mean number of hours worked by US workers is not equal to 40 hours per week.

Notice that the \(t\)-statistic and \(p\)-value give the same conclusions, as expected.

Lab 3 Exercises

Data were collected from 65 healthy female volunteers aged 18 to 40 in the United States who were participating in a vaccine trial. The file at the URL http://www.isi-stats.com/isi/data/chap2/FemaleTemp.txt contains body temperature data from these 65 females.

Research Question.

We will investigate whether the average body temperature of healthy adult females in the U.S. is below or equal to 98.6 degrees Fahrenheit.

  1. Describe the parameter of interest, \(\mu\), in words.

  2. Write out the null and alternative hypotheses in words.

Import Data

  1. Load this data into R below. Name the data Temps.
#load the data

Explore the Data

  1. Use commands to view the data and determine the variable name. Create a histogram of the temperature data.

  2. Use R commands to calculate the sample mean, \(\bar{x}\), and the sample standard deviation, \(s\).

Validity Conditions

  1. Are the validity conditions met for a theory-based test? Explain how you know.

Theory-based \(t\)-test.

  1. Perform a theory-based \(t\)-test.

Calculate the standardized \(t\)-statistic

  1. Use R as a calculator to find the standardized \(t\)-statistic.

Conclusions.

  1. Based on the standardized statistic, what conclusions can you make?

  2. Based on the \(p\)-value, what conclusions can you make?

  3. Do you feel comfortable generalizing your findings to all healthy adult females in the U.S.? Explain.