We will use data on the heart rates and body temperatures of healthy adults to see whether there is an association between body temperature and heart rate. The dataset consists of body temperatures (in Fahrenheit) and heart rates (in beats per minute) from 65 females and 65 males.
Let’s take a look at our data.
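A sketch of the commands that could produce the plot and correlation shown below, assuming the data have been read into a data frame named `HRdata` and the ggformula package (part of the mosaic suite) is installed:

```r
library(ggformula)  # for gf_point(); assumed installed

# scatterplot of heart rate against body temperature
gf_point(HeartRate ~ BodyTemp, data = HRdata)

# correlation between the two variables
cor(HRdata$BodyTemp, HRdata$HeartRate)
```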
We see a weak positive linear relationship between body temperature and heart rate.
## [1] 0.2536564
Do we have an outlier? Does the point at approximately (100.8, 75) influence the value of the correlation? That person might have a fever. Here is code to remove that observation and check whether it is influential.
# select rows of the data where BodyTemp is less than or equal to 100.5
subset_HRdata <- subset(HRdata, BodyTemp <= 100.5)
Now re-plot the data and recompute the correlation.
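A sketch of this step, using the `subset_HRdata` created above (and assuming ggformula is loaded for the plot):

```r
# re-plot without the suspected outlier
gf_point(HeartRate ~ BodyTemp, data = subset_HRdata)

# recompute the correlation on the subset
cor(subset_HRdata$BodyTemp, subset_HRdata$HeartRate)
```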
## [1] 0.2536827
The correlation barely changed at all. The point near (100.8, 75) is not an influential observation, so we will proceed with the original data.
Let’s calculate some favorite statistics of the two variables BodyTemp and HeartRate, in particular the means and standard deviations, and record their values for later calculations. The first summary below is for BodyTemp, the second for HeartRate.
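These summaries have the shape of output from the mosaic package’s `favstats()` function; a sketch, assuming mosaic is installed and `HRdata` is loaded:

```r
library(mosaic)  # for favstats(); assumed installed

favstats(~ BodyTemp, data = HRdata)    # summary statistics for body temperature
favstats(~ HeartRate, data = HRdata)   # summary statistics for heart rate
```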
## min Q1 median Q3 max mean sd n missing
## 96.3 97.8 98.3 98.7 100.8 98.24923 0.7331832 130 0
## min Q1 median Q3 max mean sd n missing
## 57 69 74 79 89 73.76154 7.062077 130 0
\[ \widehat{ResponseVariable} = Intercept + Slope \times (ExplanatoryVariable)\]
It is always the case that the point of means \((\bar{x}, \bar{y})\) lies on the line of best fit (the regression line).
This means the point \((\bar{x}, \bar{y}) = (98.25, 73.76)\), the mean body temperature paired with the mean heart rate, will be a point on our line of best fit.
The standard deviation for each variable along with the correlation, \(r\), can be used to calculate the slope of the regression line using the formula:\[b = r \frac{s_y}{s_x}.\]
We can use this formula to preview the slope of our linear model.
## [1] 2.443466
Now we have the slope \(b = 2.443238\) (the value printed above differs in the last decimal places only because of rounding in the summary statistics) and a point on our regression line, \((98.25, 73.76)\), so we can solve to find the intercept
\[73.76 = a + 2.443238(98.25)\]
## [1] -166.2881
Putting \(a\) and \(b\) into the slope-intercept linear equation we get
\[ \hat{y} = -166.2881 + 2.443238x. \]
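The slope and intercept calculations above can be reproduced from the recorded summary statistics; a sketch (the final decimal places depend on how much rounding is carried through):

```r
r    <- 0.2536564  # correlation from above
sx   <- 0.7331832  # sd of BodyTemp
sy   <- 7.062077   # sd of HeartRate
xbar <- 98.24923   # mean BodyTemp
ybar <- 73.76154   # mean HeartRate

b <- r * sy / sx      # slope
a <- ybar - b * xbar  # intercept
c(a, b)
```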
We can also calculate the coefficients \(a\), the intercept, and \(b\), the slope, for our line of best fit using the linear model command `lm(response ~ predictor, data = )`.
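For our data this is (assuming `HRdata` is loaded):

```r
lm(HeartRate ~ BodyTemp, data = HRdata)
```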
##
## Call:
## lm(formula = HeartRate ~ BodyTemp, data = HRdata)
##
## Coefficients:
## (Intercept) BodyTemp
## -166.285 2.443
We get the same line of best fit as seen previously (up to rounding). We write the formula in the context of this study as\[\widehat{HeartRate} = -166.285 + 2.443 \times (BodyTemp).\] To graph both the data and the line of best fit we will use two new commands:

The command `gf_abline(intercept = , slope = , color = "purple")` graphs the line of best fit, with the numerical values of the intercept and slope entered after the equal signs.

The `%>%` is a ‘piping’ operator that tells R to first do the thing before the pipe `%>%`, then do the thing after, then display both. This command layers the two graphs (of the data and the line) one on top of the other.
gf_point(HeartRate ~ BodyTemp, data=HRdata) %>%
gf_abline(intercept=-166.285, slope=2.443, color="purple")
Our goal is to perform a hypothesis test to determine if the linear relationship given by the regression line is statistically significant.
Null: there is no linear association between body temperature and heart rate. Alternative: there is a linear association between body temperature and heart rate.
Equivalently, in terms of slope, we let \(\beta\) represent the slope of the regression line for the entire population of healthy adults.
\[H_0: \beta = 0 \\ H_a: \beta \neq 0 \]
Before using theory-based inference we check three validity conditions.

Linearity: the general pattern of the points in the scatterplot should follow a linear trend; the pattern should not show curved or other nonlinear patterns.

Symmetry: there should be approximately the same distribution of points above the regression line as below the regression line (symmetry about the regression line).

Equal variance: the variability of the points around the regression line should be similar regardless of the value of the explanatory variable; the spread of the points around the regression line should not change as you slide along the x-axis.
Looking back at the data and regression-line plot, the pattern of the points in the scatterplot does not look curved; a linear trend seems reasonable. Second, we note that the points above the regression line appear similar in spread and shape to the points below the regression line (close to a mirror-image pattern across the regression line). Finally, the variability in heart rates is approximately the same for different body temperatures (e.g., we don’t see low body temperatures with low variability in heart rates and high body temperatures with high variability, or vice versa). When checking validity conditions, we use logic much like we do when testing hypotheses: we assume a condition holds unless we see strong evidence that it does not.
Before calculating p-values we calculate the standardized \(t\)-statistic. There are two formulas we can use to calculate the standardized statistic, one uses the correlation \(r\) and the other uses the slope \(b\). Recall that these two numbers, \(r\) and \(b\), are related by a ratio of the standard deviations of our variables.
\[t = \frac{r}{\sqrt{\frac{1-r^2}{n-2}}} \quad \textrm{and} \quad t = \frac{b-0}{SE(b)}\]
The first formula for the standardized statistic uses the sample size \(n\) and the correlation coefficient, \(r = 0.2536564\). The denominator \(\sqrt{\frac{1-r^2}{n-2}}\) is the standard error of the correlation coefficient.
The second formula uses the slope \(b=2.443\) and the standard error of the distribution of slopes, \(SE(b)\). The standard error \(SE(b)\) can be obtained from a simulation-based distribution of shuffled slopes or by theory-based techniques shown below.
Using the first formula with correlation, we can calculate the standardized \(t\)-statistic as follows.
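A sketch of this calculation, assuming `HRdata` is loaded (the final digits depend on how much precision is carried in \(r\)):

```r
r <- cor(HRdata$BodyTemp, HRdata$HeartRate)  # correlation
n <- nrow(HRdata)                            # sample size, n = 130 here
t_stat <- r / sqrt((1 - r^2) / (n - 2))      # standardized t-statistic
t_stat
```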
## [1] 2.967156
We can view (and use) more of the information generated by the linear model command `lm( )` by assigning the output a name, such as `m1`, short for model 1. Then we can display a summary of the output using the `summary( )` command.
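That is (assuming `HRdata` is loaded):

```r
m1 <- lm(HeartRate ~ BodyTemp, data = HRdata)  # save the model as m1
summary(m1)                                    # display the full summary
```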
##
## Call:
## lm(formula = HeartRate ~ BodyTemp, data = HRdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.6413 -4.6356 0.3247 4.8304 15.8474
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -166.2847 80.9123 -2.055 0.04190 *
## BodyTemp 2.4432 0.8235 2.967 0.00359 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.858 on 128 degrees of freedom
## Multiple R-squared: 0.06434, Adjusted R-squared: 0.05703
## F-statistic: 8.802 on 1 and 128 DF, p-value: 0.003591
Notice that this output contains the intercept and slope of the regression line, but the values are stacked vertically instead of displayed horizontally. Also notice that in the output the \(t\)-statistic \(= 2.967\) is the same as we calculated above. Next to the \(t\)-statistic is a \(p\)-value \(= 0.00359\). This represents the probability of seeing a standardized statistic of 2.967 or more extreme if the null hypothesis of no linear association is true. Note that the column heading \(\Pr(>|t|)\) tells us this is a two-sided \(p\)-value. If we were performing a one-sided hypothesis test we would divide the \(p\)-value by 2.
Additionally, from the output above we can confirm the calculation of the \(t\) statistic, \(t = \frac{b-0}{SE(b)}\).
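Using the rounded values from the regression output:

```r
b   <- 2.4432  # slope estimate from the lm() output
SEb <- 0.8235  # standard error of the slope
b / SEb        # standardized t-statistic
```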
## [1] 2.966849
Using \(SE(b) = 0.8235\) we can calculate a 2SD confidence interval:
# upper and lower endpoints of a 95% confidence interval for the slope beta
b <- 2.4432    # slope estimate from the lm() output
SEb <- 0.8235  # standard error of the slope
lower <- b - 2*SEb
upper <- b + 2*SEb
lower
upper
## [1] 0.7962
## [1] 4.0902
We can calculate the theory-based 95% confidence interval using the `confint( )` function.
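A sketch, refitting the model named `m1` described above (assuming `HRdata` is loaded):

```r
m1 <- lm(HeartRate ~ BodyTemp, data = HRdata)
confint(m1)  # theory-based 95% confidence intervals for intercept and slope
```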
## 2.5 % 97.5 %
## (Intercept) -326.383620 -6.185819
## BodyTemp 0.813765 4.072711
Notice that the 2SD method gave us an interval that is slightly wider than necessary, but it has the benefit of being a quick, easy calculation that we can perform without technology.
Conclusions:
Strength of evidence with context: We have a \(p\)-value of \(0.00359\) and a standardized statistic of \(2.967\), which give us very strong and strong evidence against the null hypothesis, respectively. Thus, our evidence supports the conclusion that there is a statistically significant linear association between body temperature and heart rate.
Estimation with context: Our 2SD confidence interval is \((0.7962, 4.0902)\) and the theory-based 95% confidence interval is \((0.814, 4.073)\). Therefore, we are 95% confident that a one-degree increase in body temperature is associated with an average increase of 0.814 to 4.073 beats per minute in heart rate.
Causation and generalization: This is an observational study of 130 healthy adults, so we cannot draw any conclusions regarding causation. We also cannot generalize, since it was not indicated that our data are a random sample of healthy adults; for all we know, it could be a convenience sample of 18-to-25-year-old adults.
Hand span and candy grab. Is hand span a good predictor of how much candy you can grab? Last class we collected data on our hand span (in centimeters) and how many Tootsie Rolls® we could grab. In this lab we explore our class data to determine whether or not a positive linear association exists between these variables.
URL for class data: http://raw.githubusercontent.com/IJohnson-math/Math138/main/HandSpanClassData.csv
observational units:
explanatory/predictor:
response:
...
\[H_0: \ \dots \ \dots \]
\[H_a: \ \dots \ \dots \]
Does there appear to be an association between hand span and the number of tootsie rolls grabbed? Describe the strength, direction, and form of the data.
Use your answers to question 4 to find the slope and intercept of the regression line that predicts the number of tootsie rolls in terms of hand span.
Regression line slope:
Regression line intercept:
Regression line \[ \widehat{TootsieRs} = a + b (Hand \ Span)\]
Are the validity conditions satisfied for theory-based inference? Explain what you are checking and your conclusions. Hint: you should be checking three things.
Calculate the standardized \(t\)-statistic by hand using the numbers from question 4.
Use R to calculate the \(t\)-statistic and \(p\)-value for the hypothesis test. Check your answer with the answer to question 12.
Calculate the 2SD confidence interval for the slope that would describe the association between all hand spans and number of tootsie rolls grabbed. Is zero in this interval?
Strength of evidence and conclusion in context:
Estimation with explanation in context:
Generalization:
Causation: