# load the two packages needed for this lab
library(ggformula)
library(mosaic)

We will use data on the heart rates and body temperatures of healthy adults to see whether there is an association between body temperature and heart rate. The dataset consists of body temperatures (in Fahrenheit) and heart rates (in beats per minute) from 65 females and 65 males.

HRdata <- read.table("http://www.isi-stats.com/isi/data/chap10/TempHeart.txt", header=TRUE)

Let’s take a look at our data.

gf_point(HeartRate ~ BodyTemp, data=HRdata, ylab="Heart rate (bpm)", xlab="body temp (deg F)")

We see a weak positive linear relationship between body temperature and heart rate.

r <- cor(HeartRate ~ BodyTemp, data=HRdata)
r
## [1] 0.2536564

Do we have an outlier? Does the point at approximately (100.8, 75) influence the value of the correlation? That person might have a fever. Here is code to remove that observation and check whether it is influential.

# select rows of the data where body temp is less than or equal to 100.5
subset_HRdata <- subset(HRdata, BodyTemp <= 100.5)

Re-plot the data.

gf_point(HeartRate ~ BodyTemp, data=subset_HRdata)

# use a new name so we don't overwrite r, the correlation from the full data
r_subset <- cor(HeartRate ~ BodyTemp, data=subset_HRdata)
r_subset
## [1] 0.2536827

The correlation didn’t change much at all. The point near (100.8, 75) is not an influential observation. Therefore we will proceed with the original data.

Let’s calculate some favorite statistics (favstats), in particular the means and standard deviations of our two variables, body temperature and heart rate.

favstats(~ HeartRate, data=HRdata)
##  min Q1 median Q3 max     mean       sd   n missing
##   57 69     74 79  89 73.76154 7.062077 130       0
favstats(~BodyTemp, data=HRdata)
##   min   Q1 median   Q3   max     mean        sd   n missing
##  96.3 97.8   98.3 98.7 100.8 98.24923 0.7331832 130       0

Regression Line

It is always the case that the point \((\bar{x}, \bar{y})\), where \(\bar{x}\) is the mean of the explanatory variable and \(\bar{y}\) is the mean of the response variable, is a point on the line of best fit (the regression line).

This means the point \((\bar{x}_{temp}, \bar{y}_{hr}) = (98.25, 73.76)\) will be a point on our line of best fit.
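Rather than typing these means by hand, we can also pull them directly from the data with the mean( ) command from mosaic. This is a small sketch; the names xbar_temp and ybar_hr are our own choices.

xbar_temp = mean(~ BodyTemp, data=HRdata)  # mean body temperature (about 98.25)
ybar_hr = mean(~ HeartRate, data=HRdata)   # mean heart rate (about 73.76)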

We will store the values of the standard deviation for each variable.

s_temp = 0.7331832
s_hr = 7.062077

The values of the standard deviations can be used to calculate the slope of the regression line. The relationship between the correlation \(r\) and the slope \(b\) is given by

\[b = r \frac{s_y}{s_x}.\]

We can use this formula to preview the slope of our linear model.

s_y = s_hr
s_x = s_temp
slope = r*(s_y/s_x)
slope
## [1] 2.443238

Now we have the slope \(b=2.443238\) and a point on our regression line, \((98.25, 73.76)\), so we can solve for the intercept \(a\):

\[73.76 = a + 2.443238(98.25)\]

a = 73.76 - 2.443238*(98.25) 
a
## [1] -166.2881

Putting \(a\) and \(b\) into the slope-intercept form of the linear equation, we get

\[ \hat{y} = -166.2881 + 2.443238x. \]
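As a quick check of how the equation is used, consider a healthy adult with a body temperature of 98.6 degrees Fahrenheit (a value chosen here just for illustration); the model predicts a heart rate of about 74.6 bpm.

yhat = -166.2881 + 2.443238*98.6  # predicted heart rate at 98.6 deg F
yhat
## [1] 74.61517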

We can also calculate the coefficients \(a\), the intercept, and \(b\), the slope, for our line of best fit using the linear model command lm( response ~ predictor, data= ).

lm(HeartRate ~ BodyTemp, data=HRdata)
## 
## Call:
## lm(formula = HeartRate ~ BodyTemp, data = HRdata)
## 
## Coefficients:
## (Intercept)     BodyTemp  
##    -166.285        2.443

We get the same line of best fit as seen previously. Written in the context of this study, we have \[\widehat{HeartRate} = -166.285 + 2.443 \times BodyTemp.\] To graph both the data and the line of best fit, we will use two new commands from Lab 8A:

gf_point(HeartRate ~ BodyTemp, data=HRdata) %>%
  gf_abline(intercept=-166.285, slope=2.443, color="purple")

Our goal is to perform a hypothesis test to determine if the linear relationship given by the regression line is statistically significant.

Null: there is no linear association between body temperature and heart rate.

Alternative: there is a linear association between body temperature and heart rate.

Equivalently, in terms of slope, we let \(\beta\) represent the slope of the regression line for the entire population of healthy adults.

\[H_0: \beta = 0 \\ H_a: \beta \neq 0 \]

Validity conditions for regression with slope (or correlation)

Looking back at the data and regression line plot, the pattern of the points in the scatterplot doesn’t look curved; a linear trend seems reasonable. Second, we note that the points above the regression line appear similar in spread and shape to the points below the regression line (like a mirror image of the pattern). Finally, the variability in heart rates is approximately the same for different body temperatures (e.g., we don’t see that low body temperatures have low variability in heart rates and high body temperatures have high variability in heart rates, or vice versa). When checking validity conditions, we use logic much like we do when testing hypotheses: we assume the condition is met unless we see strong evidence that it is not.
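One optional way to examine these conditions is a plot of the residuals against the fitted values; curvature or a funnel shape there would cast doubt on the linearity or equal-variability conditions. Here is a sketch (the model name mod is our own; it is the same model fit below as m1):

mod <- lm(HeartRate ~ BodyTemp, data=HRdata)
gf_point(resid(mod) ~ fitted(mod), xlab="Fitted heart rate (bpm)", ylab="Residual (bpm)")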

Standardized \(t\)-statistic

Before calculating p-values we calculate the standardized \(t\)-statistic. There are two formulas we can use to calculate the standardized statistic, one uses the correlation \(r\) and the other uses the slope \(b\). Recall that these two numbers, \(r\) and \(b\), are related by a ratio of the standard deviations of our variables.

\[t = \frac{r}{ \sqrt{\frac{1-r^2}{n-2}} } \textrm{ and } \]
\[t = \frac{b-0}{SE(b)}\]

The first formula for the standardized statistic uses the sample size \(n\) and the correlation coefficient, \(r= 0.2536564\). The denominator \(\sqrt{\frac{1-r^2}{n-2}}\) is the standard error for the correlation coefficient.

The second formula uses the slope \(b=2.443\) and the standard error of the distribution of slopes, \(SE(b)\). The standard error \(SE(b)\) can be obtained from a simulation-based distribution of shuffled slopes or by theory-based techniques shown below.
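To illustrate the simulation-based option, here is a sketch using the shuffle( ) and do( ) commands from mosaic; the name null_slopes and the choice of 1000 shuffles are ours. The standard deviation of the shuffled slopes approximates \(SE(b)\).

# build a null distribution of 1000 slopes by shuffling body temperatures
null_slopes <- do(1000) * lm(HeartRate ~ shuffle(BodyTemp), data=HRdata)
# the shuffled slopes land in the column named after the predictor
sd(~ BodyTemp, data=null_slopes)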

Using the first formula with correlation, we can calculate \(t\) as follows.

n = 130
t_wcor = r/sqrt((1-r^2)/(n-2))  
t_wcor
## [1] 2.966826

We can view (and use) more of the information generated by the linear model command lm( ) by assigning the output a name, such as m1 (short for model 1). Then we can display a summary of the output using the summary( ) command.

m1 <- lm(HeartRate ~ BodyTemp, data=HRdata)
summary(m1)
## 
## Call:
## lm(formula = HeartRate ~ BodyTemp, data = HRdata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.6413  -4.6356   0.3247   4.8304  15.8474 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -166.2847    80.9123  -2.055  0.04190 * 
## BodyTemp       2.4432     0.8235   2.967  0.00359 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.858 on 128 degrees of freedom
## Multiple R-squared:  0.06434,    Adjusted R-squared:  0.05703 
## F-statistic: 8.802 on 1 and 128 DF,  p-value: 0.003591

Notice that this output contains the intercept and slope of the regression line, but the values are stacked vertically instead of displayed horizontally. Also notice that in the output the \(t\)-statistic \(= 2.967\) is the same as we calculated above. Next to the \(t\)-statistic is a \(p\)-value \(= 0.00359\). This represents the probability of seeing a standardized statistic of 2.967 or more extreme if the null hypothesis of no linear association is true. Note that the column label \(Pr(>|t|)\) is telling us this is a two-sided \(p\)-value. If we are performing a one-sided hypothesis test, we would divide the \(p\)-value by 2.
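If we want to pull these numbers out of R rather than reading them off the display, the coefficient table is stored in the summary object. Here is a sketch, assuming the model is named m1 as above:

# extract the two-sided p-value for the slope from the coefficient table
p_two = coef(summary(m1))["BodyTemp", "Pr(>|t|)"]
p_two/2  # the corresponding one-sided p-value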

Additionally, from the output above we can confirm the calculation of the \(t\) statistic, \(t = \frac{b-0}{SE(b)}\).

b=2.4432
SEb_theory = 0.8235
t_wslope = b/SEb_theory
t_wslope
## [1] 2.966849

Confidence intervals for slope

Using \(SE(b) = 0.8235\), we can calculate a 2SD confidence interval:

# upper and lower endpoints of a 95% confidence interval for the slope beta.
upper = 2.4432 + 2*0.8235
lower = 2.4432 - 2*0.8235

lower
## [1] 0.7962
upper
## [1] 4.0902

We can calculate the theory-based 95% confidence interval using the confint( ) function.

confint(m1, level=0.95)
##                   2.5 %    97.5 %
## (Intercept) -326.383620 -6.185819
## BodyTemp       0.813765  4.072711

Notice that the 2SD method gave us an interval that is slightly wider than necessary, but it has the benefit of being a quick, easy calculation that we can perform without technology.
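We can see exactly how much wider the 2SD interval is by computing the exact multiplier with qt( ); for 95% confidence with \(n-2 = 128\) degrees of freedom, it is slightly less than 2.

qt(0.975, df=128)  # approximately 1.98, just under the 2 used by the 2SD method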

Conclusions:

Strength of Evidence with context: We have a \(p\)-value of \(0.00359\) and a standardized statistic of \(2.967\), which give us very strong and strong evidence against the null hypothesis, respectively. Thus, our evidence supports the conclusion that there is a statistically significant linear association between body temperature and heart rate.

Estimation with context: Our 2SD confidence interval is \((0.7962, 4.0902)\) and the theory-based 95% confidence interval is \((0.814, 4.073)\). Therefore, we are 95% confident that a one-degree increase in body temperature is associated with an average increase of 0.814 to 4.073 heart beats per minute.

Causation and Generalization: Since this is an observational study of 130 healthy adults, we cannot draw any conclusions regarding causation. We also cannot generalize, since it was not indicated that our data are a simple random sample of healthy adults; for all we know, it could be a convenience sample of 18-25 year-old adults.

Exercises

Hand span and candy grab. Is hand span a good predictor of how much candy you can grab? Using 45 college students as subjects, researchers set out to explore whether a linear relationship exists between hand span (cm) and the number of Tootsie Rolls® each subject could grab.

Url for data: http://www.isi-stats.com/isi/data/chap10/HandSpan.txt

  1. Load and name the Tootsie roll data.
#load the data
  2. Record the observational units. Record the variables with units, variable type, and exact variable name from the data.

observational units:

explanatory/predictor:

response:

  3. State the null and alternative hypotheses to be investigated in this study in words. Write your words below in place of the ....

\[H_0: \ \dots \ \dots \]

\[H_a: \ \dots \ \dots \]

  4. Use R to calculate the means, standard deviations, sample size, and correlation coefficient from the data. Name and record these values for later use.
#means and standard deviations
#correlation coefficient
  5. Create a scatterplot of the number of Tootsie Rolls grabbed in terms of hand span. Label your graph and the axes appropriately.
# create a labeled and titled scatterplot of the data
  6. Does there appear to be an association between hand span and the number of Tootsie Rolls grabbed? Describe the strength, direction, and form of the data.

Association?

Data shape?

  7. Use your answers to Exercise 4 to find the regression line that predicts the number of Tootsie Rolls grabbed in terms of hand span.
# calculate the intercept and slope of the regression line using output from Exercise 4

Regression line slope:

Regression line intercept:

  8. Use R command(s) to find the regression line. Is your line the same as the line found in Exercise 7? Edit the formula below by inserting the numeric values of \(a\) and \(b\) from your code.
#find the equation of the regression line 

Regression line \[ \widehat{TootsieRs} = a + b (Hand \ Span)\]

  9. Interpret the slope of the regression line in context.

  10. Graph the data and regression line together in one plot.

#graph regression line and data
  11. Are the validity conditions satisfied for theory-based inference? Explain what you are checking and your conclusions. Hint: you should be checking three things.

  12. Calculate the standardized \(t\)-statistic using the numbers from Exercise 4.

  13. Use R to calculate the \(t\)-statistic and \(p\)-value for the hypothesis test.

  14. Calculate the 2SD confidence interval for the slope that describes the association between all hand spans and the number of Tootsie Rolls grabbed. Is zero in this interval?

#2SD confidence interval
  15. Calculate the coefficient of determination, \(R^2\), in two different ways and explain its meaning in the context of this investigation.
#calculate R^2 in two different ways.

Meaning of \(R^2\):

  16. Using your calculations above, provide a conclusion for this statistical study.

Strength of evidence and conclusion in context:

Estimation with explanation in context:

Generalization:

Causation: