We will use data on the heart rates and body temperatures of healthy adults to see whether there is an association between body temperature and heart rate. The dataset consists of body temperatures (in Fahrenheit) and heart rates (in beats per minute) from 65 females and 65 males.
Let’s take a look at our data.
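A sketch of the commands that could produce the plot and correlation shown below, assuming the data have been read into a data frame named `HRdata` and the ggformula package (part of the mosaic suite) is installed:

```r
library(ggformula)  # for gf_point(); assumed installed

# scatterplot of heart rate against body temperature
gf_point(HeartRate ~ BodyTemp, data = HRdata)

# correlation between the two variables
cor(HRdata$BodyTemp, HRdata$HeartRate)
```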
We see a weak positive linear relationship between body temperature and heart rate.
## [1] 0.2536564
Do we have an outlier? Does the point at approximately (100.8, 75) influence the value of the correlation? That person might have a fever. Here is code to remove that observation and check whether it is influential.
# select rows of the data where BodyTemp is less than or equal to 100.5
subset_HRdata <- subset(HRdata, BodyTemp <= 100.5)
Now re-plot the data and recompute the correlation.
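A sketch of this step, using the `subset_HRdata` created above (and assuming ggformula is loaded for the plot):

```r
# re-plot without the suspected outlier
gf_point(HeartRate ~ BodyTemp, data = subset_HRdata)

# recompute the correlation on the subset
cor(subset_HRdata$BodyTemp, subset_HRdata$HeartRate)
```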
## [1] 0.2536827
The correlation barely changed at all. The point near (100.8, 75) is not an influential observation, so we will proceed with the original data.
Let’s calculate some favorite statistics of the two variables BodyTemp and HeartRate, in particular the means and standard deviations, and record their values for later calculations. The first summary below is for BodyTemp, the second for HeartRate.
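These summaries have the shape of output from the mosaic package’s `favstats()` function; a sketch, assuming mosaic is installed and `HRdata` is loaded:

```r
library(mosaic)  # for favstats(); assumed installed

favstats(~ BodyTemp, data = HRdata)    # summary statistics for body temperature
favstats(~ HeartRate, data = HRdata)   # summary statistics for heart rate
```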
## min Q1 median Q3 max mean sd n missing
## 96.3 97.8 98.3 98.7 100.8 98.24923 0.7331832 130 0
## min Q1 median Q3 max mean sd n missing
## 57 69 74 79 89 73.76154 7.062077 130 0
\[ \widehat{ResponseVariable} = Intercept + Slope \times (ExplanatoryVariable)\]
It is always the case that the point of means \((\bar{x}, \bar{y})\) lies on the line of best fit (the regression line).
This means the point \((\bar{x}, \bar{y}) = (98.25, 73.76)\), the mean body temperature paired with the mean heart rate, will be a point on our line of best fit.
The standard deviation for each variable along with the correlation, \(r\), can be used to calculate the slope of the regression line using the formula:\[b = r \frac{s_y}{s_x}.\]
We can use this formula to preview the slope of our linear model.
## [1] 2.443466
Now we have the slope \(b = 2.443238\) (the value printed above differs in the last decimal places only because of rounding in the summary statistics) and a point on our regression line, \((98.25, 73.76)\), so we can solve to find the intercept
\[73.76 = a + 2.443238(98.25)\]
## [1] -166.2881
Putting \(a\) and \(b\) into the slope-intercept linear equation we get
\[ \hat{y} = -166.2881 + 2.443238x. \]
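The slope and intercept calculations above can be reproduced from the recorded summary statistics; a sketch (the final decimal places depend on how much rounding is carried through):

```r
r    <- 0.2536564  # correlation from above
sx   <- 0.7331832  # sd of BodyTemp
sy   <- 7.062077   # sd of HeartRate
xbar <- 98.24923   # mean BodyTemp
ybar <- 73.76154   # mean HeartRate

b <- r * sy / sx      # slope
a <- ybar - b * xbar  # intercept
c(a, b)
```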
We can also calculate the coefficients \(a\), the intercept, and \(b\), the slope, for our line of best fit using the linear model command `lm(response ~ predictor, data = )`.
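For our data this is (assuming `HRdata` is loaded):

```r
lm(HeartRate ~ BodyTemp, data = HRdata)
```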
##
## Call:
## lm(formula = HeartRate ~ BodyTemp, data = HRdata)
##
## Coefficients:
## (Intercept) BodyTemp
## -166.285 2.443
We get the same line of best fit as seen previously (up to rounding). We write the formula in the context of this study as\[\widehat{HeartRate} = -166.285 + 2.443 \times (BodyTemp).\] To graph both the data and the line of best fit we will use two new commands:

The command `gf_abline(intercept = , slope = , color = "purple")` graphs the line of best fit, with the numerical values of the intercept and slope entered after the equal signs.

The `%>%` is a ‘piping’ operator that tells R to first do the thing before the pipe `%>%`, then do the thing after, then display both. This command layers the two graphs (of the data and the line) one on top of the other.
gf_point(HeartRate ~ BodyTemp, data=HRdata) %>%
gf_abline(intercept=-166.285, slope=2.443, color="purple")
Our goal is to perform a hypothesis test to determine if the linear relationship given by the regression line is statistically significant.
Null: there is no linear association between body temperature and heart rate. Alternative: there is a linear association between body temperature and heart rate.
Equivalently, in terms of slope, we let \(\beta\) represent the slope of the regression line for the entire population of healthy adults.
\[H_0: \beta = 0 \\ H_a: \beta \neq 0 \]
Before using theory-based inference we check three validity conditions.

Linearity: the general pattern of the points in the scatterplot should follow a linear trend; the pattern should not show curved or other nonlinear patterns.

Symmetry: there should be approximately the same distribution of points above the regression line as below the regression line (symmetry about the regression line).

Equal variance: the variability of the points around the regression line should be similar regardless of the value of the explanatory variable; the spread of the points around the regression line should not change as you slide along the x-axis.
Looking back at the data and regression-line plot, the pattern of the points in the scatterplot does not look curved; a linear trend seems reasonable. Second, we note that the points above the regression line appear similar in spread and shape to the points below the regression line (close to a mirror-image pattern across the regression line). Finally, the variability in heart rates is approximately the same for different body temperatures (e.g., we don’t see low body temperatures with low variability in heart rates and high body temperatures with high variability, or vice versa). When checking validity conditions, we use logic much like we do when testing hypotheses: we assume a condition holds unless we see strong evidence that it does not.
Before calculating p-values we calculate the standardized \(t\)-statistic. There are two formulas we can use to calculate the standardized statistic, one uses the correlation \(r\) and the other uses the slope \(b\). Recall that these two numbers, \(r\) and \(b\), are related by a ratio of the standard deviations of our variables.
\[t = \frac{r}{\sqrt{\frac{1-r^2}{n-2}}} \quad \textrm{and} \quad t = \frac{b-0}{SE(b)}\]
The first formula for the standardized statistic uses the sample size \(n\) and the correlation coefficient, \(r = 0.2536564\). The denominator \(\sqrt{\frac{1-r^2}{n-2}}\) is the standard error of the correlation coefficient.
The second formula uses the slope \(b=2.443\) and the standard error of the distribution of slopes, \(SE(b)\). The standard error \(SE(b)\) can be obtained from a simulation-based distribution of shuffled slopes or by theory-based techniques shown below.
Using the first formula with correlation, we can calculate the standardized \(t\)-statistic as follows.
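A sketch of this calculation, assuming `HRdata` is loaded (the final digits depend on how much precision is carried in \(r\)):

```r
r <- cor(HRdata$BodyTemp, HRdata$HeartRate)  # correlation
n <- nrow(HRdata)                            # sample size, n = 130 here
t_stat <- r / sqrt((1 - r^2) / (n - 2))      # standardized t-statistic
t_stat
```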
## [1] 2.967156
We can view (and use) more of the information generated by the linear model command `lm( )` by assigning the output a name, such as `m1`, short for model 1. Then we can display a summary of the output using the `summary( )` command.
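That is (assuming `HRdata` is loaded):

```r
m1 <- lm(HeartRate ~ BodyTemp, data = HRdata)  # save the model as m1
summary(m1)                                    # display the full summary
```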
##
## Call:
## lm(formula = HeartRate ~ BodyTemp, data = HRdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.6413 -4.6356 0.3247 4.8304 15.8474
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -166.2847 80.9123 -2.055 0.04190 *
## BodyTemp 2.4432 0.8235 2.967 0.00359 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.858 on 128 degrees of freedom
## Multiple R-squared: 0.06434, Adjusted R-squared: 0.05703
## F-statistic: 8.802 on 1 and 128 DF, p-value: 0.003591
Notice that this output contains the intercept and slope of the regression line, but the values are stacked vertically instead of displayed horizontally. Also notice that in the output the \(t\)-statistic \(= 2.967\) is the same as we calculated above. Next to the \(t\)-statistic is a \(p\)-value \(= 0.00359\). This represents the probability of seeing a standardized statistic of 2.967 or more extreme if the null hypothesis of no linear association is true. Note that the column heading \(\Pr(>|t|)\) tells us this is a two-sided \(p\)-value. If we were performing a one-sided hypothesis test we would divide the \(p\)-value by 2.
Additionally, from the output above we can confirm the calculation of the \(t\) statistic, \(t = \frac{b-0}{SE(b)}\).
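Using the rounded values from the regression output:

```r
b   <- 2.4432  # slope estimate from the lm() output
SEb <- 0.8235  # standard error of the slope
b / SEb        # standardized t-statistic
```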
## [1] 2.966849
Using \(SE(b) = 0.8235\) we can calculate a 2SD confidence interval:
# upper and lower endpoints of a 95% confidence interval for the slope beta
b <- 2.4432    # slope estimate from the lm() output
SEb <- 0.8235  # standard error of the slope
lower <- b - 2*SEb
upper <- b + 2*SEb
lower
upper
## [1] 0.7962
## [1] 4.0902
We can calculate the theory-based 95% confidence interval using the `confint( )` function.
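A sketch, refitting the model named `m1` described above (assuming `HRdata` is loaded):

```r
m1 <- lm(HeartRate ~ BodyTemp, data = HRdata)
confint(m1)  # theory-based 95% confidence intervals for intercept and slope
```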
## 2.5 % 97.5 %
## (Intercept) -326.383620 -6.185819
## BodyTemp 0.813765 4.072711
Notice that the 2SD method gave us an interval that is slightly wider than necessary, but it has the benefit of being a quick, easy calculation that we can perform without technology.
Conclusions:
Strength of evidence with context: We have a \(p\)-value of \(0.00359\) and a standardized statistic of \(2.967\), which give us very strong and strong evidence against the null hypothesis, respectively. Thus, our evidence supports the conclusion that there is a statistically significant linear association between body temperature and heart rate.
Estimation with context: Our 2SD confidence interval is \((0.7962, 4.0902)\) and the theory-based 95% confidence interval is \((0.814, 4.073)\). Therefore, we are 95% confident that a one-degree increase in body temperature is associated with an average increase of 0.814 to 4.073 beats per minute in heart rate.
Causation and generalization: This is an observational study of 130 healthy adults, so we cannot draw any conclusions regarding causation. We also cannot generalize, since it was not indicated that our data are a random sample of healthy adults; for all we know, it could be a convenience sample of 18-to-25-year-old adults.
Hand span and candy grab. Is hand span a good predictor of how much candy you can grab? Last class we collected data on our hand span (in centimeters) and how many Tootsie Rolls® we could grab. In this lab we explore our class data to determine whether or not a positive linear association exists between these variables.
URL for class data: http://raw.githubusercontent.com/IJohnson-math/Math138/main/HandSpanClassData.csv
observational units:
explanatory/predictor:
response:
...
\[H_0: \ \dots \ \dots \]
\[H_a: \ \dots \ \dots \]
Does there appear to be an association between hand span and the number of tootsie rolls grabbed? Describe the strength, direction, and form of the data.
Use your answers to question 4 to find the slope and intercept of the regression line that predicts the number of tootsie rolls in terms of hand span.
Regression line slope:
Regression line intercept:
Regression line \[ \widehat{TootsieRs} = a + b (Hand \ Span)\]
Are the validity conditions satisfied for theory-based inference? Explain what you are checking and your conclusions. Hint: you should be checking three things.
Calculate the standardized \(t\)-statistic by hand using the numbers from question 4.
Use R to calculate the \(t\)-statistic and \(p\)-value for the hypothesis test. Check your answer with the answer to question 12.
Calculate the 2SD confidence interval for the slope that would describe the association between all hand spans and number of tootsie rolls grabbed. Is zero in this interval?
Strength of evidence and conclusion in context:
Estimation with explanation in context:
Generalization:
Causation: