Math 138, Section 10.3 Linear Regression

Linear Regression

library(mosaic)
library(ggformula)

A colleague went to the lego.com website in February 2014 and recorded the number of pieces and the sales price for 157 Lego products listed there. The data appear in the Legos data file.

Legos <- read.table("http://www.isi-stats.com/isi/data/chap10/legos.txt", header=TRUE)

explanatory/predictor variable:

response variable:

#add labels to the graph
gf_point(price ~ pieces, data=Legos)
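One way to add the labels is sketched below. It assumes the ggformula function gf_labs() and the same data file loaded above; the label wording and the name labeled_plot are our own suggestions, not part of the original notes.

```r
library(ggformula)

# Same data file the notes load above
Legos <- read.table("http://www.isi-stats.com/isi/data/chap10/legos.txt", header = TRUE)

# gf_labs() sets the axis labels and title; the label text is only a suggestion
labeled_plot <- gf_point(price ~ pieces, data = Legos) %>%
  gf_labs(x = "Number of pieces",
          y = "Price (dollars)",
          title = "Price versus number of pieces for Lego sets")

# Printing the stored plot displays it
labeled_plot
```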

What do we see? The scatterplot shows a strong positive linear association between the number of Lego pieces in a set and the price of the set.

Estimate a number for the correlation of the Lego data: 0.9

Calculate the correlation of the Lego data.

r <- cor(price ~ pieces, data=Legos)
r

## [1] 0.9739097

Estimate the slope of the line that best fits the data by picking two points and finding the slope between them. Two points: (1000, 120) and (3500, 400)

#slope = (change in y-values)/(change in x-values)
(400-120)/(3500-1000)

## [1] 0.112

Calculate the line of best fit using the lm( ) function.

lm(price ~ pieces, data=Legos)

## 
## Call:
## lm(formula = price ~ pieces, data = Legos)
## 
## Coefficients:
## (Intercept)       pieces  
##       4.862        0.105

Our regression line (also called the line of best fit) is \[\widehat{\textrm{price}} = 4.862 + 0.105(\textrm{number of pieces})\]

Interpret the slope: For each additional Lego piece, the predicted price of a set increases by 0.105 dollars, or about 10.5 cents, on average.
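If the fitted model is stored under a name, the unrounded coefficients can be pulled out with coef() and reused in calculations. This is a minimal sketch; the name lego_model is our own choice and does not appear in the notes.

```r
# Same data file the notes load above
Legos <- read.table("http://www.isi-stats.com/isi/data/chap10/legos.txt", header = TRUE)

# Store the fitted model so its coefficients can be reused (lego_model is a hypothetical name)
lego_model <- lm(price ~ pieces, data = Legos)

# Named vector holding the intercept and the slope, unrounded
coef(lego_model)

# The slope is the coefficient named after the explanatory variable
slope <- coef(lego_model)["pieces"]

# Predicted change in price (dollars) for 100 additional pieces
100 * slope
```

Scaling the slope this way can make the interpretation easier to state: roughly 10.5 dollars for every 100 additional pieces.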

To graph both the data and the line of best fit we will use two new commands:

the command gf_abline(intercept = , slope = , color = ) draws the line of best fit. Fill in the numerical values of the intercept and slope, and a color name in quotes such as "green", after the equal signs.
the %>% is a ‘piping’ operator that passes the result of the code before the %>% into the command after it. Here it layers the line on top of the scatterplot so that both are displayed in one graph.

gf_point(price ~ pieces, data=Legos) %>%
  gf_abline(intercept=4.862, slope=0.105, color="green")
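As an alternative sketch, ggformula also provides gf_lm(), which fits and draws the least-squares line itself, so the intercept and slope do not need to be typed in by hand. The name fitted_plot is our own.

```r
library(ggformula)

# Same data file the notes load above
Legos <- read.table("http://www.isi-stats.com/isi/data/chap10/legos.txt", header = TRUE)

# gf_lm() computes the least-squares line from the data and adds it as a layer
fitted_plot <- gf_point(price ~ pieces, data = Legos) %>%
  gf_lm(color = "green")

fitted_plot
```

Typing the coefficients into gf_abline() makes the connection to the regression equation explicit, while gf_lm() avoids rounding and copying errors.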

Predict the price of a Lego set with 2850 pieces.

Predict the price of a Lego set with 5000 pieces.
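Both predictions can be computed by plugging the number of pieces into the regression equation. The sketch below uses the rounded coefficients printed by lm() above; the helper name predict_price is our own and is not part of any package.

```r
# Coefficients rounded as printed by lm() above
intercept <- 4.862
slope <- 0.105

# Hypothetical helper: predicted price (dollars) from the regression equation
predict_price <- function(pieces) {
  intercept + slope * pieces
}

predict_price(2850)
predict_price(5000)
```

With these rounded coefficients the predicted prices are about 304 dollars and about 530 dollars.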

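Do you have any concerns about either of these predictions? Yes. A set with 5000 pieces is far beyond the range of the data, so that prediction is an extrapolation. Extrapolation should be avoided because the linear pattern may not continue outside the observed range, which can lead to misleading predictions.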
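A quick way to check whether a prediction is an extrapolation is to compare the new value with the range of the explanatory variable in the data. This is a sketch only; the names pieces_range and new_pieces are our own.

```r
# Same data file the notes load above
Legos <- read.table("http://www.isi-stats.com/isi/data/chap10/legos.txt", header = TRUE)

# Smallest and largest number of pieces observed in the data
pieces_range <- range(Legos$pieces)
pieces_range

# TRUE means the value lies inside the observed range; FALSE means extrapolation
new_pieces <- c(2850, 5000)
new_pieces >= pieces_range[1] & new_pieces <= pieces_range[2]
```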

Coefficient of determination

#first of three ways to calculate R^2, the coefficient of determination
r^2

## [1] 0.9485002

#second of three ways to calculate R^2: store the fitted model, then use rsquared() from mosaic
model1 <- lm(price ~ pieces, data=Legos)
rsquared(model1)

## [1] 0.9485002

The coefficient of determination, \(R^2= 0.9485 = 94.85\%\), is the percentage of the total observed variation in the response variable (price) that is accounted for by the linear relationship with the explanatory variable (number of Lego pieces). So 94.85% of the variation in price is explained by the number of pieces in the set.

There is another way to calculate \(R^2\) using the total sum of squares and the explained sum of squares. We use an ANOVA table to calculate these sums.
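In symbols, writing SSTotal for the total sum of squares, SSExplained for the explained sum of squares, and SSResid for the sum of squared residuals, the relationship is

\[R^2 = \frac{\textrm{SSExplained}}{\textrm{SSTotal}} = 1 - \frac{\textrm{SSResid}}{\textrm{SSTotal}}, \qquad \textrm{where } \textrm{SSTotal} = \textrm{SSExplained} + \textrm{SSResid}\]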

aov(price ~ pieces, data=Legos)

## Call:
##    aov(formula = price ~ pieces, data = Legos)
## 
## Terms:
##                   pieces Residuals
## Sum of Squares  505561.8   27450.0
## Deg. of Freedom        1       155
## 
## Residual standard error: 13.30777
## Estimated effects may be unbalanced

#SSTotal = sum of squared deviations of price from its mean (ybar)
SSTotal <- 27450 + 505561.8

#SSResid (also called SSError) = sum of squared residuals around the regression line
SSResid <- 27450
SSExplained <- SSTotal - SSResid

#the third way to calculate R^2
SSExplained/SSTotal

## [1] 0.9485002
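The same sums of squares can also be computed directly from the data rather than copied from the ANOVA table, which avoids rounding. This is a sketch under the assumption that the data contain no missing values; the names lego_model, ss_resid, ss_total, and r2 are our own.

```r
# Same data file the notes load above
Legos <- read.table("http://www.isi-stats.com/isi/data/chap10/legos.txt", header = TRUE)

# Hypothetical name for the stored fit
lego_model <- lm(price ~ pieces, data = Legos)

# Sum of squared residuals around the regression line
ss_resid <- sum(resid(lego_model)^2)

# Sum of squared deviations of price from its mean (assumes no missing prices)
ss_total <- sum((Legos$price - mean(Legos$price))^2)

# Explained variation as a fraction of total variation
r2 <- (ss_total - ss_resid) / ss_total
r2
```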

Exercises

In 2015, we sampled 13 homes from Zillow that were for sale just north of a small lake in Michigan and recorded the selling price and the square footage of each home. The data can be found at the URL below.

http://www.isi-stats.com/isi/data/chap10/HousePrices.txt

Load and name the Zillow data.
State the names of the explanatory and response variables in words and the exact name used in the data.

Explanatory:

Response:

View the data with a scatterplot. Label the axes and give your graph a title.
Describe the direction, strength, and form of the data.
Use R to find the regression line (also called the line of best fit) for the Zillow data.
Plot the data and the regression line together.
Interpret the slope and intercept of the line of best fit in context.

slope:

intercept:

Predict the selling price of a 4000 square ft house. Write R code to calculate and display the price.
Would you feel more comfortable using the regression line to predict the selling price of a 1000 square ft house or of a 4000 square ft house? Explain your choice.
Calculate the coefficient of determination, \(R^2\), using the following two methods: (1) calculate the correlation \(r\) and square it; (2) use R to calculate the total sum of squares and the explained sum of squares. Check that \(R^2\) is the same using both methods.

# use method 1: calculate the correlation r, then find R^2

#use method 2: calculate the total sum of squares and explained sum of squares to find R^2

State and interpret the value of \(R^2\) in context of this study.