library(mosaic)
library(ggformula)
A colleague went to the lego.com website in February 2014 and recorded the number of pieces and the sales price for 157 Lego products listed there. The data appear in the Legos data file.
Legos <- read.table("http://www.isi-stats.com/isi/data/chap10/legos.txt", header=TRUE)
explanatory/predictor variable:
response variable:
#add labels to the graph
gf_point(price ~ pieces, data=Legos)
What do we see? The scatterplot shows a strong positive linear association between the number of Lego pieces in a set and the price of the set.
Estimate a number for the correlation of the Lego data: 0.9
Calculate the correlation of the Lego data.
r <- cor(price ~ pieces, data=Legos)
r
## [1] 0.9739097
Estimate the slope of the line that best fits the data by picking two points and finding the slope between them. Two points: (1000, 120) and (3500, 400)
#slope = (change in y-values)/(change in x-values)
(400-120)/(3500-1000)
## [1] 0.112
Calculate the line of best fit using the lm() function.
lm(price ~ pieces, data=Legos)
##
## Call:
## lm(formula = price ~ pieces, data = Legos)
##
## Coefficients:
## (Intercept) pieces
## 4.862 0.105
Our regression line (also called the line of best fit) is \[\widehat{\textrm{price}} = 4.862 + 0.105(\textrm{number of pieces})\]
Interpret the slope: For each additional Lego piece, we would expect the price of the set to increase by 0.105 dollars, on average. Equivalently, each additional piece is associated with a predicted price increase of 10.5 cents.
To graph both the data and the line of best fit we will use two new commands:
The command gf_abline(intercept = , slope = , color = "red") gives the graph of the line of best fit, with the numerical values of the intercept and slope included after the equal signs.
The %>% is a ‘piping’ command that tells R to first do the thing before the %>%, then do the thing after, then display both. This command layers the two graphs (the data and the line) on top of each other.
gf_point(price ~ pieces, data=Legos) %>%
gf_abline(intercept=4.862, slope=0.105, color="green")
Predict the price of a Lego set with 2850 pieces.
Predict the price of a Lego set with 5000 pieces.
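One way to carry out these two predictions is to plug each number of pieces into the fitted equation. The sketch below assumes the rounded coefficients reported by lm() above (4.862 and 0.105); the names b0, b1, and predict_price are illustrative and not part of the original lab.

```r
# Rounded coefficients taken from the lm() output above
b0 <- 4.862   # intercept (dollars)
b1 <- 0.105   # slope (dollars per piece)

# Hypothetical helper: predicted price for a given number of pieces
predict_price <- function(pieces) {
  b0 + b1 * pieces
}

predict_price(2850)   # 4.862 + 0.105 * 2850
predict_price(5000)   # 4.862 + 0.105 * 5000
```

Because the coefficients are rounded, these values may differ slightly from what predict() would return when applied to the stored lm() model.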
Do you have any concerns about either of these predictions? Yes, 5000 pieces is extrapolating far beyond the data. Extrapolation should be avoided because it can lead to misleading conclusions.
#first of three ways to calculate R^2, the coefficient of determination
r^2
## [1] 0.9485002
#second of three ways to calculate R^2
#store the fitted model first so that rsquared() has a model to work with
model1 <- lm(price ~ pieces, data=Legos)
rsquared(model1)
## [1] 0.9485002
The coefficient of determination, \(R^2= 0.9485 = 94.85\%\), describes the percentage of the total observed variation in the response variable (price) that is accounted for by the linear relationship with the explanatory variable (number of Lego pieces). So 94.85% of the variation in price is explained by the size of the set (its number of Lego pieces).
There is another way to calculate \(R^2\) using the total sum of squares and the explained sum of squares. We use an ANOVA table to calculate these sums.
aov(price ~ pieces, data=Legos)
## Call:
## aov(formula = price ~ pieces, data = Legos)
##
## Terms:
## pieces Residuals
## Sum of Squares 505561.8 27450.0
## Deg. of Freedom 1 155
##
## Residual standard error: 13.30777
## Estimated effects may be unbalanced
#SS(ybar) = SSTotal
SSTotal = 27450+505561.8
#SS(regression line) = sum of squared residuals = SSError
SSResid = 27450
SSExplained = (SSTotal-SSResid)
#the third way to calculate R^2
SSExplained/SSTotal
## [1] 0.9485002
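As a rough check that the three approaches always agree for a one-predictor regression, the sketch below repeats them on R's built-in cars data set (an assumption made here so the code runs without downloading anything; it is not the Lego data). The object names fit, r2_cor, r2_summary, and r2_ss are illustrative. It uses base R's summary() in place of mosaic's rsquared().

```r
# Illustration on R's built-in `cars` data (not the Lego data)
fit <- lm(dist ~ speed, data = cars)

# Method 1: square the correlation
r2_cor <- cor(cars$speed, cars$dist)^2

# Method 2: read R^2 from the model summary
r2_summary <- summary(fit)$r.squared

# Method 3: explained sum of squares divided by total sum of squares
ss <- anova(fit)[["Sum Sq"]]   # c(SS for speed, SS for residuals)
ss_explained <- ss[1]
ss_total <- sum(ss)
r2_ss <- ss_explained / ss_total

# All three values should match
c(r2_cor, r2_summary, r2_ss)
```

Reading the sums of squares from anova() rather than typing them by hand avoids rounding and transcription slips.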
In 2015, we sampled 13 homes from Zillow that were for sale just north of a small lake in Michigan and recorded the selling price and the square footage of each home. The data can be found at the URL below:
http://www.isi-stats.com/isi/data/chap10/HousePrices.txt
Load and name the Zillow data.
State the names of the explanatory and response variables in words and the exact name used in the data.
Explanatory:
Response:
View the data with a scatterplot. Label the axes and give your graph a title.
Describe the direction, strength, and form of the data.
Use R to find the regression line (also called the line of best fit) for the Zillow data.
Plot the data and the regression line together.
Interpret the slope and intercept of the line of best fit in context.
slope:
intercept:
Predict the selling price of a 4000 square ft house. Write R code to calculate and display the price.
Would you feel more comfortable using the regression line to predict the selling price of a 1000 square ft house or of a 4000 square ft house? Explain your choice.
Calculate the coefficient of determination, \(R^2\), using the following two methods: (1) calculate the correlation \(r\) and square it, (2) use R to calculate the total sum of squares and the explained sum of squares. Check that \(R^2\) is the same using both methods.
# use method 1: calculate the correlation r, then find R^2
#use method 2: calculate the total sum of squares and explained sum of squares to find R^2