Lab 1: Introduction to R and RStudio

Goals for this lab.

The first goal of this lab is to introduce you to R and RStudio. We will be using this software throughout the course both to learn the statistical concepts discussed in the textbook and to analyze real data. To straighten out which is which: R is the name of the programming language itself and RStudio is a convenient interface.
Another goal is to learn how to import data.
We will also learn a few commands to view the data and create basic summaries in R.
We will learn how to create basic bar charts and dot plots.

At the end of the in-class portion of the lab you will be asked to apply what you have learned to a new data set. Your solutions will be written in an Rmarkdown document that will contain your code and the output of your code.

The last goal of the lab is to learn how to create an Rmarkdown document, how to enter text and code into the document, and how to knit the document which creates a pdf file that you can turn in on Gradescope.

We begin with the fundamental building blocks of R and RStudio: the interface, reading in data, and some basic commands.

The RStudio interface.

Initially there are three panes in the RStudio interface. The pane in the upper right contains your Environment workspace as well as a History of the commands that you’ve previously entered. When you import data or define a variable they will appear in your Environment.

Any Plots that you generate or files that you upload will show up in the pane in the lower right corner. The lower right pane also contains a list of installed Packages; checking the box next to a package loads it into your working library. R packages have complex dependency relationships, but often if you need a package that is not installed, R will ask if you want to install it.

The pane on the left is where the action happens. It’s called the Console. Every time you launch RStudio, it will have the same text at the top of the console telling you the version of R that you’re running. Below that information is the prompt. As its name suggests, this prompt is really a request, a request for a command. Initially, interacting with R is all about typing commands and interpreting the output. These commands and their syntax have evolved over decades (literally) and now provide what many users feel is a fairly natural way to access data and organize, describe, and invoke statistical computations.

To get you started, enter the following command at the R prompt (i.e. right after > on the console). Type the command in manually exactly as it is displayed below.

install.packages("readr")

This command downloads and installs the ‘readr’ package so that it is available for us to use. As the name suggests, the ‘readr’ package is used to read a data file into R.

Next use the same ‘install.packages( )’ command to install three additional packages:

the ‘tidyverse’ package that contains tools for creating graphics and data manipulation,
the ‘rmarkdown’ package that will be used to format your lab reports, and
the ‘openintro’ package that contains a template that you can use for your lab reports.

The installation step above will only need to be done once. After installing these four packages they will be listed in the lower right pane under Packages. You can put a checkmark next to these four packages to use the packages or, alternatively, you can enter the library( ) commands shown below. (From here on you can copy and paste the commands.)

#copy and paste these commands into the console
library(readr)
library(tidyverse)
library(rmarkdown)
library(openintro)
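
If a library( ) command reports that a package cannot be found, the package was probably not installed. The following is a minimal sketch for checking which of the four packages are missing. The helper name missing_packages is our own (hypothetical); it relies on base R's requireNamespace( ).

```r
# Hypothetical helper: returns the names of packages that are NOT installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1),
                      quietly = TRUE, USE.NAMES = FALSE)
  pkgs[!installed]
}

# The four packages used in this lab.
lab_packages <- c("readr", "tidyverse", "rmarkdown", "openintro")
missing_packages(lab_packages)
```

An empty result, character(0), means all four packages are installed. Otherwise, pass the listed names to install.packages( ).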

R as a calculator.

Before doing anything fancy, notice that R can be used as a calculator. Run the code and calculate the value of z by hand to verify the answer.

#enter these commands into R, run the code, then alter it slightly to see what happens.
x <- 5
y <- 7
q <- 2
z <- (x^2 + 3*y - 10)/q
z

## [1] 18
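
As one example of altering the code slightly, the sketch below removes the parentheses from the formula for z. The name z_no_parens is ours, chosen for illustration. Because R follows the usual order of operations, only the 10 is divided by q.

```r
x <- 5
y <- 7
q <- 2

# Original formula: the whole numerator is divided by q.
z <- (x^2 + 3*y - 10)/q           # (25 + 21 - 10)/2

# Without parentheses, division happens before subtraction,
# so only 10 is divided by q.
z_no_parens <- x^2 + 3*y - 10/q   # 25 + 21 - 5

z
z_no_parens
```

The first value is 18, as before; the second is 41.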

R can also be used as a calculator on data. Draw a dot plot of the data \[2, 3, 4, 5, 6\] and by hand calculate the mean. Now use R to calculate the mean and standard deviation using the commands below. Can you write code to calculate the mean and standard deviation of data_2?

data_1 = c(2, 3, 4, 5, 6)
mean_1 = mean(data_1)
SD_1 = sd(data_1)
print("This is data_1:")

## [1] "This is data_1:"

data_1

## [1] 2 3 4 5 6

mean_1

## [1] 4

SD_1

## [1] 1.581139

data_2 = 2*data_1
print("This is data_2:")

## [1] "This is data_2:"

data_2

## [1]  4  6  8 10 12
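
To connect the by-hand calculation with R's output, the sketch below recomputes the mean and standard deviation of data_1 from their definitions. It assumes the sample standard deviation with divisor n - 1, which is what sd( ) computes. The names ending in _by_hand are ours.

```r
data_1 <- c(2, 3, 4, 5, 6)
n <- length(data_1)

# Mean: add up the values and divide by how many there are.
mean_by_hand <- sum(data_1)/n

# Sample standard deviation: square root of the sum of squared
# deviations from the mean, divided by n - 1.
sd_by_hand <- sqrt(sum((data_1 - mean_by_hand)^2)/(n - 1))

mean_by_hand
sd_by_hand
```

Both values should agree with mean_1 and SD_1 above.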

Reading in data from a website.

Next let’s import a data set from our textbook into our workspace. The data is available at https://www.isi-stats.com/isi2nd/data.html under Chapter 2, Example 2.1/2.2: College Midwest. The observational units of the data are students from a college in the Midwest. After reading the data you should see an object named ‘CollegeMidwest’ in the Environment pane, containing 2919 observational units and 2 variables.

CollegeMidwest <- read_table("http://www.isi-stats.com/isi/data/chap3/CollegeMidwest.txt")
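
A quick way to confirm that an import worked is to check the number of rows and columns. The sketch below uses a three-row sample in the same layout (not the real file) and base R's read.table( ) rather than read_table( ), so that it runs without an internet connection. The names sample_text and sample_data are ours.

```r
# Small sample in the same two-column layout as the course data.
sample_text <- "OnCampus CumGpa
N 2.92
N 3.59
Y 2.98"

# header = TRUE tells R that the first line holds the variable names.
sample_data <- read.table(text = sample_text, header = TRUE)

nrow(sample_data)    # number of observational units
ncol(sample_data)    # number of variables
names(sample_data)   # variable names
```

For the real data, nrow(CollegeMidwest) and ncol(CollegeMidwest) should report 2919 and 2.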

Basic commands to view the data.

The following commands can be used to view parts of the data: glimpse, head, tail. Try entering them into your console.

glimpse(CollegeMidwest)

## Rows: 2,919
## Columns: 2
## $ OnCampus <chr> "N", "N", "N", "N", "N", "Y", "Y", "Y", "N", "Y", "Y", "Y", "…
## $ CumGpa   <dbl> 2.92, 3.59, 3.36, 2.47, 3.46, 2.98, 3.07, 3.79, 3.21, 3.67, 3…

To look at the first six rows of the data use the ‘head( )’ command.

head(CollegeMidwest)

## # A tibble: 6 × 2
##   OnCampus CumGpa
##   <chr>     <dbl>
## 1 N          2.92
## 2 N          3.59
## 3 N          3.36
## 4 N          2.47
## 5 N          3.46
## 6 Y          2.98

To look at the last six rows of the data use the ‘tail( )’ command.

tail(CollegeMidwest)

## # A tibble: 6 × 2
##   OnCampus CumGpa
##   <chr>     <dbl>
## 1 Y          3.09
## 2 Y          2.8 
## 3 Y          4   
## 4 N          3.35
## 5 Y          3.33
## 6 Y          2.99

You can also view the data in another tab by clicking on ‘CollegeMidwest’ in the Environment pane.
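
Both head( ) and tail( ) accept an optional argument n that sets how many rows to show. Here is a minimal sketch on a small made-up data frame (the data frame and its values are for illustration only):

```r
# Made-up data in the same layout as CollegeMidwest.
small_data <- data.frame(
  OnCampus = c("N", "Y", "Y", "N", "Y", "Y", "N", "Y"),
  CumGpa   = c(2.9, 3.1, 3.8, 2.5, 3.3, 3.6, 3.0, 3.4)
)

head(small_data, n = 3)   # first three rows
tail(small_data, n = 2)   # last two rows
```

The same argument works on the course data, for example head(CollegeMidwest, n = 10).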

Data Summary Statistics.

The commands below extract the GPA data from CollegeMidwest, assign it to the new variable GPA, and then calculate summary statistics for the GPA data. Can you calculate the mean GPA? The standard deviation of the GPAs? Try it!

GPA <- CollegeMidwest$CumGpa

summary(GPA)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   3.010   3.410   3.288   3.700   4.000
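
Several entries in the summary( ) output can also be computed on their own. The sketch below uses a short made-up vector of GPAs (gpa_sample, our name) so that it runs without the course data; the same functions work on GPA.

```r
# Made-up values for illustration only.
gpa_sample <- c(2.5, 3.0, 3.2, 3.6, 4.0)

min(gpa_sample)
median(gpa_sample)
max(gpa_sample)

# First and third quartiles (R's default quantile method).
quantile(gpa_sample, probs = c(0.25, 0.75))
```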

Data dot plots, tables and bar charts.

We can create a dot plot of the quantitative GPA data using the code shown below. The ‘gg’ in ggplot is an abbreviation of ‘grammar of graphics’, an approach that creates graphics by layering several different elements. The first element is ggplot(CollegeMidwest, aes(x=CumGpa)). The arguments inside ggplot( ) are the name of the data and the aesthetics, which are the collection of variables that will be used in the plot. Next, geom_dotplot( ) draws the dots, and labs( ) sets the axis label.

ggplot(CollegeMidwest, aes(x=CumGpa))+geom_dotplot(binwidth = 0.05, dotsize = 0.25)+labs(x="Grade Point Average")
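
To make the layering idea concrete, the sketch below builds a dot plot one step at a time and stores each stage in a variable. It uses a small made-up data frame in place of CollegeMidwest, and the names gpa_data, base_plot, and dot_plot are ours.

```r
library(ggplot2)  # also loaded by library(tidyverse)

# Made-up values for illustration only.
gpa_data <- data.frame(
  CumGpa = c(2.5, 2.9, 3.0, 3.0, 3.2, 3.4, 3.4, 3.4, 3.7, 4.0)
)

# Step 1: the data and the aesthetics (which variable goes on the x axis).
base_plot <- ggplot(gpa_data, aes(x = CumGpa))

# Step 2: add the dots. Step 3: add the axis label.
dot_plot <- base_plot +
  geom_dotplot(binwidth = 0.1) +
  labs(x = "Grade Point Average")

# Typing the name of the stored plot displays it.
dot_plot
```

Storing the plot in a variable is optional, but it makes it easy to add or change one element at a time.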

Next we create a table of counts from the categorical OnCampus variable in CollegeMidwest. The output consists of the number of students reporting no, they don’t live on campus and the number reporting yes, they do live on campus.

table_OC <- table(CollegeMidwest$OnCampus)
table_OC

## 
##    N    Y 
##  654 2265
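
Counts are often easier to interpret as proportions. As a sketch, the code below re-enters the two counts shown above by hand (so that it runs without the data file) and divides each by the total. With the real table, table_OC/sum(table_OC) or prop.table(table_OC) does the same thing. The names counts, total, and proportions are ours.

```r
# Counts copied from the table output above.
counts <- c(N = 654, Y = 2265)

total <- sum(counts)          # all students in the data set
proportions <- counts/total   # divides each count by the total

total
round(proportions, 3)
```

The total matches the 2919 observational units reported earlier, and roughly 78% of the students live on campus.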

From this table we can create a bar chart of the categorical OnCampus variable using the following command. Notice that this command uses the table we created above, so creating the bar chart requires that we previously created the table of counts.

barplot(table_OC, main="Campus Housing Distribution")

An alternate collection of commands to create a bar chart is shown below. These commands use the tidyverse package that we installed at the beginning of our lab. The command starts with ‘ggplot’, whose name comes from ‘grammar of graphics’. Inside the ggplot function is aes( ), which adds the aesthetics to our graphic. Next comes geom_bar, which tells R that we want to create a bar chart. Last is ggtitle, which adds the title to the graph. The plus operator, ‘+’, puts the pieces together, ggplot(…)+ geom_bar( )+ ggtitle( ), to create the final graphic.

ggplot(CollegeMidwest, aes(OnCampus, fill=OnCampus))+geom_bar()+ggtitle("Bar Chart of Housing on Campus")

Have questions about ggplot? Try the command below.

?ggplot

Introduction to R Markdown

This next part of the lab contains exercises for you to complete and turn in. The document you turn in will include
- the text for each exercise,
- the R commands you write,
- the output of the R commands, and
- any additional text needed to explain the results.

Instead of copying and pasting into a Word document, you will write your solutions in an R Markdown file, which is a single file created in RStudio that will include your text, code, code output, and your conclusions. The document that you are reading right now was created using R Markdown. To open a new R Markdown file use the New File button (green button with plus and white square) in the upper left hand corner of the screen. See the red arrow in the image below.

After clicking the New File button, select “R Markdown” from the dropdown menu. Next select “From Template”, select “LabReport {openintro}”, and name your document YourfirstnameLastname-Lab1.Rmd

The R Markdown file should now be open in a new pane in the upper left corner. From here forward, you will write text and code in the text chunks and code chunks of your Lab1.Rmd file. The following steps will be repeated for every lab assignment.

Change the Lab Name to “Lab 1: Introduction to R and RStudio” or the name of the current lab.
Change the Author Name to your name.
Load the four packages readr, tidyverse, rmarkdown, openintro into your R Markdown file using the library() command.
For each exercise,
- copy the text of the exercise into the text chunk and
- put your code in a code chunk that follows the text
- write comments, explanations, or conclusions in the space following the code.

You can create new code chunks by clicking the green button with the C and selecting “R” to make an R code chunk. See the blue arrow in the image above.

1. Import the National Anthem Time data from the website http://www.isi-stats.com/isi/data/prelim/NationalAnthemTimes.txt. Name the data NationalAnthemTime. Note that this name has no spaces. This is an important feature of a variable name.
2. Take a look at the National Anthem Time data using the head and tail commands.
3. Make a table of counts for the categorical variable Genre. Make sure to give your table a name, as we will be using the table in Exercise 4.
4. Create a bar chart of the Genre variable using your table of counts from Exercise 3. Give your graph the title “19xx - 20xx Super Bowl National Anthem Singers by Genre” but replace the xx’s with the actual range of years for this data.
5. Use R as a calculator to find the proportion of singers from the R&B/Soul Genre during the range of years in question. Use the variable name ‘proportion’ and write the code to calculate and display the proportion as, \[ \textrm{proportion} = \dots \]
6. Create a dot plot of the quantitative variable Time from the NationalAnthemTimes data. Adjust the binwidth and dotsize so that the dots are easily countable. Give your graph an appropriate title and include the time units. Recall that we saw this data in the initial chapter of our zyBook. You can find the description for this data by looking in the textbook’s “Chapter 12”.
7. BONUS. Use the ‘ggplot’ command to create a bar chart for the categorical Genre variable.
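
On the note in the first exercise that a variable name has no spaces: base R's make.names( ) shows how R would repair a name that is not valid. This is only an illustration; the exercises do not require it.

```r
# A name with spaces is not valid; R's repair replaces each space with a dot.
make.names("National Anthem Time")

# A name without spaces is already valid and comes back unchanged.
make.names("NationalAnthemTime")
```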