Every Lab will have two parts. The Pre-Lab is a guided code-with-me activity. Don’t worry if you fall behind or miss something. The solutions to the Pre-Lab will be posted online for you to consult whenever you wish. The Lab assignment is a collection of questions (similar to our in-class Explorations) that you will answer using text, code and mathematical expressions. The Pre-Lab will give you helpful examples to follow when working on the Lab. Your solutions to the Lab will be written in an Rmarkdown document that will contain your text, mathematical expressions, code and the output of your code.
We will be using the software R, RStudio, and the remote server Posit Cloud throughout the semester to learn about the statistical concepts and how to analyze real data. To straighten out which is which: R is the name of the programming language itself and RStudio is a convenient interface that we will use within Posit Cloud.
Initially there are three panes in the RStudio interface. The pane in the upper right contains your Environment workspace as well as a History of the commands that you’ve entered. When you import data or define a variable, it will appear in your Environment.
Any Files that you upload or generate will show up in the pane in the lower right corner. The lower right also contains a list of installed Packages that you can click on to put in your working library. R packages have complex dependency relationships, but often if you need a package that is not installed, R will ask if you want to install it. When this happens, just follow the prompts to install the package.
The pane on the left is where the action happens. The current display shows the Console. Every time you launch RStudio, it will have the same text at the top of the console telling you the version of R that you’re running. Below that information is the prompt. As its name suggests, this prompt is a request for a command. Initially, interacting with R is all about typing commands and interpreting the output. These commands and their syntax have evolved over decades (literally) and now provide what many users feel is a fairly natural way to access data and organize, describe, and invoke statistical computations.
TO DO NEXT: Within the Files pane in the lower right, click on the Lab 1: Pre-Lab folder and open the document Prelab1.Rmd. You can now view and edit the Rmarkdown file that was used to create this document.
Rather than typing commands into the Console we will be writing our work in an R Markdown document. As described in this video Why use Rmarkdown?, we are using R Markdown to combine our code, data visualization & analysis, and written conclusions into one lab report. Here is a reference document that gives an overview of basic R Markdown components and commands.
When beginning any new lab or project, we will start by making sure certain “packages” in R are available for us to use. This is a two-step process. First we install the package with the install.packages command. Second we make the package accessible to us using the library command. The two main packages we will use are called mosaic and ggformula. We install the mosaic package with the command
\[ \textrm{ install.packages(''mosaic'') }\] This command downloads and installs the ‘mosaic’ package so that it can be loaded later. The mosaic package includes many tools for calculating summary statistics, performing theory-based hypothesis tests, and calculating confidence intervals. Since we are using Posit Cloud, you won’t have to perform the install step, because I will be sharing assignments with you that already have the necessary packages installed. I’m mentioning the installation step here in case you want to use Posit Cloud for your own projects. In that case, you’ll need to install the packages yourself.
For now we will skip the install step and load the two required packages: mosaic and ggformula.
TO DO NEXT: Here is the command to load the mosaic package: library(mosaic). Write the command to load the ggformula package.
Before doing anything fancy, notice that R can be used as a calculator. Run the code and calculate the value of z by hand to verify the answer.
#This is a comment. Any line of code with a # sign in front
# is a comment and ignored by R.
#After running the code (in Posit Cloud), alter it slightly and see what happens.
x <- -1
y <- 4
q <- 2
z <- (x^2 + 3*y - 10)/q
z
## [1] 1.5
R can also be used as a calculator with data. Imagine a dot plot of the data \[2, 3, 4, 5, 6.\] Determine the mean and estimate the standard deviation of the data. Next use R to calculate the mean and standard deviation using the commands below.
## [1] 2 3 4 5 6
## [1] 4
## [1] 1.581139
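The output above is consistent with commands like the following. This is a minimal base-R sketch; the vector name `dots` is our own choice for illustration and does not come from the lab.

```r
# Store the five data values in a vector (the name `dots` is illustrative).
dots <- c(2, 3, 4, 5, 6)

dots        # print the data
mean(dots)  # arithmetic mean of the values
sd(dots)    # sample standard deviation (divides by n - 1, not n)
```

Note that `sd()` uses the sample formula with \(n - 1\) in the denominator, which is why the result is \(\sqrt{10/4}\) rather than \(\sqrt{10/5}\).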
We’ll load the example data, GSS22clean.csv. It is available at this URL: https://raw.githubusercontent.com/IJohnson-math/Math138/main/GSS22clean.csv. Since the data is a csv file, we will use the read.csv() function to read in the data.
# this command will load data and save it as GSS22
GSS22 <- read.csv("https://raw.githubusercontent.com/IJohnson-math/Math138/main/GSS22clean.csv")
This dataset comes from the 2022 General Social Survey (GSS), which is collected by NORC at the University of Chicago. It contains data from a random sample of households from the United States. This survey has been running since 1972, so it is very useful for studying trends in American life. The data I’ve given you is a subset of the questions asked in the survey, and the data has been cleaned to make it easier for us to use. But, there are still some messy aspects which we’ll discover as we analyze it repeatedly this semester.
NOTE: We will learn two basic commands depending on whether the data is a .txt file or a .csv.
The following commands can be used to view parts of the data: glimpse, head, tail.
## Rows: 3,544
## Columns: 49
## $ year <int> 2022, 2022, 2022, 2022, 2022, …
## $ id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10,…
## $ work_status <chr> "full time", "in school", "ful…
## $ hours_worked_last_week <int> 40, NA, 52, NA, NA, 50, 30, 40…
## $ works_for <chr> NA, "someone else", "someone e…
## $ marital_status <chr> "divorced", "married", "divorc…
## $ siblings <int> 1, 3, 1, 1, 2, 1, 3, 3, 4, 6, …
## $ children <int> 1, 2, 1, 0, 2, 0, 0, 0, 1, 4, …
## $ age <int> 72, 80, 57, 23, 62, 27, 20, 47…
## $ age_when_firstchild_born <int> 27, 24, 27, NA, 21, NA, NA, NA…
## $ years_education <int> 16, 18, 12, 16, 14, 12, 12, 16…
## $ degree <chr> "bachelors", "graduate", "high…
## $ major <chr> "sociology", "business adminis…
## $ sex <chr> "FEMALE", "MALE", "FEMALE", "F…
## $ race <chr> "WHITE", "WHITE", "WHITE", "WH…
## $ residence_type <chr> "suburb near large city", "sub…
## $ family_living_location_when_sixteen <chr> "middle atlantic", "east north…
## $ family_income_when_sixteen <chr> "above average", "below averag…
## $ born_in_US <chr> "yes", "yes", "yes", "yes", "y…
## $ number_grandparents_born_outside_US <int> 4, NA, 1, 0, 2, 0, 4, 4, 0, NA…
## $ total_family_income <chr> "$90k to $110k", NA, "$40k to …
## $ region_of_interview <chr> "new england", "new england", …
## $ party_id <chr> "Democrat", "Independent", "Re…
## $ remember_voting_in_2016 <chr> "voted", "voted", "voted", "in…
## $ Hillary_or_Trump <chr> "Clinton", "Clinton", "Trump",…
## $ if_you__would_have_.voted_who_for <chr> NA, NA, NA, "Clinton", NA, NA,…
## $ political_views <chr> "liberal", "slightly conservat…
## $ money_space_exploration_program <chr> "about right", "too little", N…
## $ money_improve_protect_environment <chr> "too little", "too little", NA…
## $ money_improve_protect_health <chr> "too little", "about right", N…
## $ money_helping_bigcities <chr> "about right", "too little", N…
## $ money_decrease_crime <chr> "too little", "too little", NA…
## $ money_addressing_drug_addiction <chr> "about right", "about right", …
## $ money_improving_education <chr> "too little", "about right", N…
## $ money_improving_conditions_blacks <chr> "too little", "too much", NA, …
## $ money_national_defense <chr> "about right", "about right", …
## $ money_on_foreign_aid <chr> "about right", "about right", …
## $ money_on_welfare <chr> "about right", "too much", NA,…
## $ should_gov_reduce_income_inequality <int> 1, NA, 5, NA, 1, 1, 3, 1, NA, …
## $ permit_to_buy_gun <chr> "favor", "oppose", "favor", "f…
## $ should.be.legal_marijuana <chr> NA, NA, NA, NA, NA, NA, "shoul…
## $ attend_religious_services <chr> "about once or twice a year", …
## $ happiness <chr> "not too happy", "not too happ…
## $ personal_health <chr> "good", "good", "good", "good"…
## $ sex_education_public_schools <chr> NA, "in favor", NA, "in favor"…
## $ premarital_sex <chr> NA, "not wrong at all", NA, "n…
## $ sometimes_hard_spanking_child_necessary <chr> NA, "disagree", NA, "agree", N…
## $ gun_in_home <chr> "no", "no", "no", "no", "no", …
## $ read_newspaper <chr> NA, "every day", NA, "never", …
You can also view the data in another tab by clicking on ‘GSS22’ in the Environment pane. This allows you to scroll up and down, and left and right to view the data.
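To see what these viewing commands return without loading the full survey, here is a small sketch on a made-up data frame. The name `toy` is hypothetical; its values are the first eight entries of three GSS22 columns as printed above. Base R's `str()` is used here as a stand-in that plays a similar role to `glimpse()`.

```r
# A tiny data frame standing in for GSS22 (first eight printed values of three columns).
toy <- data.frame(
  id       = 1:8,
  age      = c(72, 80, 57, 23, 62, 27, 20, 47),
  siblings = c(1, 3, 1, 1, 2, 1, 3, 3)
)

head(toy, 3)  # first three rows
tail(toy, 3)  # last three rows
str(toy)      # one line per column: name, type, first few values
```

With the real data, the same calls would be applied to `GSS22` in place of `toy`.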
Use the commands glimpse, head, and tail on the GSS22 data to answer the following questions.
Number of children, age of respondent, highest year of school completed, number of siblings, & many others.
Pick one quantitative variable and calculate its mean value.
Name a couple of variables that are categorical.
college major 1, college major 2, diploma/GED/Other, self-employed or works for somebody, Occupation code, & many others.
Pick one categorical variable and create a table of counts for the categories.
Pick out an observation to write about. What are some characteristics of this observation?
Example: Observational unit number 10 is a 55 year old father of 2. He works a 40 hour work week in a private (non-government) job. He has a high-school diploma with a total 12 years of education, his spouse has 11 years, father has 6 years and mother has 20 years of education.
Examples: Do people who are self-employed work more hours than people who work for someone else? Do religious people support the death penalty similarly, regardless of their religion? Is a person’s happiness level associated with their age? Are people with higher household incomes more or less likely to believe that marijuana should be illegal?
Calculate the average number of years of education of the people surveyed in the 2022 GSS. We use the function mean( ) applied to the variable ~years_education from our data file data=GSS22 and we will remove all NA values using the code na.rm=TRUE.
## [1] 14.10812
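The role of `na.rm=TRUE` is easiest to see on a short vector that contains missing values. The sketch below uses base R and the first eight `hours_worked_last_week` values printed in the `glimpse` output; the vector name `hours` is our own.

```r
# First eight hours_worked_last_week values from the glimpse() output above.
hours <- c(40, NA, 52, NA, NA, 50, 30, 40)

mean(hours)                # returns NA: one missing value makes the mean missing
mean(hours, na.rm = TRUE)  # drops the NAs first, then averages the 5 remaining values

# With mosaic loaded and GSS22 read in, the pieces described in the text
# would be assembled as:
#   mean(~years_education, data = GSS22, na.rm = TRUE)
```

Without `na.rm = TRUE`, any column with even a single NA would report NA as its mean.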
Calculate the average number of siblings for the people surveyed in the 2022 GSS.
## [1] 3.427678
Calculate the number and proportion of people that are self-employed or work for someone else. We use the function tally( ) to create a table of counts for the variable ~works_for in the data file data=GSS22. If we want our tally to display the proportion of values in the various categories we add the code format="proportion".
## works_for
## self-employed someone else <NA>
## 393 3012 139
## works_for
## self-employed someone else <NA>
## 0.11089165 0.84988713 0.03922122
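As a check on how the proportions relate to the counts, the following base-R sketch rebuilds a `works_for` vector from the counts reported above and tabulates it. It uses `table()` rather than mosaic's `tally()`, so treat it as an equivalent illustration, not the lab's own code.

```r
# Rebuild the column from the reported counts:
# 393 self-employed, 3012 someone else, 139 missing (NA).
works_for <- rep(c("self-employed", "someone else", NA),
                 times = c(393, 3012, 139))

counts <- table(works_for, useNA = "ifany")  # counts, with NA kept as its own category
counts

props <- counts / sum(counts)  # each count divided by all 3544 respondents
props

# With mosaic loaded and GSS22 read in, the calls described in the text would be:
#   tally(~works_for, data = GSS22)
#   tally(~works_for, data = GSS22, format = "proportion")
```

Because the NA category is included in the denominator, the three proportions sum to 1.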
Calculate the number and proportion of people that have attained various educational degrees.
## degree
## associate bachelors graduate
## 317 737 477
## high school less than high school
## 1654 359
## degree
## associate bachelors graduate
## 0.08944695 0.20795711 0.13459368
## high school less than high school
## 0.46670429 0.10129797
Suppose we knew that a few years back 11.1% of the US adult population attained an education level of “less than high school”. From the 2022 GSS survey it looks as though the proportion may have dropped. Perform a hypothesis test to determine if the drop is significant.
Our null and alternative hypotheses are
\[ H_0: \pi = 0.111 \textrm{ vs. } H_a: \pi < 0.111\] Our statistic is \(\hat{p} = 0.10129797\), as shown above.
Perform the One Proportion Hypothesis test. We use the function prop.test( ) applied to the variable ~degree from the data file data=GSS22, with success corresponding to success = "less than high school". Our hypothesized population proportion (our \(\pi\) value) is p=0.111 and our alternative hypothesis is less than, <, which we write in the code as alternative="less".
prop.test(~degree, data=GSS22, success="less than high school", p=0.111, alternative="less", correct=FALSE)
##
## 1-sample proportions test without continuity correction
##
## data: GSS22$degree [with success = less than high school]
## X-squared = 3.3806, df = 1, p-value = 0.03298
## alternative hypothesis: true p is less than 0.111
## 95 percent confidence interval:
## 0.0000000 0.1099411
## sample estimates:
## p
## 0.101298
#alternate code to calculate the p-value from the counts.
#The 359 is the success count and n=3544 is the sample size.
#p=0.111 is the null hypothesis value of pi and our alternative hypothesis is less than.
#correct=FALSE is omitted here, so R applies the continuity correction by default.
prop.test(359, 3544, p=0.111, alternative="less")
##
## 1-sample proportions test with continuity correction
##
## data: 359 out of 3544
## X-squared = 3.283, df = 1, p-value = 0.035
## alternative hypothesis: true p is less than 0.111
## 95 percent confidence interval:
## 0.0000000 0.1100872
## sample estimates:
## p
## 0.101298
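To see where the X-squared value and p-value of the test without continuity correction come from, the standardized statistic can be computed by hand. This is a sketch under the usual normal approximation; the variable names are our own, and the counts and null value are taken from the text.

```r
# Counts and null value taken from the lab text.
successes <- 359    # respondents with "less than high school"
n         <- 3544   # sample size
pi0       <- 0.111  # hypothesized population proportion

p_hat <- successes / n                # sample proportion
se0   <- sqrt(pi0 * (1 - pi0) / n)    # standard deviation of p_hat under the null
z     <- (p_hat - pi0) / se0          # standardized statistic
p_val <- pnorm(z)                     # left-tail area, because Ha is "less than"

z^2    # compare with X-squared from prop.test(..., correct = FALSE)
p_val  # compare with the p-value from prop.test(..., correct = FALSE)
```

The X-squared statistic reported by `prop.test` is the square of this z statistic, which is why the one-sided p-value can be read from the normal curve.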
The validity conditions for one proportion inference are that there are at least 10 successes and 10 failures. Under the null hypothesis, the expected number of successes is the sample size times the hypothesized proportion, \(n \pi\), and the expected number of failures is \(n(1 - \pi)\).
## [1] 393.384
## [1] 3150.616
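The two counts above can be reproduced with a short calculation. This is a minimal sketch; the names `n` and `pi0` are our own labels for the sample size and the null-hypothesis proportion given in the text.

```r
n   <- 3544    # sample size
pi0 <- 0.111   # null-hypothesis proportion

n * pi0        # expected number of successes under the null
n * (1 - pi0)  # expected number of failures under the null
```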
The success and failure counts are both much larger than 10, so our validity conditions are satisfied. In cases where the counts are close to 10 or just under 10, a continuity correction should be used and the default in R is to use the correction (better safe than sorry). Since our success/failure counts here are much larger than 10 we include the code correct=FALSE so that the continuity correction is not applied.
Conclusion: From the 1-sample proportion test without continuity correction we have a p-value of 0.03298. This is strong evidence against the null hypothesis. We reject the null hypothesis and conclude that in 2022 the true proportion of adults with less than a high school education is less than 11.1%.