Basic Statistical Analysis in R for SOC201 Research Methods in Sociology
Our Goal
This tutorial aims to help students use R to do some basic statistical analysis for SOC201 Research Methods in Sociology.
Install R and Rstudio
Let us do it from scratch. We are using RStudio in this tutorial. RStudio is the most popular R code editor.
Before installing RStudio, you need to install R first. You can download R by clicking here https://cran.r-project.org/. Choose the appropriate version for your system (e.g., Windows, Mac). Be careful to follow the installation instructions, especially regarding required supporting software such as XQuartz if you are a Mac user.
After this step, please go to the RStudio website to download and install RStudio Desktop. You can click here https://rstudio.com/products/rstudio/download/ and choose the free version.
Then, you can use the install.packages function to install the necessary R packages. I suggest you copy and paste the following code to install some common packages for data processing and visualization. You can add any packages you want to install by editing the "packages" variable.
Packages are developed by the R user community to achieve different goals. For instance, tidyverse is a wrapper for many packages designed for data science, including dplyr, tidyr, readr, purrr, tibble, etc. Click here for more details https://www.tidyverse.org/packages/
# pacman is a package that helps you manage packages: it loads installed packages,
# or installs and then loads packages you have not installed yet
# you can use install.packages("package_name") to install necessary packages
# in this tutorial, we need the following packages:
if (!requireNamespace("pacman", quietly = TRUE)) install.packages("pacman")
library(pacman)
packages <- c("tidyverse", "ggplot2", "DescTools", "gtools", "MASS")
p_load(packages, character.only = TRUE)
LET US REVIEW SOME BASICS ABOUT R
Assign Values to Variables
In R, you can use the assignment operator <- to create new variables. You will also see people use the equals sign (=) for assignment, but for code consistency we should stick to <-.
In RStudio on Windows, you can press "Alt" and "-" simultaneously to type <-; on a Mac, the shortcut is "Option" and "-".
# One example of assigning values to variables
# let us create a variable "a" and assign a value of 5 to it.
a <- 5
a
## [1] 5
b <- 100
b
## [1] 100
c <- "hello world"
c
## [1] "hello world"
Arithmetic and Logical Operators
Arithmetic Operator | Description |
---|---|
+ | addition |
- | subtraction |
* | multiplication |
/ | division |
^ or ** | exponentiation |
Logical Operator | Description |
---|---|
> | greater than |
>= | greater than or equal to |
== | exactly equal to |
!= | not equal to |
Let us see some examples
# Note that a = 5 and b = 100
a + b
## [1] 105
a - b
## [1] -95
a * b
## [1] 500
a ^ b
## [1] 7.888609e+69
# but you cannot do this across different data types (c is a character string)
# h <- a + c
# h  # this throws "non-numeric argument to binary operator"
# let us try some logical operation
# Note that a = 5
a == 5
## [1] TRUE
a >= 5
## [1] TRUE
a != 100
## [1] TRUE
Data Types
R has a wide variety of data types including scalars, vectors (numerical, character, logical), matrices, data frames, and lists.
Types | Examples |
---|---|
scalars | a <- 1; b <- "hello world"; c <- (a == 1) |
vectors | a <- c(1, 2, 3, 4); b <- c("a", "b", "c") |
matrices | a 2 by 3 matrix: a <- matrix(c(2, 4, 3, 1, 5, 7), nrow = 2, byrow = TRUE) |
data frames | A data frame is used for storing data tables. It is a list of vectors of equal length. For example, df <- data.frame(n, s, b) combines three vectors n, s, b of the same length. |
lists | Lists contain elements of different types: numbers, strings, vectors, even another list, a matrix, or a function. A list is created with the list() function: list_example <- list("Red", "Green", c(1, 2, 3), TRUE, 520, 10000) |
Let us see some examples:
a <- c(1, 2, 3, 4)
class(a)
## [1] "numeric"
b <- c("a", "b", "c")
b
## [1] "a" "b" "c"
class(b)
## [1] "character"
# note this reassigns a from a numeric vector to a matrix
a <- matrix(c(2, 4, 3, 1, 5, 7), nrow = 2, byrow = TRUE)
a
## [,1] [,2] [,3]
## [1,] 2 4 3
## [2,] 1 5 7
class(a)
## [1] "matrix" "array"
# data frame
n <- c(2, 3, 5)
s <- c("aa", "bb", "cc")
b <- c(TRUE, FALSE, TRUE)
df <- data.frame(n, s, b)
df
## n s b
## 1 2 aa TRUE
## 2 3 bb FALSE
## 3 5 cc TRUE
class(df)
## [1] "data.frame"
# a list can mix data types
list_example <- list("Red", "Green", c(1, 2, 3), TRUE, 520, 10000)
list_example
## [[1]]
## [1] "Red"
##
## [[2]]
## [1] "Green"
##
## [[3]]
## [1] 1 2 3
##
## [[4]]
## [1] TRUE
##
## [[5]]
## [1] 520
##
## [[6]]
## [1] 10000
class(list_example)
## [1] "list"
LET US GET IT STARTED
We Are Using COVID-19 Cases in the U.S. as an Example
Let us load the COVID-19 by-state data from the NYT:
data <- read_csv(url("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv"))
Let us take a look at the data structure:
head(data)
## # A tibble: 6 x 5
## date state fips cases deaths
## <date> <chr> <chr> <dbl> <dbl>
## 1 2020-01-21 Washington 53 1 0
## 2 2020-01-22 Washington 53 1 0
## 3 2020-01-23 Washington 53 1 0
## 4 2020-01-24 Illinois 17 1 0
## 5 2020-01-24 Washington 53 1 0
## 6 2020-01-25 California 06 1 0
If you look at the data, it contains cumulative daily confirmed cases for each state. We want a dataset that keeps only the most recent observation per state:
data <- data %>%
  group_by(state) %>%
  filter(date == max(date)) %>%
  ungroup()
The data is structured by state. Let us see what the frequency distribution looks like. We can use ggplot to plot a histogram of deaths by U.S. state. The x-axis is the number of deaths, and the y-axis is the number of states.
data %>%
  ggplot() +
  geom_histogram(aes(x = deaths/1000)) +
  labs(x = "total deaths in thousands", y = "number of states")
Let us do some descriptive statistics, focusing on measures of central tendency first:
summary(data)
## date state fips cases
## Min. :2020-10-28 Length:55 Length:55 Min. : 92
## 1st Qu.:2020-10-28 Class :character Class :character 1st Qu.: 35928
## Median :2020-10-28 Mode :character Mode :character Median :108863
## Mean :2020-10-28 Mean :162426
## 3rd Qu.:2020-10-28 3rd Qu.:186764
## Max. :2020-10-28 Max. :931576
## deaths
## Min. : 2
## 1st Qu.: 594
## Median : 1875
## Mean : 4140
## 3rd Qu.: 4440
## Max. :33107
mean(data$cases)
## [1] 162426.3
Let us take a look at the dispersion of our state-level covid cases and deaths (e.g., range, sd).
sd(data$cases)
## [1] 204395
sd(data$deaths)
## [1] 6095.013
summary(data[, c("cases", "deaths")])
## cases deaths
## Min. : 92 Min. : 2
## 1st Qu.: 35928 1st Qu.: 594
## Median :108863 Median : 1875
## Mean :162426 Mean : 4140
## 3rd Qu.:186764 3rd Qu.: 4440
## Max. :931576 Max. :33107
quantile(data$deaths)
## 0% 25% 50% 75% 100%
## 2.0 594.0 1875.0 4439.5 33107.0
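Beyond the range and standard deviation, R has convenience functions for other dispersion measures. A quick sketch, assuming the same `data` object as above:

```r
# interquartile range: distance between the 25th and 75th percentiles
IQR(data$deaths)

# variance (the squared standard deviation)
var(data$deaths)

# range() returns the min and max; diff() turns that into a single spread value
diff(range(data$deaths))
```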
Now, suppose you are interested in partisan differences in covid cases or deaths.
# let us load governor database
gov <- read_csv(url("https://raw.githubusercontent.com/CivilServiceUSA/us-governors/master/us-governors/data/us-governors.csv"))
# let us merge gov data with our main data
data_gov <- data %>%
  left_join(gov, by = c("state" = "state_name")) %>%
  filter(!is.na(state_code))
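After a merge like this, it is worth checking which rows failed to match before filtering them out; territories in the NYT data may have no record in the governor file. A quick sketch using dplyr's `anti_join`:

```r
# rows in the covid data with no matching governor record
data %>%
  anti_join(gov, by = c("state" = "state_name")) %>%
  distinct(state)
```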
Let us say you are interested in a subgroup analysis. You want to see how the total cases vary across Republican and Democratic states; specifically, whether the number of total cases is over 50K.
# let us create a variable indicating whether the state has over 50k cases
data_gov <- data_gov %>%
  mutate(over_50k = ifelse(cases > 50000, 1, 0))
table(data_gov$party,data_gov$over_50k)
##
## 0 1
## democrat 8 16
## republican 6 20
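Because the two parties govern different numbers of states, row proportions are easier to compare than raw counts. A sketch using `prop.table` on the same crosstab:

```r
tab <- table(data_gov$party, data_gov$over_50k)
# margin = 1 computes proportions within each row (i.e., within each party)
round(prop.table(tab, margin = 1), 2)
```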
Descriptive Statistics - Measures of Association
Nominal Vars - lambda (\(\lambda\))
Goodman and Kruskal's lambda coefficient is a measure of association for nominal variables. Lambda ranges from 0.00 to 1.00. A lambda of 0.00 reflects no association between variables; a lambda of 1.00 is a perfect association. Lambda does not give you a direction of association: it simply suggests an association between two variables and its strength.
Lambda(x, y = NULL, direction = c("symmetric", "row", "column"), conf.level = NA, ...)
You can use ?Lambda to check the arguments of this function in R.
library(DescTools)
# we use lambda from package DescTools
# if you have not installed it, you can `install.packages("DescTools")`
# here is the link to the documentation<https://cran.rstudio.com/web/packages/DescTools/DescTools.pdf>
Lambda(x = data_gov$party, y = data_gov$over_50k)
## [1] 0.05263158
We found a pretty weak association.
Ordinal Vars - gamma(\(\gamma\))
Gamma is a measure of association for ordinal variables. Gamma ranges from -1.00 to 1.00. Again, a Gamma of 0.00 reflects no association; a Gamma of 1.00 reflects a positive perfect relationship between variables; a Gamma of -1.00 reflects a negative perfect relationship between those variables.
# Calculate Goodman and Kruskal's gamma statistic, a measure of association for ordinal factors in a two-way table.
# The function has interfaces for a table (matrix) and for single vectors.
# We need to create two ordinal variables:
# one indicating the level of total cases and another indicating the level of total deaths
# we use quantcut from gtools to split deaths and cases into deciles (10 quantiles)
# install.packages("gtools")
library(gtools)
data_gov <- data_gov %>%
  mutate(rank_death = quantcut(deaths, 10),
         rank_case = quantcut(cases, 10))
GoodmanKruskalGamma(data_gov$rank_death,data_gov$rank_case)
## [1] 0.8393881
We found a pretty strong positive relationship. It is intuitive, right? If a state has more confirmed cases, it also has more deaths.
Interval or Ratio Vars -Pearson’s r
Pearson’s r is a measure of association for continuous variables. Like Gamma, Pearson’s r ranges from -1.00 to 1.00.
cor(data_gov$cases, data_gov$deaths)
## [1] 0.7931014
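If you also want a significance test and a confidence interval for the correlation, base R's `cor.test` provides both (it defaults to Pearson's r):

```r
cor.test(data_gov$cases, data_gov$deaths)
```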
Regression Analysis
You can click here for a simple tutorial http://r-statistics.co/Linear-Regression.html
Let us say you are interested in explaining the total deaths in the U.S.
You think the total cases can help explain it, or maybe the party affiliation of the governor, or other variables. Let us run a demo here.
# first let us plot the scatterplot of two vars
scatter.smooth(x=data_gov$cases, y=data_gov$deaths, main="Deaths ~ Cases") # scatterplot
It seems there is a linear relationship. Let us fit a model.
Now that we have seen the linear relationship pictorially in the scatter plot and by computing the correlation, let's see the syntax for building the linear model. The function used for building linear models is lm(). The lm() function takes two main arguments: 1. Formula 2. Data. The data is typically a data.frame, and the formula is an object of class formula. The most common convention is to write the formula directly in place of the argument, as below.
linearMod <- lm(deaths ~ cases, data=data_gov) # build linear regression model on full data
summary(linearMod)
##
## Call:
## lm(formula = deaths ~ cases, data = data_gov)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4730.3 -1286.8 -584.3 -102.6 20766.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.122e+02 7.179e+02 0.435 0.666
## cases 2.380e-02 2.638e-03 9.021 6.6e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3857 on 48 degrees of freedom
## Multiple R-squared: 0.629, Adjusted R-squared: 0.6213
## F-statistic: 81.38 on 1 and 48 DF, p-value: 6.598e-12
This suggests that you can use the fitted equation to predict deaths: \[\hat{y} = 312.2 + 0.0238 \times \text{cases}\] The relationship is also statistically significant at the .001 level.
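Rather than plugging numbers into the equation by hand, you can let R compute fitted values with `predict`. A sketch for a hypothetical state with 100,000 cases:

```r
# predicted deaths for a hypothetical state with 100,000 cases
new_state <- data.frame(cases = 100000)
predict(linearMod, newdata = new_state)

# interval = "confidence" adds a confidence interval around the prediction
predict(linearMod, newdata = new_state, interval = "confidence")
```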
You can also add more independent variables. For instance:
linearMod2 <- lm(deaths ~ cases+party, data=data_gov) # build linear regression model on full data
summary(linearMod2)
##
## Call:
## lm(formula = deaths ~ cases + party, data = data_gov)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5758.0 -1428.4 -697.5 533.0 19783.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.240e+03 8.909e+02 1.392 0.1706
## cases 2.391e-02 2.588e-03 9.236 3.9e-12 ***
## partyrepublican -1.821e+03 1.071e+03 -1.700 0.0958 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3783 on 47 degrees of freedom
## Multiple R-squared: 0.6505, Adjusted R-squared: 0.6356
## F-statistic: 43.74 on 2 and 47 DF, p-value: 1.868e-11
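To check whether adding party actually improves the fit, you can compare the two nested models with an F-test via `anova`:

```r
# F-test comparing the model with and without the party variable
anova(linearMod, linearMod2)
```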
Let Us Move to Inferential Statistics
Chi-square
library(MASS) # load the MASS package
tbl <- table(data_gov$rank_case, data_gov$rank_death) # the contingency table
chisq.test(tbl)
##
## Pearson's Chi-squared test
##
## data: tbl
## X-squared = 134, df = 81, p-value = 0.0001952
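Note that with a 10 x 10 table of deciles many cells are sparse, so the chi-squared approximation can be unreliable (R typically warns about this). One common workaround is a simulated p-value:

```r
# Monte Carlo simulation instead of the asymptotic chi-squared approximation
chisq.test(tbl, simulate.p.value = TRUE, B = 10000)
```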
T-Test
Let us say you want to test whether mean total deaths differ between states with male and female governors.
# pull() extracts the column as a plain vector, which is what t.test() expects
x <- data_gov %>% filter(gender == "male") %>% pull(deaths)
y <- data_gov %>% filter(gender == "female") %>% pull(deaths)
t.test(x,y)
##
## Welch Two Sample t-test
##
## data: x and y
## t = 2.5114, df = 38.662, p-value = 0.01632
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 634.8327 5897.4708
## sample estimates:
## mean of x mean of y
## 5110.707 1844.556
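As a sanity check on the "sample estimates" line above, you can compute the same group means directly with dplyr:

```r
data_gov %>%
  group_by(gender) %>%
  summarise(mean_deaths = mean(deaths), n_states = n())
```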
Let us plot the group mean difference
# install.packages("ggpubr")
library("ggpubr")
# dividing by 1000 gives deaths in thousands, so name the variable accordingly
data_gov <- data_gov %>%
  mutate(deaths_k = deaths/1000)
ggboxplot(data_gov, x = "gender", y = "deaths_k",
          color = "gender", palette = c("#00AFBB", "#E7B800"),
          ylab = "Deaths in thousands", xlab = "Groups")
The End