Basic Statistical Analysis in R for SOC201 Research Methods in Sociology
Our Goal
This tutorial aims to help students use R to do some basic statistical analysis for SOC201 Research Methods in Sociology.
Install R and Rstudio
Let us do it from scratch. We are using RStudio in this tutorial. RStudio is the most popular R code editor.
Before installing RStudio, you need to install R first. You can download R by clicking here https://cran.r-project.org/. Choose the appropriate version for your system (e.g., Windows, Mac). Be careful to follow the installation instructions, especially regarding required supporting software such as XQuartz if you are a Mac user.
After this step, please go to the RStudio website to download and install RStudio Desktop. You can click here https://rstudio.com/products/rstudio/download/ and choose the free version.
Then, you can use the install.packages function to install the necessary R packages. I suggest you copy and paste the following code to install some common packages for data processing and visualization. You can add any packages you want to install by editing the "packages" variable.
Packages are developed by the R user community to achieve different goals. For instance, tidyverse is a wrapper for many packages designed for data science, including dplyr, tidyr, readr, purrr, tibble, etc. Click here for more details https://www.tidyverse.org/packages/
# pacman is a package that helps you manage packages: it loads installed packages,
# or installs and then loads packages you have not installed yet
# you can use install.packages("package_name") to install necessary packages
# in this tutorial, we need the following packages:
if (!requireNamespace("pacman", quietly = TRUE)) install.packages("pacman")
library(pacman)
packages <- c("tidyverse", "ggplot2", "DescTools", "gtools", "MASS")
p_load(packages, character.only = TRUE)
LET US REVIEW SOME BASICS ABOUT R
Assign Values to Variables
In R, you can use the assignment operator <- to create new variables. You will also see people use the equals sign (=) for assignment, but for code consistency we should stick to <-.
In RStudio on Windows, you can press "Alt" and "-" simultaneously to type <-; on a Mac, the shortcut is "Option" and "-".
# One example of assigning values to variables
# let us create a variable "a" and assign a value of 5 to it.
a <- 5
a
## [1] 5
b <- 100
b
## [1] 100
c <- "hello world"
c
## [1] "hello world"
Arithmetic and Logical Operators
Arithmetic Operator | Description |
---|---|
+ | addition |
- | subtraction |
* | multiplication |
/ | division |
^ or ** | exponentiation |
Logical Operator | Description |
---|---|
> | greater than |
>= | greater than or equal to |
== | exactly equal to |
!= | not equal to |
Let us see some examples
# Note that a = 5 and b = 100
a + b
## [1] 105
a - b
## [1] -95
a * b
## [1] 500
a ^ b
## [1] 7.888609e+69
# but you cannot do this across different data types (c is a character string)
# h <- a + c
# h  # this throws "non-numeric argument to binary operator"
# let us try some logical operation
# Note that a = 5
a == 5
## [1] TRUE
a >= 5
## [1] TRUE
a != 100
## [1] TRUE
Data Types
R has a wide variety of data types including scalars, vectors (numerical, character, logical), matrices, data frames, and lists.
Types | Examples |
---|---|
scalars | a <- 1; b <- "hello world"; c <- (a == 1) |
vectors | a <- c(1, 2, 3, 4); b <- c("a", "b", "c") |
matrices | a 2 by 3 matrix: a <- matrix(c(2, 4, 3, 1, 5, 7), nrow = 2, byrow = TRUE) |
data frames | A data frame is used for storing data tables. It is a list of vectors of equal length. For example, df <- data.frame(n, s, b) combines three vectors n, s, b of the same length. |
lists | Lists contain elements of different types: numbers, strings, vectors, even another list, a matrix, or a function. A list is created with the list() function: list_example <- list("Red", "Green", c(1, 2, 3), TRUE, 520, 10000) |
Let us see some examples:
a <- c(1, 2, 3, 4)
class(a)
## [1] "numeric"
b <- c("a", "b", "c")
b
## [1] "a" "b" "c"
class(b)
## [1] "character"
# note this reassigns a from a numeric vector to a matrix
a <- matrix(c(2, 4, 3, 1, 5, 7), nrow = 2, byrow = TRUE)
a
## [,1] [,2] [,3]
## [1,] 2 4 3
## [2,] 1 5 7
class(a)
## [1] "matrix" "array"
# data frame
n <- c(2, 3, 5)
s <- c("aa", "bb", "cc")
b <- c(TRUE, FALSE, TRUE)
df <- data.frame(n, s, b)
df
## n s b
## 1 2 aa TRUE
## 2 3 bb FALSE
## 3 5 cc TRUE
class(df)
## [1] "data.frame"
# a list can mix data types
list_example <- list("Red", "Green", c(1, 2, 3), TRUE, 520, 10000)
list_example
## [[1]]
## [1] "Red"
##
## [[2]]
## [1] "Green"
##
## [[3]]
## [1] 1 2 3
##
## [[4]]
## [1] TRUE
##
## [[5]]
## [1] 520
##
## [[6]]
## [1] 10000
class(list_example)
## [1] "list"
LET US GET IT STARTED
We Are Using COVID-19 Cases in the U.S. as an Example
Let us load the COVID-19 by-state data from the NYT:
data <- read_csv(url("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv"))
Let us take a look at the data structure:
head(data)
## # A tibble: 6 x 5
## date state fips cases deaths
## <date> <chr> <chr> <dbl> <dbl>
## 1 2020-01-21 Washington 53 1 0
## 2 2020-01-22 Washington 53 1 0
## 3 2020-01-23 Washington 53 1 0
## 4 2020-01-24 Illinois 17 1 0
## 5 2020-01-24 Washington 53 1 0
## 6 2020-01-25 California 06 1 0
If you look at the data, it contains cumulative daily confirmed cases for each state. We want a dataset that keeps only the most recent observation per state:
data <- data %>%
  group_by(state) %>%
  filter(date == max(date)) %>%
  ungroup()
The data is structured by state. Let us see what the frequency distribution looks like. We can use ggplot to plot a histogram of deaths by U.S. state. The x-axis is the number of deaths, and the y-axis is the number of states.
data %>%
  ggplot() +
  geom_histogram(aes(x = deaths/1000)) +
  labs(x = "total deaths in thousands", y = "number of states")
Let us do some descriptive statistics, focusing on measures of central tendency first:
summary(data)
## date state fips cases
## Min. :2020-10-28 Length:55 Length:55 Min. : 92
## 1st Qu.:2020-10-28 Class :character Class :character 1st Qu.: 35928
## Median :2020-10-28 Mode :character Mode :character Median :108863
## Mean :2020-10-28 Mean :162426
## 3rd Qu.:2020-10-28 3rd Qu.:186764
## Max. :2020-10-28 Max. :931576
## deaths
## Min. : 2
## 1st Qu.: 594
## Median : 1875
## Mean : 4140
## 3rd Qu.: 4440
## Max. :33107
mean(data$cases)
## [1] 162426.3
Let us take a look at the dispersion of our state-level covid cases and deaths (e.g., range, sd).
sd(data$cases)
## [1] 204395
sd(data$deaths)
## [1] 6095.013
summary(data[, c("cases", "deaths")])
## cases deaths
## Min. : 92 Min. : 2
## 1st Qu.: 35928 1st Qu.: 594
## Median :108863 Median : 1875
## Mean :162426 Mean : 4140
## 3rd Qu.:186764 3rd Qu.: 4440
## Max. :931576 Max. :33107
quantile(data$deaths)
## 0% 25% 50% 75% 100%
## 2.0 594.0 1875.0 4439.5 33107.0
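Beyond the range and standard deviation, R has convenience functions for other dispersion measures. A quick sketch, assuming the same `data` object as above:

```r
# interquartile range: distance between the 25th and 75th percentiles
IQR(data$deaths)

# variance (the squared standard deviation)
var(data$deaths)

# range() returns the min and max; diff() turns that into a single spread value
diff(range(data$deaths))
```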
Now, suppose you are interested in partisan differences in covid cases or deaths.
# let us load governor database
gov <- read_csv(url("https://raw.githubusercontent.com/CivilServiceUSA/us-governors/master/us-governors/data/us-governors.csv"))
# let us merge gov data with our main data
data_gov <- data %>%
  left_join(gov, by = c("state" = "state_name")) %>%
  filter(!is.na(state_code))
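After a merge like this, it is worth checking which rows failed to match before filtering them out; territories in the NYT data may have no record in the governor file. A quick sketch using dplyr's `anti_join`:

```r
# rows in the covid data with no matching governor record
data %>%
  anti_join(gov, by = c("state" = "state_name")) %>%
  distinct(state)
```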
Let us say you are interested in a subgroup analysis. You want to see how the total cases vary across Republican and Democratic states; specifically, whether the number of total cases is over 50K.
# let us create a variable indicating whether the state has over 50k cases
data_gov <- data_gov %>%
  mutate(over_50k = ifelse(cases > 50000, 1, 0))
table(data_gov$party,data_gov$over_50k)
##
## 0 1
## democrat 8 16
## republican 6 20
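Because the two parties govern different numbers of states, row proportions are easier to compare than raw counts. A sketch using `prop.table` on the same crosstab:

```r
tab <- table(data_gov$party, data_gov$over_50k)
# margin = 1 computes proportions within each row (i.e., within each party)
round(prop.table(tab, margin = 1), 2)
```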
Descriptive Statistics - Measures of Association
Nominal Vars - lambda (\(\lambda\))
Goodman and Kruskal's lambda coefficient is a measure of association for nominal variables. Lambda ranges from 0.00 to 1.00. A lambda of 0.00 reflects no association between variables; a lambda of 1.00 is a perfect association. Lambda does not give you a direction of association: it simply suggests an association between two variables and its strength.
Lambda(x, y = NULL, direction = c("symmetric", "row", "column"), conf.level = NA, ...)
You can use ?Lambda to check the arguments of this function in R.
library(DescTools)
# we use lambda from package DescTools
# if you have not installed it, you can `install.packages("DescTools")`
# here is the link to the documentation<https://cran.rstudio.com/web/packages/DescTools/DescTools.pdf>
Lambda(x = data_gov$party, y = data_gov$over_50k)
## [1] 0.05263158
We found a pretty weak association.
Ordinal Vars - gamma(\(\gamma\))
Gamma is a measure of association for ordinal variables. Gamma ranges from -1.00 to 1.00. Again, a Gamma of 0.00 reflects no association; a Gamma of 1.00 reflects a positive perfect relationship between variables; a Gamma of -1.00 reflects a negative perfect relationship between those variables.
# Calculate Goodman and Kruskal's gamma statistic, a measure of association for ordinal factors in a two-way table.
# The function has interfaces for a table (matrix) and for single vectors.
# We need to create two ordinal variables:
# one indicating the level of total cases and another indicating the level of total deaths
# we use quantcut from gtools to split deaths and cases into deciles (10 quantiles)
# install.packages("gtools")
library(gtools)
data_gov <- data_gov %>%
  mutate(rank_death = quantcut(deaths, 10),
         rank_case = quantcut(cases, 10))
GoodmanKruskalGamma(data_gov$rank_death,data_gov$rank_case)
## [1] 0.8393881
We found a pretty strong positive relationship. It is intuitive, right? If a state has more confirmed cases, it also has more deaths.
Interval or Ratio Vars -Pearson’s r
Pearson’s r is a measure of association for continuous variables. Like Gamma, Pearson’s r ranges from -1.00 to 1.00.
cor(data_gov$cases, data_gov$deaths)
## [1] 0.7931014
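If you also want a significance test and a confidence interval for the correlation, base R's `cor.test` provides both (it defaults to Pearson's r):

```r
cor.test(data_gov$cases, data_gov$deaths)
```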
Regression Analysis
You can click here for a simple tutorial http://r-statistics.co/Linear-Regression.html
Let us say you are interested in explaining the total deaths in the U.S.
You think the total cases can help explain it, or maybe the party affiliation of the governor, or other variables. Let us run a demo here.
# first let us plot the scatterplot of two vars
scatter.smooth(x=data_gov$cases, y=data_gov$deaths, main="Deaths ~ Cases") # scatterplot
It seems there is a linear relationship. Let us fit a model.
Now that we have seen the linear relationship pictorially in the scatter plot and by computing the correlation, let's see the syntax for building the linear model. The function used for building linear models is lm(). The lm() function takes two main arguments: 1. Formula 2. Data. The data is typically a data.frame, and the formula is an object of class formula. The most common convention is to write the formula directly in place of the argument, as below.
linearMod <- lm(deaths ~ cases, data=data_gov) # build linear regression model on full data
summary(linearMod)
##
## Call:
## lm(formula = deaths ~ cases, data = data_gov)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4730.3 -1286.8 -584.3 -102.6 20766.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.122e+02 7.179e+02 0.435 0.666
## cases 2.380e-02 2.638e-03 9.021 6.6e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3857 on 48 degrees of freedom
## Multiple R-squared: 0.629, Adjusted R-squared: 0.6213
## F-statistic: 81.38 on 1 and 48 DF, p-value: 6.598e-12
This suggests that you can use the fitted equation to predict deaths: \[\hat{y} = 312.2 + 0.0238 \times \text{cases}\] The relationship is also statistically significant at the .001 level.
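Rather than plugging numbers into the equation by hand, you can let R compute fitted values with `predict`. A sketch for a hypothetical state with 100,000 cases:

```r
# predicted deaths for a hypothetical state with 100,000 cases
new_state <- data.frame(cases = 100000)
predict(linearMod, newdata = new_state)

# interval = "confidence" adds a confidence interval around the prediction
predict(linearMod, newdata = new_state, interval = "confidence")
```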
You can also add more independent variables. For instance:
linearMod2 <- lm(deaths ~ cases+party, data=data_gov) # build linear regression model on full data
summary(linearMod2)
##
## Call:
## lm(formula = deaths ~ cases + party, data = data_gov)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5758.0 -1428.4 -697.5 533.0 19783.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.240e+03 8.909e+02 1.392 0.1706
## cases 2.391e-02 2.588e-03 9.236 3.9e-12 ***
## partyrepublican -1.821e+03 1.071e+03 -1.700 0.0958 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3783 on 47 degrees of freedom
## Multiple R-squared: 0.6505, Adjusted R-squared: 0.6356
## F-statistic: 43.74 on 2 and 47 DF, p-value: 1.868e-11
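To check whether adding party actually improves the fit, you can compare the two nested models with an F-test via `anova`:

```r
# F-test comparing the model with and without the party variable
anova(linearMod, linearMod2)
```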
Let Us Move to Inferential Statistics
Chi-square
library(MASS) # load the MASS package
tbl <- table(data_gov$rank_case, data_gov$rank_death) # the contingency table
chisq.test(tbl)
##
## Pearson's Chi-squared test
##
## data: tbl
## X-squared = 134, df = 81, p-value = 0.0001952
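Note that with a 10 x 10 table of deciles many cells are sparse, so the chi-squared approximation can be unreliable (R typically warns about this). One common workaround is a simulated p-value:

```r
# Monte Carlo simulation instead of the asymptotic chi-squared approximation
chisq.test(tbl, simulate.p.value = TRUE, B = 10000)
```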
T-Test
Let us say you want to test whether mean total deaths differ between states with male and female governors.
# pull() extracts the column as a plain vector, which is what t.test() expects
x <- data_gov %>% filter(gender == "male") %>% pull(deaths)
y <- data_gov %>% filter(gender == "female") %>% pull(deaths)
t.test(x,y)
##
## Welch Two Sample t-test
##
## data: x and y
## t = 2.5114, df = 38.662, p-value = 0.01632
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 634.8327 5897.4708
## sample estimates:
## mean of x mean of y
## 5110.707 1844.556
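As a sanity check on the "sample estimates" line above, you can compute the same group means directly with dplyr:

```r
data_gov %>%
  group_by(gender) %>%
  summarise(mean_deaths = mean(deaths), n_states = n())
```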
Let us plot the group mean difference
# install.packages("ggpubr")
library("ggpubr")
# dividing by 1000 gives deaths in thousands, so name the variable accordingly
data_gov <- data_gov %>%
  mutate(deaths_k = deaths/1000)
ggboxplot(data_gov, x = "gender", y = "deaths_k",
          color = "gender", palette = c("#00AFBB", "#E7B800"),
          ylab = "Deaths in thousands", xlab = "Groups")
The End