Our Goal

This tutorial helps students set up their computers for basic web scraping. It covers the basic steps to install R, RStudio, and the packages we need.

Install R and RStudio

Let us start from scratch. We use RStudio in this tutorial. RStudio is the most popular R code editor.

Before installing RStudio, you need to install R. You can download R and other necessary software from https://cran.r-project.org/. Choose the version that matches your system (e.g., Windows or Mac). Follow the installation instructions carefully, especially those about additional required software such as XQuartz if you are a Mac user.

After this step, go to the RStudio website to download and install RStudio Desktop: https://rstudio.com/products/rstudio/download/. Choose the free version.

Then you can use the install.packages() function to install the R packages you need. I suggest copying and pasting the code below to install some common packages for data processing and visualization. You can add any other packages you want to install by editing the “packages” variable.

Packages are developed by the R user community to achieve different goals. For instance, tidyverse is a collection of packages designed for data science, including dplyr, tidyr, readr, purrr, tibble, etc. For more details, see https://www.tidyverse.org/packages/

# pacman is a package that helps you manage packages: it loads installed packages,
# and installs and then loads packages that you have not installed yet
# you can also use install.packages("package_name") to install a package yourself
# in this tutorial, we need the following packages:
if (!requireNamespace("pacman", quietly = TRUE))
  install.packages("pacman")
library(pacman)
packages <- c("tidyverse", "tidytext", "rvest", "httr", "RSelenium", "stm", "tm",
              "ggplot2", "here", "twitteR")
p_load(packages, character.only = TRUE)
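
If you would rather not depend on pacman, roughly the same check can be done with base R alone. The sketch below is one possible version. The helper names missing_packages() and install_if_missing() are made up for this illustration and are not part of any package.

```r
# Return the names in `pkgs` that are not installed on this machine.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Install only what is missing, then return those names invisibly.
install_if_missing <- function(pkgs) {
  to_install <- missing_packages(pkgs)
  if (length(to_install) > 0) {
    install.packages(to_install)
  }
  invisible(to_install)
}

# "stats" ships with R; the second name is deliberately fake
missing_packages(c("stats", "aPackageThatDoesNotExist"))

## [1] "aPackageThatDoesNotExist"
```

Calling install_if_missing(packages) with the “packages” vector above would install whatever is missing. You would still need to call library() on each package afterwards.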

Some Basics about R

Assign Values to Variables

In R, you can use the assignment operator <- to create new variables. Some people use the equal sign (“=”) instead, but for consistency we stick to <- in this tutorial.

In RStudio on Windows, you can press “Alt” and “-” together to type <-. On a Mac, press “Option” and “-”.

# One example of assigning values to variables

# let us create a variable "a" and assign a value of 5 to it.
a <- 5
a

## [1] 5

# let us create a variable b and assign a value 100 to it

b <- 100
b

## [1] 100

# you can also assign a string

c <- "hello world"
c

## [1] "hello world"

Arithmetic and Logical Operators

Arithmetic Operator	Description
+	addition
-	subtraction
*	multiplication
/	division
^ or **	exponentiation

Logical Operator	Description
<	less than
<=	less than or equal to
>	greater than
>=	greater than or equal to
==	exactly equal to
!=	not equal to

Let us see some examples:

# we can do some arithmetic in R
# let us create a new variable d that stores the sum of a and b
d <- a + b
d

## [1] 105

# let us do other arithmetic operations
e <- a - b
e

## [1] -95

f <- a * b
f

## [1] 500

g <- a ** b
g

## [1] 7.888609e+69

# but you cannot do arithmetic on different data types
# a is a number and c is a string, so the line below throws an error
# h <- a + c
# h

# let us try some logical operations
# Note that a is 5
a == 5

## [1] TRUE

a <= 100

## [1] TRUE

a != 100

## [1] TRUE

# as you can see, each comparison returns TRUE or FALSE
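
To make the point about mixing data types concrete, here is a small sketch. It catches the error instead of stopping the script, converts a string to a number, and combines comparisons with & (and), | (or), and ! (not). The values are chosen only for illustration.

```r
a <- 5
c <- "hello"

# Adding a number to a string raises an error; tryCatch() captures
# the error message instead of stopping the script.
msg <- tryCatch(a + c, error = function(e) conditionMessage(e))
msg

# A string that looks like a number can be converted first.
total <- a + as.numeric("100")
total

## [1] 105

# Comparisons can be combined with & (and), | (or), and ! (not).
in_range <- a > 1 & a < 10
in_range

## [1] TRUE

either <- a < 0 | a == 5
either

## [1] TRUE

!in_range

## [1] FALSE
```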

Data Types

R has a wide variety of data types including scalars, vectors (numerical, character, logical), matrices, data frames, and lists.

Types	Examples
scalars	a <- 1; b <- "hello world"; c <- (a == 1)
vectors	a <- c(1, 2, 3, 4); b <- c("a", "b", "c")
matrices	A 2-by-3 matrix built from six data elements and filled by rows: a <- matrix(c(2, 4, 3, 1, 5, 7), nrow = 2, ncol = 3, byrow = TRUE)
data frames	A data frame is used for storing data tables. It is a list of vectors of equal length. For example, df is a data frame containing three vectors n, s, and b: n <- c(2, 3, 5); s <- c("aa", "bb", "cc"); b <- c(TRUE, FALSE, TRUE); df <- data.frame(n, s, b)
lists	A list can contain elements of different types, such as numbers, strings, vectors, and other lists. A list can also contain a matrix or a function as an element. A list is created with the list() function: list_example <- list("Red", "Green", c(1, 2, 3), TRUE, 520, 10000)

Let us see some examples:

# scalars
a <-  2
class(a)

## [1] "numeric"

# vectors
b <- c("a","b","c")
b

## [1] "a" "b" "c"

class(b)

## [1] "character"

# matrix
c <- matrix(c(2, 4, 3, 1, 5, 7),nrow=2, ncol=3,byrow = TRUE)
c

##      [,1] [,2] [,3]
## [1,]    2    4    3
## [2,]    1    5    7

class(c)

## [1] "matrix" "array"

# data frame
n <- c(2, 3, 5)
s <- c("aa", "bb", "cc")
b <- c(TRUE, FALSE, TRUE)
df <- data.frame(n, s, b)
df

##   n  s     b
## 1 2 aa  TRUE
## 2 3 bb FALSE
## 3 5 cc  TRUE

class(df)

## [1] "data.frame"

# lists
list_example<- list("Red", "Green", c(1,2,3), TRUE, 520, 10000)
list_example

## [[1]]
## [1] "Red"
## 
## [[2]]
## [1] "Green"
## 
## [[3]]
## [1] 1 2 3
## 
## [[4]]
## [1] TRUE
## 
## [[5]]
## [1] 520
## 
## [[6]]
## [1] 10000

class(list_example)

## [1] "list"
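
Once data is stored in one of these types, you usually want to pull pieces back out. The sketch below rebuilds the same objects (the matrix is named m here instead of c, only to keep the example self-contained) and shows the most common ways to access elements. Note that R starts counting at 1.

```r
b <- c("a", "b", "c")
m <- matrix(c(2, 4, 3, 1, 5, 7), nrow = 2, ncol = 3, byrow = TRUE)
df <- data.frame(n = c(2, 3, 5),
                 s = c("aa", "bb", "cc"),
                 b = c(TRUE, FALSE, TRUE))
list_example <- list("Red", "Green", c(1, 2, 3), TRUE, 520, 10000)

b[2]                  # second element of a vector: "b"
m[2, 3]               # row 2, column 3 of a matrix: 7
df$s                  # one column of a data frame
df[df$b, ]            # only the rows where column b is TRUE
list_example[[3]]     # third element of a list: the vector 1 2 3
list_example[[3]][2]  # second value inside that vector: 2
```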

Functions

In R, you can define your own functions to package code that you want to reuse.

# let us define a function that prints out the text you pass in

print_text <- function(x){
  print(x)
}

x <- "Hello World, R."
print_text(x)

## [1] "Hello World, R."
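
Functions become more useful when they take arguments and return a value you can store. Here is a minimal sketch. count_words() is a made-up helper for this tutorial: it splits a text on whitespace and returns the number of pieces. In R, the value of the last expression in a function is what the function returns.

```r
# Count the words in a single text.
# `pattern` has a default value, so you can leave it out when calling.
count_words <- function(text, pattern = "\\s+") {
  words <- strsplit(trimws(text), pattern)[[1]]
  length(words)
}

count_words("this is a text")

## [1] 4

# the returned value can be stored and reused
n_words <- count_words("web scraping in R")
n_words * 2

## [1] 8
```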

Loops

A for loop applies the same function calls to each element of a collection of objects. R has several kinds of loops (for, while, and repeat). In R, vectorized functions or the apply family are often preferred to an explicit for loop, but in this tutorial we will use a for loop as an example.

# let us say you have a vector of texts that you want to print out.
texts <- c("this is a text","this is a text","this is a text","this is a text",
           "this is a text","this is a text","this is a text","this is a text",
           "this is a text","this is a text","this is a text","this is a text")
# now you want to use the print_text function to print out the content.
# one way is to manually pass each text,
# like print_text(texts[1]), but that is tedious.
# a for loop prints out all the texts easily

for(text in texts){
  print_text(text)
}

## [1] "this is a text"
## [1] "this is a text"
## [1] "this is a text"
## [1] "this is a text"
## [1] "this is a text"
## [1] "this is a text"
## [1] "this is a text"
## [1] "this is a text"
## [1] "this is a text"
## [1] "this is a text"
## [1] "this is a text"
## [1] "this is a text"
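
Since a for loop is often not the preferred tool in R, it may help to see the alternatives side by side. The sketch below counts the characters in each text three ways: with a for loop, with vapply(), and with a vectorized call. The example texts are made up.

```r
texts <- c("first text", "a second, longer text", "third")

# 1. A for loop over positions, storing one result per text.
lengths_loop <- integer(length(texts))
for (i in seq_along(texts)) {
  lengths_loop[i] <- nchar(texts[i])
}

# 2. vapply() applies the function to every element and checks
#    that each result is a single integer.
lengths_vapply <- vapply(texts, nchar, integer(1), USE.NAMES = FALSE)

# 3. nchar() is vectorized, so it accepts the whole vector at once.
lengths_vec <- nchar(texts)

# all three approaches give the same answer
identical(lengths_loop, lengths_vapply)

## [1] TRUE

identical(lengths_loop, lengths_vec)

## [1] TRUE
```

A for loop is still a reasonable choice when each step has a side effect, such as downloading a file, which is how we use it below.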

Now we have a basic idea of what R can do for us. Let us move on to web scraping. You will see how to store data as variables, data frames, or vectors, and how to write a loop to automate tedious downloads. I hope you have some fun.

Scraping a Static Webpage

Suppose you are interested in studying shareholder activism and want to collect information from the SEC on how managers deal with shareholder proposals.

You might want to collect all the no-action letter PDFs from the SEC website: https://www.sec.gov/corpfin/2019-2020-shareholder-proposals-no-action. This page lists all of those PDFs. What we need to do is write a few lines of code to automate the whole scraping process. This could be done in R or Python; here we use R.

library(rvest)
library(httr)
library(tidyverse) # provides %>%, filter(), str_detect(), transmute(), etc.

# Specify the url you want to scrape, let us just focus on the year 2020
url <- "https://www.sec.gov/corpfin/2019-2020-shareholder-proposals-no-action#alpha"

# an outline of the steps
# 1. we need to establish a connection with the web page
# 2. read the html web page
# 3. parse the html page and get the data we want (pdf links)
# 4. write a loop to download all pdfs

# rvest provides two ways of making a request: read_html() and html_session()
# read_html() parses an HTML file or a url into an xml document.
# html_session() is built on GET() from the httr package
# (in rvest 1.0.0 and later it is called session()).

# make a GET request and parse the web page into an xml document
page_source <- read_html(url)

# we need to access the nodes that hold the pdf links
# if you look at those pdf links, they are stored in
# tags like this <a href="https://www.sec.gov/divisions/corpfin/cf-noaction/14a-8/2020/ctwinvestmentalphabet030620-14a8.pdf"> apple inc </a>
pdfs <- page_source %>% 
  # get all the "a" tags
  html_nodes("a") %>%
  # get their href attributes
  html_attr("href") %>%
  # turn the character vector into a one-column tibble (the column is named value)
  enframe(name = NULL) %>% 
  # only keep the pdf links
  filter(str_detect(value, "/divisions/corpfin/cf-noaction/14a-8")) %>% 
  transmute(id = row_number(),
            pdfs = paste0("https://www.sec.gov", value))
# you can use html_text() to get the text inside each "a" tag
# let us print out the first 10 rows
knitr::kable(pdfs[1:10,])

id	pdfs
1	https://www.sec.gov/divisions/corpfin/cf-noaction/14a-8/2020/vpicabbott021220-14a8.pdf
2	https://www.sec.gov/divisions/corpfin/cf-noaction/14a-8/2020/cheveddenabbott020520-14a8.pdf
3	https://www.sec.gov/divisions/corpfin/cf-noaction/14a-8/2020/cheveddenabbottrecon022720-14a8.pdf
4	https://www.sec.gov/divisions/corpfin/cf-noaction/14a-8/2020/steinerabbott012920-14a8.pdf
5	https://www.sec.gov/divisions/corpfin/cf-noaction/14a-8/2020/riersvpic012920-14a8.pdf
6	https://www.sec.gov/divisions/corpfin/cf-noaction/14a-8/2020/procapaddus032720-14a8.pdf
7	https://www.sec.gov/divisions/corpfin/cf-noaction/14a-8/2020/ritcheralcoa100920-14a8.pdf
8	https://www.sec.gov/divisions/corpfin/cf-noaction/14a-8/2020/steinercheveddenallstate010920-14a8.pdf
9	https://www.sec.gov/divisions/corpfin/cf-noaction/14a-8/2020/ncppralphabet040920-14a8.pdf
10	https://www.sec.gov/divisions/corpfin/cf-noaction/14a-8/2020/ncppralphabetrecon041520-14a8.pdf
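
If you want to try the parsing steps without touching the real website, you can practice on a small HTML string first. The sketch below uses a made-up page with invented file names. Only the rvest functions (read_html(), html_nodes(), html_attr(), html_text()) are the same ones used above.

```r
library(rvest)

# A made-up page standing in for the real one (file names are invented).
html <- '<html><body>
  <a href="/divisions/corpfin/cf-noaction/14a-8/2020/example-one-14a8.pdf">Example One</a>
  <a href="/about/contact">Contact</a>
  <a href="/divisions/corpfin/cf-noaction/14a-8/2020/example-two-14a8.pdf">Example Two</a>
</body></html>'

page <- read_html(html)          # read_html() also accepts a literal HTML string
links <- html_nodes(page, "a")   # all "a" tags

hrefs <- html_attr(links, "href")  # where each link points
labels <- html_text(links)         # the visible text of each link

# keep only the links that match the pdf folder pattern
is_pdf <- grepl("/divisions/corpfin/cf-noaction/14a-8", hrefs, fixed = TRUE)
toy_pdfs <- data.frame(
  label = labels[is_pdf],
  pdfs = paste0("https://www.sec.gov", hrefs[is_pdf])
)
toy_pdfs
```

The "Contact" link is dropped because its href does not match the pattern, which is the same filtering idea used on the real page.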

Let us write a loop to download the PDFs (here, the first 10):

# in R, you can use download.file() to download files
# let us specify the target folder first
# setwd("YOUR FOLDER")
for (pdf in pdfs$pdfs[1:10]) {
  # keep only the file name, i.e., everything after the last "/"
  file_name <- str_replace(pdf, "^.*\\/", "")
  download.file(pdf, file_name, mode = "wb")
}
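
The loop above stops at the first failed download and downloads everything again each time you run it. A slightly more careful version might look like the sketch below. download_files() is a hypothetical helper written for this tutorial, and the delay and headers arguments are my own additions. The headers argument of download.file() needs R 3.6.0 or later. Some sites expect automated requests to identify themselves with a User-Agent header, so check the site's own guidance.

```r
# Download each url into dest_dir, skipping files that already exist,
# pausing between requests, and recording failures instead of stopping.
download_files <- function(urls, dest_dir, delay = 1, headers = NULL) {
  dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)
  files <- file.path(dest_dir, basename(urls))
  ok <- logical(length(urls))
  for (i in seq_along(urls)) {
    if (file.exists(files[i])) {
      ok[i] <- TRUE   # already downloaded, nothing to do
      next
    }
    ok[i] <- tryCatch(
      # download.file() returns 0 on success
      download.file(urls[i], files[i], mode = "wb",
                    quiet = TRUE, headers = headers) == 0,
      error = function(e) FALSE
    )
    Sys.sleep(delay)  # be polite: wait before the next request
  }
  data.frame(url = urls, file = files, ok = ok)
}

# Example call (not run here because it needs a network connection):
# results <- download_files(pdfs$pdfs[1:10], "sec_pdfs",
#                           headers = c("User-Agent" = "Your Name your.email@example.com"))
# results[!results$ok, ]   # the downloads that failed
```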

More Advanced Scraping

Sometimes a website is dynamic, and you cannot use rvest to collect the data. In that case you can try RSelenium. It mimics browser behavior: it controls your browser so that it automatically visits the website and collects data the way a human would.

You can read this tutorial for more details: https://ropensci.org/tutorials/rselenium_tutorial/

Using Twitter API to Scrape Tweets

First, you need to apply for a developer account to get the keys and secrets used to access Twitter data.

For more details, see https://developer.twitter.com/en/apply-for-access

Once you have this information, you can write code to access Twitter.

# We need the twitteR package to access and query data from Twitter
library(twitteR)
consumer_key <- "your_consumer_key"
consumer_secret <- "your_consumer_secret"
access_key <- "your_access_token"
access_secret <- "your_access_secret"
source("twitter.R") # I store my own authentication info in a local R file
setup_twitter_oauth(consumer_key, consumer_secret, access_key, access_secret)

## [1] "Using direct authentication"
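
Typing keys directly into a script makes it easy to share them by accident. One common alternative is to keep them in environment variables, for example in your .Renviron file, which R reads at startup. The sketch below shows the idea. The variable names (TWITTER_CONSUMER_KEY and so on) and the helper read_twitter_keys() are my own choices for illustration, not something twitteR requires.

```r
# Read the four credentials from environment variables and
# stop with a clear message if any of them is missing.
read_twitter_keys <- function() {
  vars <- c(consumer_key = "TWITTER_CONSUMER_KEY",
            consumer_secret = "TWITTER_CONSUMER_SECRET",
            access_token = "TWITTER_ACCESS_TOKEN",
            access_secret = "TWITTER_ACCESS_SECRET")
  keys <- Sys.getenv(vars, unset = NA)
  names(keys) <- names(vars)
  if (anyNA(keys)) {
    stop("Missing environment variables: ",
         paste(vars[is.na(keys)], collapse = ", "))
  }
  as.list(keys)
}

# For this demonstration only, set placeholder values in the session.
# In practice, put the real values in .Renviron instead.
Sys.setenv(TWITTER_CONSUMER_KEY = "demo_key",
           TWITTER_CONSUMER_SECRET = "demo_secret",
           TWITTER_ACCESS_TOKEN = "demo_token",
           TWITTER_ACCESS_SECRET = "demo_access_secret")

keys <- read_twitter_keys()
names(keys)

# The values can then be passed to the function used above:
# setup_twitter_oauth(keys$consumer_key, keys$consumer_secret,
#                     keys$access_token, keys$access_secret)
```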

Let us say you are interested in the covid19 hashtag.

# let us use the searchTwitter function to search for covid19
# here is the documentation of this function <https://www.rdocumentation.org/packages/twitteR/versions/1.1.9/topics/searchTwitter>
# we store the data as covid_twitter
covid_twitter <- searchTwitter("covid19", n = 1000)
# we convert it to a data frame using twListToDF
covid_twitter_df <- twListToDF(covid_twitter)
# We then tokenize all the words using the tidytext function unnest_tokens
tweet_words <- covid_twitter_df %>% select(id, text) %>% unnest_tokens(word, text)
# We then plot the most frequent words
tweet_words %>% 
  count(word, sort = TRUE) %>% # we count the frequency of each word
  slice(1:20) %>% # let us focus on the top 20 words
  ggplot(aes(x = reorder(word, 
    n, function(n) -n), y = n)) + 
  geom_bar(stat = "identity") + 
  theme(axis.text.x = element_text(angle = 60, 
    hjust = 1)) + xlab("")

The plot is messy. Let us get rid of some words.

# Create a list of stop words: a list of words that are not worth including
# stop words are uninformative words, such as "a", "an", "the", etc.
my_stop_words <- stop_words %>% 
  select(-lexicon) %>% 
  bind_rows(data.frame(word = c("https", "t.co", "rt", "amp","covid19","de","la","el","en","2","le","1")))

tweet_words_interesting <- tweet_words %>% 
  anti_join(my_stop_words, by = "word")

tweet_words_interesting %>% 
  group_by(word) %>% 
  tally(sort=TRUE) %>% 
  slice(1:25) %>% 
  ggplot(aes(x = reorder(word, 
    n, function(n) -n), y = n)) + 
  geom_bar(stat = "identity") + 
  theme(axis.text.x = element_text(angle = 60, 
    hjust = 1)) + xlab("")
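
If you do not have Twitter access, you can still practice the tokenizing and stop word steps on a few made-up sentences. The sketch below is only an illustration. The texts are invented, not real tweets, and the stop word list is a tiny custom one so that the result is easy to check by eye.

```r
library(dplyr)
library(tidytext)

# Made-up example texts standing in for tweets.
toy_tweets <- data.frame(
  id = 1:3,
  text = c("Stay safe and wash your hands",
           "Wash hands often!",
           "Vaccines are safe"),
  stringsAsFactors = FALSE
)

# unnest_tokens() splits each text into one lowercase word per row
# and drops punctuation.
toy_words <- toy_tweets %>%
  unnest_tokens(word, text)

# A tiny custom stop word list, just for this example.
toy_stop_words <- data.frame(word = c("and", "your", "are"),
                             stringsAsFactors = FALSE)

# anti_join() keeps only the words that are NOT in the stop word list.
toy_counts <- toy_words %>%
  anti_join(toy_stop_words, by = "word") %>%
  count(word, sort = TRUE)
toy_counts
```

Here "wash", "hands", and "safe" each appear twice, and the three stop words are gone. The same pattern scales to the real tweet data above.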

The End

Basic Web Scraping in Sociology