Lab 3: Web Scraping and API 101
Learning Objectives
This tutorial introduces some basic ways to collect data via web scraping and APIs using R or Python.
- How to do basic web scraping with the httr, rvest, and RSelenium R packages. For the Selenium driver, we will use Python as an example; you are asked to rewrite the code in R. The target website is elephrame.com, and the main content is related to the Black Lives Matter movement.
- How to read, clean, and transform data to get basic statistics.
In the first part of the Python tutorial, we cover some basics of how to read and save files in Python and how to use Selenium to do web scraping. For more details on web scraping, you can check out this book: https://link.springer.com/content/pdf/10.1007/978-1-4842-3582-9.pdf.
Lab 3 Part 1. Basic Intro to Python and Webscraping
We will focus more on R this semester, but we also briefly introduce some Python basics. You should know how to use Python to process data and do machine learning. As with R, there are several ways to run Python: Jupyter Notebook, Google Colab, the terminal, Spyder, VS Code, etc. Here we use RStudio to access Python.
Basic Intro to Python
We will use Python to collect data but R to tidy it. In last week's tutorial, we covered how to install Python/Spyder, so you should be Python-ready. You can open our tutorial using Jupyter Notebook or Google Colab.
If your system does not have Jupyter Notebook, please check https://jupyter.org/install for installation instructions.
If your system has pip, you can simply run the following commands in your terminal (remove the leading #):
#pip3 install notebook
#jupyter notebook
If your system does not have pip, you can check here for more details: https://pip.pypa.io/en/stable/installing/
Similarly, you can install modules like pandas, numpy, selenium, bs4 (BeautifulSoup), tensorflow, pytorch, keras, etc.
We are not going to cover those Python Basics today. You can read NLP with Python Chapters1-3 for more details https://www.nltk.org/book/.
Today, we only introduce some basics related to webscraping.
1. Play with your working directory and then open and save files
# get current working directory
import os
path = os.getcwd()
print(path)

# create a new folder for our course
new_path = "./soc591/"
os.makedirs(new_path)

# change the current working directory to soc591
os.chdir(new_path)
# check the current wd
print("Current wd: ", os.getcwd())

# let us create a new file in current WD, write some texts into the file, and then close it
f = open("soc591.txt", mode="w+")
col_vars = "id;text\n"
f.write(col_vars)
f.write("1;This is a demo for writing some texts\n")
f.close()

# Let us read the soc591.txt file and assign it to variable text_df
text_df = open("soc591.txt", "r").read()
print(text_df)

# list file content
os.listdir(".")

# Let us remove the soc591.txt file
os.remove("soc591.txt")
Basic info on webpages
Web scraping automates the way we gather data from websites. It is more efficient, but it imposes a burden on servers. That is why many websites deploy anti-robot measures to prevent automated data gathering. You should always check robots.txt on the target website to see whether it allows you to scrape.
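For example, here is a minimal R sketch, assuming the httr package is installed, that simply fetches and prints a site's robots.txt so you can read its rules (the dedicated robotstxt package can automate this check if you prefer):

# A minimal sketch: fetch and print a site's robots.txt with httr
# (elephrame.com is used here because it is our target site later in this lab)
library(httr)

resp <- GET("https://elephrame.com/robots.txt")
cat(content(resp, as = "text", encoding = "UTF-8"))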
Most of the time, we can scrape government websites because they disclose massive amounts of data to the public, such as the FEC and SEC websites.
Some of these websites are straightforward: they are static, so you can go to their webpages and scrape their content easily.
Here is an example of static html page
<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>
<h1>This is a Heading</h1>
<p>This is a paragraph.</p>
</body>
</html>
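To make this concrete, here is a small R sketch, assuming rvest is installed, that parses this exact snippet from a string and pulls out the heading and the paragraph:

# A small sketch: parse the static HTML snippet above with rvest
library(rvest)

page <- read_html("<html><head><title>Page Title</title></head>
                   <body><h1>This is a Heading</h1><p>This is a paragraph.</p></body></html>")

page %>% html_element("h1") %>% html_text()  # "This is a Heading"
page %>% html_element("p") %>% html_text()   # "This is a paragraph."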
But sometimes we have dynamic and interactive websites built on JavaScript, PHP, etc. For instance, on many data visualization websites you have to click on something before the website returns results.
One solution to this is to use Selenium to simulate browser behavior.
Scraping Static Webpages
import pandas as pd
from bs4 import BeautifulSoup as bs
import urllib.parse
import urllib.request

def get_crp_industry_list(url):
    ''' Access Opensecrets.org website and return industry names and ids.
    '''
    # Specify userheader
    userHeader = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/600.7.12 (KHTML, like Gecko) Version/8.0.7 Safari/600.7.12"}
    req = urllib.request.Request(url, headers=userHeader)
    # open url and read web page
    response = urllib.request.urlopen(req)
    the_page = response.read()
    # beautifulsoup parse html
    soup = bs(the_page, "html.parser")
    #print(soup)
    # get all industry links and names
    indList = soup.find("div", {"id": "rightColumn"}).find_all("a")
    #print(indList)
    # clean raw data
    indLinks = []
    indNames = []
    for link in indList[1:]:
        indLinks.append(link['href'].replace("indus.php?ind=", ""))
        indNames.append(link.contents[0].strip())
    #print(indLinks, indNames)
    # create a dataset
    indDF = pd.DataFrame({"indLinks": indLinks, "indNames": indNames})
    return indDF

url = "https://www.opensecrets.org/industries/slist.php"
data_ind_list = get_crp_industry_list(url)
print(data_ind_list)
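Since you are asked to rewrite the Python code in R, here is a hedged rvest sketch of the same idea. It assumes the OpenSecrets page still keeps its industry links inside a div with id rightColumn, as in the Python code above; adjust the selector if the page layout has changed.

# A hedged R sketch of get_crp_industry_list() using rvest
# (assumes the page still has <div id="rightColumn"> with <a> links; adjust if not)
library(rvest)
library(tibble)
library(stringr)

get_crp_industry_list_r <- function(url) {
  page <- read_html(url)
  links <- page %>% html_element("#rightColumn") %>% html_elements("a")
  tibble(
    indLinks = links %>% html_attr("href") %>% str_replace("indus.php\\?ind=", ""),
    indNames = links %>% html_text() %>% str_trim()
  )
}

data_ind_list_r <- get_crp_industry_list_r("https://www.opensecrets.org/industries/slist.php")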
Using Selenium to Scrape Dynamic and Interactive Websites
So what is Selenium? Selenium is an open-source web automation tool widely used for testing in industry, but it can also be used for web scraping, or “crawling/spidering”.
Selenium can control your web browser and automate your browsing behavior. We will use the Google Chrome driver, but you can also use other drivers such as Firefox, IE, etc.
#!pip3 install selenium
#pip3 install webdriver_manager
#pip3 install bs4
The Goal here is to scrape Black Lives Matter Data from
https://elephrame.com/textbook/BLM/chart
This website tracks the occurrence of BLM protests. We cannot use the previous approach to get the data directly because the page is dynamic: you have to scroll down or click to the next page to load more data.
Let us build our crawler from scratch…
- We need to install webdriver to control our browser
- We need to use our webdriver to control the browser and establish a connection with the target website (sometimes you have to handle log-in steps, send a password, etc.)
- We need to check the target webpage to locate the info we need
- Scrape the target info, open a file on local computer or your server, and save that info.
- We click the next page and repeat the scraping process until the end
- Close your webdriver
- We encapsulate the whole process (define a function or class to automate the whole process)
# Import modules for use
import os
import selenium
from selenium import webdriver
import time
import requests
from webdriver_manager.chrome import ChromeDriverManager
from selenium.common.exceptions import ElementClickInterceptedException
from bs4 import BeautifulSoup as bs

# Install Driver
driver = webdriver.Chrome(ChromeDriverManager().install())

# Open the url and establish a connection
url = "https://elephrame.com/textbook/BLM/chart"
driver.implicitly_wait(5)
driver.maximize_window()
driver.get(url)
# Scroll down to the bottom of the page
#driver.execute_script("window.scrollTo(0,window.scrollY+300)")
driver.execute_script("window.scrollTo(0,document.body.scrollHeight)")
# Read and parse the first page
first_page = driver.page_source
soup = bs(first_page, "html.parser")

# Use google developer inspect to check the source code
# and locate the key info we need
# it is stored at div class = "item chart"
items = soup.findAll("div", {"class": "item chart"})
print(items)
# find necessary elements, including id, item-protest-location, protest-start, protest-end, item-protest-subject,
# item-protest-participants (li), item-protest-time, item-protest-description, item-protest-url
# note: we name the first variable protest_id (not id) so the tsv-writing block below can reuse it
import re

for item in items:
    try:
        protest_id = re.findall(r'id="([0-9].*?)"', str(item))[0]
        print(protest_id)
    except:
        protest_id = ""
    try:
        protest_location = ' '.join(item.find("div", {"class": "item-protest-location"}).text.split())
        print(protest_location)
    except:
        protest_location = ""
    try:
        protest_start = ' '.join(item.find("div", {"class": "protest-start"}).text.split())
        print(protest_start)
    except:
        protest_start = ""
    try:
        protest_end = ' '.join(item.find("div", {"class": "protest-end"}).text.split())
        print(protest_end)
    except:
        protest_end = ""
    try:
        protest_subject = ' '.join(item.find("li", {"class": "item-protest-subject"}).text.split())
        print(protest_subject)
    except:
        protest_subject = ""
    try:
        protest_participants = ' '.join(item.find("li", {"class": "item-protest-participants"}).text.split())
        print(protest_participants)
    except:
        protest_participants = ""
    try:
        protest_time = ' '.join(item.find("li", {"class": "item-protest-time"}).text.split())
        print(protest_time)
    except:
        protest_time = ""
    try:
        protest_description = ' '.join(item.find("li", {"class": "item-protest-description"}).text.split())
        print(protest_description)
    except:
        protest_description = ""
    try:
        protest_urls = '##'.join(item.find("li", {"class": "item-protest-url"}).text.split())
        print(protest_urls, "\n")
    except:
        protest_urls = ""
# save the last item content into a tsv file for check
# check current dir
os.getcwd()
#os.chdir()
import csv

with open('blm-data.tsv', 'w+') as f:
    tsv_writer = csv.writer(f, delimiter='\t')
    # write column names
    var_names = ["protest_id", "protest_location", "protest_start", "protest_end", "protest_subject",
                 "protest_participants", "protest_time", "protest_description", "protest_urls"]
    tsv_writer.writerow(var_names)
    # write actual data
    data = [protest_id, protest_location, protest_start, protest_end, protest_subject,
            protest_participants, protest_time, protest_description, protest_urls]
    tsv_writer.writerow(data)
# click the next page
# you can check here for more info on how selenium locates elements:
# https://selenium-python.readthedocs.io/locating-elements.html
import time
from selenium.webdriver.common.by import By

next_page = driver.find_element(By.XPATH, '//*[@id="blm-results"]/div[3]/ul/li[4]')
next_page.click()
time.sleep(5)
# then we repeat the process to the end
# Because we have 229 pages, we need a loop to automate the process
soup = bs(driver.page_source, "html.parser")
# locate the page id
page_id = soup.find("input", {"class": "page-choice"})["value"]
page_id = int(page_id)
print(page_id)

'''
while page_id <= 336:
    # do the first-page scraping
    # click next page
    # repeat the scraping
    # if page_id > 336, then stop
'''
We can encapsulate the whole process into one single file https://yongjunzhang.com/files/scrape-blm.py
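If you want a head start on the R rewrite, here is a hedged RSelenium sketch of that page-by-page loop. The XPath locator and the page count are borrowed from the Python code above and may need adjusting if the site changes.

# A hedged RSelenium sketch of the paging loop (locators borrowed from the Python code above)
library(RSelenium)
library(rvest)

driver <- rsDriver(browser = "chrome", port = 4444L)
remote_driver <- driver[["client"]]
remote_driver$navigate("https://elephrame.com/textbook/BLM/chart")
Sys.sleep(5)

all_items <- list()
for (i in 1:336) {
  # parse the current page with rvest and keep the protest items
  page <- read_html(remote_driver$getPageSource()[[1]])
  all_items[[i]] <- page %>% html_elements("div.item.chart")
  # click the "next page" button (same XPath as in the Python example)
  next_btn <- remote_driver$findElement(using = "xpath", '//*[@id="blm-results"]/div[3]/ul/li[4]')
  next_btn$clickElement()
  Sys.sleep(5)
}

remote_driver$close()
driver[["server"]]$stop()

Compare this sketch with the full Python script linked above to see how the steps map one to one.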
Lab 3 Part 2. Basic Intro to R and Preprocessing Textual Data with R
We were able to successfully scrape the BLM protest events dataset. You can access the dataset at https://yongjunzhang.com/files/blm-data.tsv
We need to load some packages for use
if (!requireNamespace("pacman"))
install.packages('pacman')
## Loading required namespace: pacman
library(pacman)
<-c("tidyverse","tidytext","rvest", "RSelenium","coreNLP",
packages"tm","haven","readxl","here","knitr","stopwords")
p_load(packages,character.only = TRUE)
Load blm-data tsv file
The R tidyverse package provides a series of useful data wrangling tools. You can check it out at https://www.tidyverse.org/. The tidyverse also installs a number of other packages for reading data:
DBI for relational databases. You'll need to pair DBI with a database-specific backend like RSQLite, RMariaDB, RPostgres, or odbc. Learn more at https://db.rstudio.com.
haven for SPSS, Stata, and SAS data.
httr for web APIs (see the short API example after this list).
readxl for .xls and .xlsx sheets.
rvest for web scraping.
jsonlite for JSON. (Maintained by Jeroen Ooms.)
xml2 for XML.
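As promised above, here is a minimal, hedged sketch of calling a JSON web API with httr and jsonlite. The endpoint and query parameters below are purely hypothetical placeholders; swap in a real API (and any required API key) that you have access to.

# A minimal, hedged sketch of a JSON API call with httr + jsonlite
# NOTE: the endpoint and query parameters are hypothetical placeholders
library(httr)
library(jsonlite)

resp <- GET(
  "https://api.example.com/v1/protests",            # hypothetical endpoint
  query = list(city = "Minneapolis", year = 2020),  # hypothetical parameters
  user_agent("soc591-lab3")
)
stop_for_status(resp)
api_data <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))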
# read tsv file using read_tsv
data <- read_tsv(url("https://yongjunzhang.com/files/blm-data.tsv"))
## Rows: 6701 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (8): protest_location, protest_start, protest_end, protest_subject, prot...
## dbl (2): page_id, protest_id
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Show a table for visual check
knitr::kable(data[1:3,], cap="Black Lives Matter Protest (elephrame.com)")
page_id | protest_id | protest_location | protest_start | protest_end | protest_subject | protest_participants | protest_time | protest_description | protest_urls |
---|---|---|---|---|---|---|---|---|---|
1 | 6730 | London, England - happened? | Tuesday, October 19, 2021 | - Present | Subject(s): Local - Monument - Robert Geffyre, History - Slavery | Participant(s): Unclear | Time: Continuous | Description: Boycott of Museum of the Home over statue of Geffrye, who made money from slave trade | Source(s):##https://www.hackneycitizen.co.uk/2021/10/19/campaigners-museum-boycott-calls-teachers-families-join-slaver-statue/ |
1 | 4538 | Minneapolis, MN | Friday, June 5, 2020 | - Present | Subject(s): General - Police Brutality | Participant(s): 15 | Time: Continuous | Description: “Say Their Names” symbolic cemetery of 100 Black people killed by police | Source(s):##https://www.minneapolis.org/support-black-lives/38th-and-chicago/##https://theconversation.com/black-deaths-matter-the-centuries-old-struggle-to-memorialize-slaves-and-victims-of-racism-140613##https://www.mndaily.com/article/2020/06/say-their-names-cemetery##https://twitter.com/TwinCityReports/status/1272286571929178112##https://twitter.com/TwinCityReports/status/1272718480148660224 |
1 | 2948 | Minneapolis, MN | Tuesday, May 26, 2020 | - Present | Subject(s): George Floyd | Participant(s): Hundreds-Thousands (est.) | Time: Continuous | Description: Makeshift memorial at site where Floyd was killed | Source(s):##https://twitter.com/bengarvin/status/1267291981266530305##https://twitter.com/stribrooks/status/1266005779158568963##https://twitter.com/OmarJimenez/status/1265394526379806720 |
Clean blm-data
Let us say we need to create variables like state and city; we also want to clean some variables like subjects, description, etc.
data <- data %>%
  # split location
separate(protest_location,c("city","state"),sep = ",",remove = T) %>%
# split protest start time
separate(protest_start,c("day","date","year"),sep = ",",remove = T) %>%
# clean subjects and participants
mutate(
protest_subject = str_replace(protest_subject,"Subject\\(s\\): ",""),
protest_participants = str_replace(protest_participants,"Participant\\(s\\): ",""),
protest_time = str_replace(protest_time,"Time: ",""),
protest_description = str_replace(protest_description,"Description: ",""),
protest_urls = str_replace(protest_urls,"Source\\(s\\):","")
)
## Warning: Expected 2 pieces. Additional pieces discarded in 575 rows [52, 103,
## 140, 187, 189, 201, 207, 211, 224, 267, 293, 337, 469, 531, 686, 738, 855, 885,
## 886, 898, ...].
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 25 rows [135,
## 568, 617, 640, 737, 811, 1514, 1544, 1545, 1798, 2440, 2880, 3787, 4145, 5776,
## 5822, 5891, 5985, 6134, 6208, ...].
# Show a table for visual check
knitr::kable(data[1:3,], cap="Black Lives Matter Protest (elephrame.com)")
page_id | protest_id | city | state | day | date | year | protest_end | protest_subject | protest_participants | protest_time | protest_description | protest_urls |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 6730 | London | England - happened? | Tuesday | October 19 | 2021 | - Present | Local - Monument - Robert Geffyre, History - Slavery | Unclear | Continuous | Boycott of Museum of the Home over statue of Geffrye, who made money from slave trade | ##https://www.hackneycitizen.co.uk/2021/10/19/campaigners-museum-boycott-calls-teachers-families-join-slaver-statue/ |
1 | 4538 | Minneapolis | MN | Friday | June 5 | 2020 | - Present | General - Police Brutality | 15 | Continuous | “Say Their Names” symbolic cemetery of 100 Black people killed by police | ##https://www.minneapolis.org/support-black-lives/38th-and-chicago/##https://theconversation.com/black-deaths-matter-the-centuries-old-struggle-to-memorialize-slaves-and-victims-of-racism-140613##https://www.mndaily.com/article/2020/06/say-their-names-cemetery##https://twitter.com/TwinCityReports/status/1272286571929178112##https://twitter.com/TwinCityReports/status/1272718480148660224 |
1 | 2948 | Minneapolis | MN | Tuesday | May 26 | 2020 | - Present | George Floyd | Hundreds-Thousands (est.) | Continuous | Makeshift memorial at site where Floyd was killed | ##https://twitter.com/bengarvin/status/1267291981266530305##https://twitter.com/stribrooks/status/1266005779158568963##https://twitter.com/OmarJimenez/status/1265394526379806720 |
Using the tidytext package to process some variables
There are a variety of text processing packages. Today we briefly introduce the tidytext package. You can check here: https://cran.r-project.org/web/packages/tidytext/vignettes/tidytext.html. This tidytext tutorial heavily relies on Julia Silge and David Robinson's work. You can also check their book Text Mining with R here: https://www.tidytextmining.com/
library(tidytext)
# Let us say we are interested in protest description. We need to restructure it as one-token-per-row format. The unnest_tokens function is a way to convert a dataframe with a text column to be one-token-per-row:
tidy_data <- data %>%
  # protest_urls is messy, let us get rid of it first
select(-protest_urls) %>%
# one token per row. This function uses the tokenizers package to separate each line into words. The default tokenizing is for words, but other options include characters, ngrams, sentences, lines, paragraphs, or separation around a regex pattern.
unnest_tokens(word, protest_description) %>%
# remove stop words
anti_join(tidytext::get_stopwords("en",source="snowball")) %>%
# you can also add your own stop words if you want
# check here to see tibble data structure <https://tibble.tidyverse.org/>
anti_join(tibble(word=c("no","details")),by="word")
## Joining, by = "word"
knitr::kable(tidy_data[1:10,], cap="Black Lives Matter Protest (elephrame.com)")
page_id | protest_id | city | state | day | date | year | protest_end | protest_subject | protest_participants | protest_time | word |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 6730 | London | England - happened? | Tuesday | October 19 | 2021 | - Present | Local - Monument - Robert Geffyre, History - Slavery | Unclear | Continuous | boycott |
1 | 6730 | London | England - happened? | Tuesday | October 19 | 2021 | - Present | Local - Monument - Robert Geffyre, History - Slavery | Unclear | Continuous | museum |
1 | 6730 | London | England - happened? | Tuesday | October 19 | 2021 | - Present | Local - Monument - Robert Geffyre, History - Slavery | Unclear | Continuous | home |
1 | 6730 | London | England - happened? | Tuesday | October 19 | 2021 | - Present | Local - Monument - Robert Geffyre, History - Slavery | Unclear | Continuous | statue |
1 | 6730 | London | England - happened? | Tuesday | October 19 | 2021 | - Present | Local - Monument - Robert Geffyre, History - Slavery | Unclear | Continuous | geffrye |
1 | 6730 | London | England - happened? | Tuesday | October 19 | 2021 | - Present | Local - Monument - Robert Geffyre, History - Slavery | Unclear | Continuous | made |
1 | 6730 | London | England - happened? | Tuesday | October 19 | 2021 | - Present | Local - Monument - Robert Geffyre, History - Slavery | Unclear | Continuous | money |
1 | 6730 | London | England - happened? | Tuesday | October 19 | 2021 | - Present | Local - Monument - Robert Geffyre, History - Slavery | Unclear | Continuous | slave |
1 | 6730 | London | England - happened? | Tuesday | October 19 | 2021 | - Present | Local - Monument - Robert Geffyre, History - Slavery | Unclear | Continuous | trade |
1 | 4538 | Minneapolis | MN | Friday | June 5 | 2020 | - Present | General - Police Brutality | 15 | Continuous | say |
Part 3. Basic Analysis of Textual Data
Let us get word counts for the protest descriptions, e.g., the most frequent words or bigrams.
tidy_data %>%
  count(word, sort = TRUE)
## # A tibble: 4,628 × 2
## word n
## <chr> <int>
## 1 demonstration 2005
## 2 included 1995
## 3 march 1284
## 4 rally 1002
## 5 outside 575
## 6 police 502
## 7 national 478
## 8 walkout 458
## 9 game 432
## 10 anthem 426
## # … with 4,618 more rows
We can further plot this! I don’t like wordcloud, so I just do a simple bar plot.
library(ggplot2)
tidy_data %>%
  count(word, sort = TRUE) %>%
filter(n > 100) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n)) +
geom_col() +
xlab(NULL) +
coord_flip()
Let us get bigrams.
data %>%
  select(-protest_urls) %>%
unnest_tokens(bigram, protest_description,token = "ngrams", n = 2) %>%
count(bigram,sort = TRUE)
## # A tibble: 14,844 × 2
## bigram n
## <chr> <int>
## 1 for this 1968
## 2 no details 1968
## 3 this demonstration 1954
## 4 details included 1953
## 5 included for 1953
## 6 national anthem 408
## 7 healthcare workers 389
## 8 walkout of 388
## 9 prior to 387
## 10 during national 383
## # … with 14,834 more rows
You can check the Text Mining with R book's tf-idf chapter here: https://www.tidytextmining.com/tfidf.html#tfidf (a short tf-idf example follows the word counts below).
tidy_words <- tidy_data %>%
  count(protest_id, word, sort = TRUE)
tidy_words
## # A tibble: 34,317 × 3
## protest_id word n
## <dbl> <chr> <int>
## 1 2414 march 4
## 2 1370 square 3
## 3 1480 high 3
## 4 2556 police 3
## 5 2767 city 3
## 6 4501 center 3
## 7 4753 workers 3
## 8 5762 nevada 3
## 9 6020 police 3
## 10 11 w 2
## # … with 34,307 more rows
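Following the tf-idf chapter linked above, here is a short sketch that weights each word by tf-idf, treating each protest_id as a document:

# A short sketch: tf-idf by protest, following the tidytext book chapter linked above
tidy_words %>%
  bind_tf_idf(word, protest_id, n) %>%
  arrange(desc(tf_idf))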
Part 4. Let us use R to replicate some webscraping
Let us say we want to dig deeper into the BLM data. The blm-data.tsv file provides the original protest urls (news articles). We want to process those original articles to get more info.
Let us use httr or rvest to scrape a couple of news articles as an example.
# create a dataset having urls, year, and ids
urls <- data %>%
  select(protest_id, state, date, protest_urls) %>%
# let us extract all protest_urls
mutate(protest_urls=str_replace_all(protest_urls,"^##|^#","")) %>%
separate_rows(protest_urls,sep="##") %>%
filter(str_detect(protest_urls,"^http"))
# only keep one url for each protest
urls1 <- urls %>%
  distinct(protest_id, .keep_all = T)
# let us take a look at the data
knitr::kable(urls1[1:5,], cap="Black Lives Matter Protest (elephrame.com)")
protest_id | state | date | protest_urls |
---|---|---|---|
6730 | England - happened? | October 19 | https://www.hackneycitizen.co.uk/2021/10/19/campaigners-museum-boycott-calls-teachers-families-join-slaver-statue/ |
4538 | MN | June 5 | https://www.minneapolis.org/support-black-lives/38th-and-chicago/ |
2948 | MN | May 26 | https://twitter.com/bengarvin/status/1267291981266530305 |
6754 | MN | February 5 | https://www.startribune.com/protest-of-police-killing-of-amir-locke-draws-hundreds-to-minneapolis/600143527/ |
6755 | MN | February 4 | https://minnesota.cbslocal.com/2022/02/05/car-caravan-rolls-through-downtown-minneapolis-in-protest-of-amir-locke-killing/ |
Use the httr and rvest packages to access the articles. Note that many news articles require a special subscription to access, such as Facebook, WP, etc.
rvest and httr have a lot of functions. Here is an overview (credit to GitHub user yusuzech):
knitr::include_graphics('rvest_httr.png')
library(rvest)
library(httr)
##
## Attaching package: 'httr'
## The following object is masked from 'package:NLP':
##
## content
url <- urls1$protest_urls[1]
url
## [1] "https://www.hackneycitizen.co.uk/2021/10/19/campaigners-museum-boycott-calls-teachers-families-join-slaver-statue/"
# rvest provides two ways of making a request: read_html() and html_session()
# read_html() can parse a HTML file or an url into an xml document.
# html_session() is built on GET() from the httr package.

# make a GET request and parse the website into an xml document
pagesource <- read_html(url)

# use html_session, which creates a session and accepts httr methods
my_session <- session(url)

# html_session is built upon httr, so you can access the response from a session
response <- my_session$response

# retrieve content as raw
content_raw <- content(my_session$response, as = "raw")
# retrieve content as text
content_text <- content(my_session$response, as = "text")
# retrieve content as parsed (parsed automatically)
content_parsed <- content(my_session$response, as = "parsed")
Obviously it returns a bunch of messy stuff. You need to use RSelenium to scrape these dynamic websites; we have learned the basic Selenium workflow in Python. You can read this tutorial for more details: https://ropensci.org/tutorials/rselenium_tutorial/
Here let me briefly show you that we can also do this in R.
# connect to chrome driver
driver <- RSelenium::rsDriver(browser = "chrome", port = 4443L)
remote_driver <- driver[["client"]]
remote_driver$navigate(url)
# retrieve the article
main_article <- remote_driver$findElement(using = "class", value = "p402_premium")
text <- main_article$getElementText()
text is a messy list; we need to do some cleaning again.
# let us clean those special characters like \n, \t, etc.
tidy_text <- text[[1]] %>%
  # replace all whitespace characters (regex \s) with spaces
  str_replace_all("\\s", " ") %>%
  # remove some weird punctuation
  str_replace_all('\\"', "") %>%
  # remove double spaces
  str_squish %>%
  # remove spaces at the beginning and end of the text
  str_trim %>%
  # lower case
  tolower()
tidy_text
VOILA, we have a nice tidy text!!!