Lab6-Text Analysis 101 Using R
Learning Objectives
This tutorial introduces basic ways to preprocess textual data in R before modeling it. We will cover:
How to read, clean, and transform text data
How to preprocess data such as tokenization, removing stop words, lemmatization, stemming, and representing words in R
How to get basic statistics from texts using lexicon methods
In the previous tutorial, we covered some basics: how to read and save files in R, how to recognize regular expressions (regex), and how to use Selenium for web scraping.
We were able to successfully scrape the BLM protest events dataset. You can access the dataset
Note that some of the code in this lab comes from Lab 3.
Code Challenges
- Submit a script showing that you can run Selenium and actually gather the data.
- Scrape the original news articles using the “blm-data.tsv” dataset (it has the source URLs) and build a structured text dataset. Some of these URLs might not work, or you might not have access to them. It is fine to skip those.
- You have two weeks to finish this.
Intro to Preprocessing Textual Data with R
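We need to load some packages:
library(pacman) # provides p_load(), which installs missing packages and then attaches them
packages <- c("tidyverse","tidytext","rvest", "RSelenium","httr") # the list is cut off after "httr" in the source; extend it as needed
p_load(packages,character.only = TRUE)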
Load blm-data tsv file
The tidyverse package provides a series of useful data wrangling tools. You can check it here. The tidyverse also installs a number of other packages for reading data:
DBI for relational databases. You’ll need to pair DBI with a database-specific backend such as RSQLite, RMariaDB, RPostgres, or odbc. Learn more at
haven for SPSS, Stata, and SAS data.
httr for web APIs.
readxl for .xls and .xlsx sheets.
rvest for web scraping.
jsonlite for JSON. (Maintained by Jeroen Ooms.)
xml2 for XML.
# read the TSV file using read_tsv
data <- read_tsv(url(""))
# Show a table for visual check
knitr::kable(data[1:3,],caption="Black Lives Matter Protest (")
page_id | protest_id | protest_location | protest_start | protest_end | protest_subject | protest_participants | protest_time | protest_description | protest_urls |
1 | 6730 | London, England - happened? | Tuesday, October 19, 2021 | - Present | Subject(s): Local - Monument - Robert Geffyre, History - Slavery | Participant(s): Unclear | Time: Continuous | Description: Boycott of Museum of the Home over statue of Geffrye, who made money from slave trade | Source(s):## |
1 | 4538 | Minneapolis, MN | Friday, June 5, 2020 | - Present | Subject(s): General - Police Brutality | Participant(s): 15 | Time: Continuous | Description: “Say Their Names” symbolic cemetery of 100 Black people killed by police | Source(s):## |
1 | 2948 | Minneapolis, MN | Tuesday, May 26, 2020 | - Present | Subject(s): George Floyd | Participant(s): Hundreds-Thousands (est.) | Time: Continuous | Description: Makeshift memorial at site where Floyd was killed | Source(s):## |
Clean blm-data
Let us say we need to create variables like state and city; we also want to clean some variables like subjects, description, etc.
data <- data %>%
  # split location into city and state
  separate(protest_location,c("city","state"),sep = ",",remove = T) %>%
  # split protest start time into day, date, and year
  separate(protest_start,c("day","date","year"),sep = ",",remove = T) %>%
  # clean subjects, participants, etc. by stripping the field labels
  mutate(
    protest_subject = str_replace(protest_subject,"Subject\\(s\\): ",""),
    protest_participants = str_replace(protest_participants,"Participant\\(s\\): ",""),
    protest_time = str_replace(protest_time,"Time: ",""),
    protest_description = str_replace(protest_description,"Description: ",""),
    protest_urls = str_replace(protest_urls,"Source\\(s\\):","")
  )
# Show a table for visual check
knitr::kable(data[1:3,],caption="Black Lives Matter Protest (")
page_id | protest_id | city | state | day | date | year | protest_end | protest_subject | protest_participants | protest_time | protest_description | protest_urls |
1 | 6730 | London | England - happened? | Tuesday | October 19 | 2021 | - Present | Local - Monument - Robert Geffyre, History - Slavery | Unclear | Continuous | Boycott of Museum of the Home over statue of Geffrye, who made money from slave trade | ## |
1 | 4538 | Minneapolis | MN | Friday | June 5 | 2020 | - Present | General - Police Brutality | 15 | Continuous | “Say Their Names” symbolic cemetery of 100 Black people killed by police | ## |
1 | 2948 | Minneapolis | MN | Tuesday | May 26 | 2020 | - Present | George Floyd | Hundreds-Thousands (est.) | Continuous | Makeshift memorial at site where Floyd was killed | ## |
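To see what separate() and str_replace() are doing, here is a minimal sketch on two made-up rows that imitate the raw layout. The object names toy and toy_clean are just for illustration. One small difference from the code above: splitting on ", " (comma plus space) also consumes the space, whereas splitting on "," alone leaves a leading blank in state.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Toy rows imitating the raw layout of blm-data.tsv (values are made up)
toy <- tibble::tibble(
  protest_location = c("Minneapolis, MN", "London, England"),
  protest_subject  = c("Subject(s): George Floyd", "Subject(s): History - Slavery")
)

toy_clean <- toy %>%
  # sep is a regular expression; ", " removes the comma and the space after it
  separate(protest_location, c("city", "state"), sep = ", ", remove = TRUE) %>%
  # parentheses are regex metacharacters, so they are escaped with \\
  mutate(protest_subject = str_replace(protest_subject, "Subject\\(s\\): ", ""))

toy_clean
```

If a location has more than one comma, separate() will warn and drop the extra pieces, so it is worth checking the warnings on the real data.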
Using the tidytext package to process some variables
There are a variety of text processing packages. Today we briefly introduce the tidytext package. You can check here. This tidytext tutorial relies heavily on Julia Silge and David Robinson’s work. You can also check their book Text Mining with R here.
# Let us say we are interested in the protest description. We need to restructure it into a one-token-per-row format. The unnest_tokens function converts a data frame with a text column into one-token-per-row form:
tidy_data <- data %>%
  # protest_urls is messy, let us get rid of it first
  select(-protest_urls) %>%
  # one token per row. This function uses the tokenizers package to separate each line into words. The default tokenizing is for words, but other options include characters, ngrams, sentences, lines, paragraphs, or separation around a regex pattern.
  unnest_tokens(word, protest_description) %>%
  # remove stop words
  anti_join(tidytext::get_stopwords("en",source="snowball"))
# you can also add your own stop words if you want
# check here to see tibble data structure <>
## Joining, by = "word"
knitr::kable(tidy_data[1:10,],caption="Black Lives Matter Protest (")
page_id | protest_id | city | state | day | date | year | protest_end | protest_subject | protest_participants | protest_time | word |
1 | 6730 | London | England - happened? | Tuesday | October 19 | 2021 | - Present | Local - Monument - Robert Geffyre, History - Slavery | Unclear | Continuous | boycott |
1 | 6730 | London | England - happened? | Tuesday | October 19 | 2021 | - Present | Local - Monument - Robert Geffyre, History - Slavery | Unclear | Continuous | museum |
1 | 6730 | London | England - happened? | Tuesday | October 19 | 2021 | - Present | Local - Monument - Robert Geffyre, History - Slavery | Unclear | Continuous | home |
1 | 6730 | London | England - happened? | Tuesday | October 19 | 2021 | - Present | Local - Monument - Robert Geffyre, History - Slavery | Unclear | Continuous | statue |
1 | 6730 | London | England - happened? | Tuesday | October 19 | 2021 | - Present | Local - Monument - Robert Geffyre, History - Slavery | Unclear | Continuous | geffrye |
1 | 6730 | London | England - happened? | Tuesday | October 19 | 2021 | - Present | Local - Monument - Robert Geffyre, History - Slavery | Unclear | Continuous | made |
1 | 6730 | London | England - happened? | Tuesday | October 19 | 2021 | - Present | Local - Monument - Robert Geffyre, History - Slavery | Unclear | Continuous | money |
1 | 6730 | London | England - happened? | Tuesday | October 19 | 2021 | - Present | Local - Monument - Robert Geffyre, History - Slavery | Unclear | Continuous | slave |
1 | 6730 | London | England - happened? | Tuesday | October 19 | 2021 | - Present | Local - Monument - Robert Geffyre, History - Slavery | Unclear | Continuous | trade |
1 | 4538 | Minneapolis | MN | Friday | June 5 | 2020 | - Present | General - Police Brutality | 15 | Continuous | say |
# let us see what stopwords we are excluding
stopwords_d <- tidytext::get_stopwords()
stopwords_d$word
## [1] "i" "me" "my" "myself" "we"
## [6] "our" "ours" "ourselves" "you" "your"
## [11] "yours" "yourself" "yourselves" "he" "him"
## [16] "his" "himself" "she" "her" "hers"
## [21] "herself" "it" "its" "itself" "they"
## [26] "them" "their" "theirs" "themselves" "what"
## [31] "which" "who" "whom" "this" "that"
## [36] "these" "those" "am" "is" "are"
## [41] "was" "were" "be" "been" "being"
## [46] "have" "has" "had" "having" "do"
## [51] "does" "did" "doing" "would" "should"
## [56] "could" "ought" "i'm" "you're" "he's"
## [61] "she's" "it's" "we're" "they're" "i've"
## [66] "you've" "we've" "they've" "i'd" "you'd"
## [71] "he'd" "she'd" "we'd" "they'd" "i'll"
## [76] "you'll" "he'll" "she'll" "we'll" "they'll"
## [81] "isn't" "aren't" "wasn't" "weren't" "hasn't"
## [86] "haven't" "hadn't" "doesn't" "don't" "didn't"
## [91] "won't" "wouldn't" "shan't" "shouldn't" "can't"
## [96] "cannot" "couldn't" "mustn't" "let's" "that's"
## [101] "who's" "what's" "here's" "there's" "when's"
## [106] "where's" "why's" "how's" "a" "an"
## [111] "the" "and" "but" "if" "or"
## [116] "because" "as" "until" "while" "of"
## [121] "at" "by" "for" "with" "about"
## [126] "against" "between" "into" "through" "during"
## [131] "before" "after" "above" "below" "to"
## [136] "from" "up" "down" "in" "out"
## [141] "on" "off" "over" "under" "again"
## [146] "further" "then" "once" "here" "there"
## [151] "when" "where" "why" "how" "all"
## [156] "any" "both" "each" "few" "more"
## [161] "most" "other" "some" "such" "no"
## [166] "nor" "not" "only" "own" "same"
## [171] "so" "than" "too" "very" "will"
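If you want to add your own stop words, one option is to append them to the Snowball list before the anti_join. The sketch below uses two made-up descriptions, and the custom words ("march", "rally") are only an assumption about what you might consider uninformative in your own project.

```r
library(dplyr)
library(tidytext)

# Two made-up protest descriptions
toy <- tibble::tibble(
  protest_id = c(1, 2),
  protest_description = c("March to the police station",
                          "Rally outside the city hall")
)

# Hypothetical custom list: words treated as uninformative for this project
custom_stopwords <- tibble::tibble(word = c("march", "rally"))

# Combine the Snowball list with the custom words
all_stopwords <- get_stopwords("en", source = "snowball") %>%
  select(word) %>%
  bind_rows(custom_stopwords)

toy_tokens <- toy %>%
  unnest_tokens(word, protest_description) %>%  # lower-cases and splits into words
  anti_join(all_stopwords, by = "word")         # drop rows whose word is in the list

toy_tokens
```

Passing by = "word" explicitly also silences the "Joining, by" message.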
Basic Analysis of Textual Data
Let us get a count vector for the protest descriptions, i.e. the most frequent words or bigrams.
tidy_data %>%
  count(word, sort = TRUE)
## # A tibble: 4,628 × 2
## word n
## <chr> <int>
## 1 demonstration 2005
## 2 included 1995
## 3 march 1284
## 4 rally 1002
## 5 outside 575
## 6 police 502
## 7 national 478
## 8 walkout 458
## 9 game 432
## 10 anthem 426
## # … with 4,618 more rows
We can further plot this! I don’t like wordcloud, so I just do a simple bar plot.
tidy_data %>%
  count(word, sort = TRUE) %>%
  filter(n > 100) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  # assumed final layer (the source ends on "+"): flip axes so the words are readable
  coord_flip()
Let us get bigrams:
data %>%
  select(-protest_urls) %>%
  unnest_tokens(bigram, protest_description,token = "ngrams", n = 2) %>%
  count(bigram,sort = TRUE)
## # A tibble: 14,844 × 2
## bigram n
## <chr> <int>
## 1 for this 1968
## 2 no details 1968
## 3 this demonstration 1954
## 4 details included 1953
## 5 included for 1953
## 6 national anthem 408
## 7 healthcare workers 389
## 8 walkout of 388
## 9 prior to 387
## 10 during national 383
## # … with 14,834 more rows
Note that you can use joining functions, such as inner_join, anti_join, and semi_join, to filter these words or n-grams.
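For example, many of the top bigrams above contain stop words ("for this", "walkout of"). A minimal sketch of one way to drop them, assuming made-up descriptions and illustrative column names word1 and word2: split each bigram into its two words, anti_join each word against the stop word list, and glue the survivors back together.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Two made-up protest descriptions
toy <- tibble::tibble(
  protest_id = c(1, 2),
  protest_description = c("Walkout of healthcare workers",
                          "Kneeling during the national anthem")
)

stop_tbl <- get_stopwords("en", source = "snowball")

toy_bigrams <- toy %>%
  unnest_tokens(bigram, protest_description, token = "ngrams", n = 2) %>%
  # split "w1 w2" into two columns so each word can be checked separately
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  # keep a bigram only if neither word is a stop word
  anti_join(stop_tbl, by = c("word1" = "word")) %>%
  anti_join(stop_tbl, by = c("word2" = "word")) %>%
  # put the two words back together
  unite(bigram, word1, word2, sep = " ")

toy_bigrams
```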
We can also use tidytext to build a document-term matrix or compute tf-idf. We will cover this in more depth next time when we talk about topic modeling.
In the lecture, we briefly mentioned how we represent text in NLP. In Text as Data, the authors mainly summarize the “bag of words” (BoW) approach. It is just one approach to quantifying what a document is about. How important is a word in your document or in the entire corpus (collection of documents)?
One measure of the importance of a word is its term frequency (tf), which captures how often a word occurs in a document. Some words are very frequent in a document but may not be important; in English, these include stopwords like “the”, “is”, “of”, and “and”. Depending on your research question, you may need to remove them before analysis. For other scholars, however, these words may be exactly what is of interest.
Another way is to look at a term’s inverse document frequency (idf), which decreases the weight for commonly used words and increases the weight for words that are not used very much in a collection of documents.
This can be combined with term frequency to calculate a term’s tf-idf (the two quantities multiplied together), the frequency of a term adjusted for how rarely it is used.
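To make the arithmetic concrete, here is a small base R sketch that computes tf, idf, and tf-idf by hand on made-up counts. It assumes idf is the natural log of (number of documents / number of documents containing the word); as far as we know, this is also the definition tidytext uses, so bind_tf_idf() should give matching numbers on the same data.

```r
# Toy counts: one row per (document, word) pair; values are made up
counts <- data.frame(
  doc  = c("a", "a", "b", "b"),
  word = c("march", "police", "march", "anthem"),
  n    = c(3, 1, 1, 1)
)

# tf: share of a document's tokens taken up by the word
counts$tf <- counts$n / ave(counts$n, counts$doc, FUN = sum)

# idf: natural log of (number of documents / documents containing the word)
# (each doc-word pair appears once, so rows per word = documents per word)
n_docs <- length(unique(counts$doc))
docs_with_word <- ave(counts$n, counts$word, FUN = length)
counts$idf <- log(n_docs / docs_with_word)

# tf-idf: the two quantities multiplied together
counts$tf_idf <- counts$tf * counts$idf
counts
```

Note that "march" appears in both documents, so its idf is log(2/2) = 0 and its tf-idf is 0 no matter how often it occurs.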
You can check the Text Mining with R book here.
tidy_words <- tidy_data %>%
  count(protest_id, word, sort = TRUE)
tidy_words
## # A tibble: 34,317 × 3
## protest_id word n
## <dbl> <chr> <int>
## 1 2414 march 4
## 2 1370 square 3
## 3 1480 high 3
## 4 2556 police 3
## 5 2767 city 3
## 6 4501 center 3
## 7 4753 workers 3
## 8 5762 nevada 3
## 9 6020 police 3
## 10 11 w 2
## # … with 34,307 more rows
The idea of tf-idf is to find the important words for the content of each document by decreasing the weight for commonly used words and increasing the weight for words that are not used very much in a collection or corpus of documents.
The bind_tf_idf function in the tidytext package takes a tidy text dataset as input with one row per token (term), per document. One column (word here) contains the terms/tokens, one column contains the documents (protest_id in this case), and the last necessary column contains the counts, i.e. how many times each document contains each term.
tidy_words <- tidy_words %>%
  bind_tf_idf(word, protest_id, n)
tidy_words
## # A tibble: 34,317 × 6
## protest_id word n tf idf tf_idf
## <dbl> <chr> <int> <dbl> <dbl> <dbl>
## 1 2414 march 4 0.222 1.68 0.373
## 2 1370 square 3 0.188 4.27 0.800
## 3 1480 high 3 0.176 5.20 0.917
## 4 2556 police 3 0.231 2.63 0.607
## 5 2767 city 3 0.188 3.07 0.575
## 6 4501 center 3 0.25 3.34 0.835
## 7 4753 workers 3 0.231 2.83 0.653
## 8 5762 nevada 3 0.2 8.12 1.62
## 9 6020 police 3 0.167 2.63 0.439
## 10 11 w 2 0.182 3.85 0.699
## # … with 34,307 more rows
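Let us take a look at those relatively important words:
tidy_words %>%
  arrange(desc(tf_idf)) %>%
  mutate(word = factor(word, levels = rev(unique(word)))) %>%
  # draw four random protest IDs before grouping; inside group_by(), sample() would be re-run for every group
  filter(protest_id %in% sample(2000:5000, 4, replace = F)) %>%
  group_by(protest_id) %>%
  top_n(10) %>%
  ungroup() %>%
  ggplot(aes(word, tf_idf, fill = protest_id)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~protest_id, ncol = 2, scales = "free") +
  # assumed final layer (the source ends on "+"): flip axes so the words are readable
  coord_flip()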
## Selecting by tf_idf
Use RSelenium Package to Obtain Data
Let us say we want to dig deeper into the BLM data. The blm-data.tsv file provides the original protest URLs (news articles). We want to process those original articles to get more information.
Let us use the RSelenium package to scrape a couple of news articles as an example.
Here is a tutorial by Josh McCrain
# create a dataset with protest IDs, state, date, and URLs
urls <- data %>%
  select(protest_id,state,date,protest_urls) %>%
  # let us extract all protest_urls
  mutate(protest_urls=str_replace_all(protest_urls,"^##|^#","")) %>%
  separate_rows(protest_urls,sep="##")
# only keep one url for each protest
urls1 <- urls %>%
  distinct(protest_id,.keep_all = T)
# let us take a look at the data
knitr::kable(urls1[1:5,],caption="Black Lives Matter Protest (")
protest_id | state | date | protest_urls |
6730 | England - happened? | October 19 | |
4538 | MN | June 5 | |
2948 | MN | May 26 | |
6754 | MN | February 5 | |
6755 | MN | February 4 | |
You need to use RSelenium to scrape these dynamic websites; we have learned the basics of Selenium in Python. You can read this tutorial for more details. We also have a tutorial on scraping data with Python Selenium in a previous lab.
Here, let me briefly show that we can also do this in R.
# start a Selenium server and open a Firefox browser
driver <- RSelenium::rsDriver(browser = "firefox",port=4567L, verbose=F)
remote_driver <- driver[["client"]]
remote_driver$navigate(urls1$protest_urls[1])
# retrieve the article
main_article <- remote_driver$findElement(using = "id", value="fl-post-254675")
text <- main_article$getElementText()
# stop the Selenium server when done
driver[["server"]]$stop()
## [1] TRUE
## used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 2972654 158.8 5581532 298.1 NA 5581532 298.1
## Vcells 6210446 47.4 12255594 93.6 102400 10146317 77.5
The text comes back as a messy list, so you need to do some cleaning again.
# let us clean those special characters like \n \t, etc.
tidy_text <- text[[1]] %>%
  # replace every whitespace character (\s in regex covers \n, \t, etc.) with a plain space
  str_replace_all("\\s"," ") %>%
  # remove straight double quotes
  str_replace_all('\\"',"") %>%
  # collapse repeated spaces
  str_squish() %>%
  # remove spaces at the beginning and end of the text
  str_trim() %>%
  # lower case (the call is missing in the source; str_to_lower() is an assumption)
  str_to_lower()
tidy_text
## [1] "campaigners ramp up museum boycott with calls for teachers and families to join them until slaver statue is removed by julia gregory, local democracy reporter | tuesday 19 october 2021 at 19:22 campaigners have been protesting against the statue for more than a year campaigners are calling for a boycott of a hoxton museum until the statue of a merchant “invoved in an industry which contributed to the rape, torture and murder” of enslaved people is removed. hackney stand up to racism is urging people to boycott the museum of the home until it removes the statue of sir robert geffrye, who made his some of his money from transatlantic slavery in the 17th century. they renewed their call for a boycott at the meeting house on newington green, which was associated with the campaign to abolish slavery. they want teachers, youth groups and families to stop taking trips to the museum until the statue is taken down. dalston councillor soraya adejare said robert geffrye made his money on the back of the misery of others. “it’s an affront to common decency,” she added. she said it rubbed salt into the wounds of a commnunity as diverse as hackney to see the statue of a trader who benefitted from slavery and questioned the government’s intervention to prevent statues like this being removed. cllr sade etti, hackney council’s no place for hate champion and mayoral advisor on homelessness, said: “statues of those involved in slavery ought to be pulled down and removed. it is morally reprehensible to continue to support their existence.” subira cameron-goppy from the claudia jones organisation. photograph: courtesy claudia jones organisation subira cameron-goppy from the claudia jones organisation, which is also supporting the boycott, considers robert geffyre’s involvement in slavery a hate crime. she said: “as an african caribbean community, how are we to see this statue?” removing it is “not removing history, it is truly telling the truth of history,” she added. 
the sculpture stands outside the museum, which is housed in former almshouses which robert geffrye helped to fund. he is not connected to the founding of the museum or its collections. david davies from the hackney branch of the national education union said: “we are not asking for the statue to be thrown into the river lea. we are asking for the statue to be conceptualised.” the museum would have to get planning permission to remove the statue. it could also affect the building’s grade-i listed status. hackney north and stoke newington mp diane abbott said: “the entire history of slavery and colonialism were shameful eras. we should not be honouring the slavers and colonialists, we should be disowning them and disavowing them. we should also be teaching people about the most shameful aspects of that history.” she told campaigners: “you have right and you have the future on your side. geffrye must fall!” kurdish and turkish community group day-mer also voiced its support for the statue’s removal. paint on the entrance of the museum last year the museum of the home told the citizen: “at present we have no comment to offer.” it consulted people last year about the future of the statue and most of the 2,000 who responded said it should go. earlier this year, then communities secretary robert jenrick said there will be new legal safeguards for historic monuments at risk of removal or relocation. this followed the toppling of a statue of edward colston in bristol during a black lives matter protest. colston had connections to the slave trade. previously, the museum of the home said that it was doing work to explain the story of the statue. a spokesperson said: “the first step has been to install a panel near the statue telling a fuller history of geffrye, including his connections with the forced labour and trading of enslaved africans, and acknowledging that the statue is the subject of much discussion. 
“we will confront, challenge and learn from the uncomfortable truths of the origins of the museum buildings, and fulfil our commitment to diversity and inclusion.” 1 shares share 1 tweet support us the coronavirus outbreak meant that the hackney citizen was unable to print a monthly newspaper for three months. we're grateful that we have since been able to resume printing. this would not have been possible without the generosity of our readers, whose donations kept the paper from disappearing completely at a distressing time for residents. a huge thank you to everyone who gave their time and money to support us through the lockdown, and to those who continue to do so as we slowly recover from the dramatic fall in advertising revenues, on top of the existing challenges threatening the future of local journalism. a one-off donation or a regular contribution from anyone who can afford it will help our small team keep the newspaper in print and the website running in the coming months and years. find out how you can donate. thank you for your support, and stay safe. the hackney citizen team ← drinkers face £100 fine if they misbehave in public anywhere in hackney under new powers agreed by town hall ‘it’s a big shock’: families protest outside hackney town hall as they step up fight to save two children’s centres →"
VOILA, we have a nice tidy text!!!
You can write a loop that goes through all these URLs and scrapes the entire page for each one. You can then clean the text and build a database.
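Here is a minimal sketch of what such a loop could look like. The names scrape_all, fetch_text, and fake_fetch are hypothetical; in real use, fetch_text would wrap the RSelenium calls shown above (navigate, findElement, getElementText). The tryCatch records NA for URLs that fail, so one dead link does not stop the whole run.

```r
# scrape_all() loops over URLs and collects one row per page.
# fetch_text is a hypothetical function(url) that returns the page text.
scrape_all <- function(ids, urls, fetch_text, pause = 0) {
  rows <- vector("list", length(urls))
  for (i in seq_along(urls)) {
    rows[[i]] <- tryCatch(
      data.frame(protest_id = ids[i], url = urls[i],
                 text = fetch_text(urls[i]), stringsAsFactors = FALSE),
      # a dead link or a missing element raises an error: record NA and move on
      error = function(e) {
        data.frame(protest_id = ids[i], url = urls[i],
                   text = NA_character_, stringsAsFactors = FALSE)
      }
    )
    Sys.sleep(pause)  # wait between requests to be polite to the servers
  }
  do.call(rbind, rows)
}

# Stand-in fetcher so the sketch runs without a browser
fake_fetch <- function(url) {
  if (grepl("broken", url)) stop("page not reachable")
  paste("article text from", url)
}

articles <- scrape_all(
  ids  = c(1, 2, 3),
  urls = c("https://example.org/a", "https://example.org/broken",
           "https://example.org/c"),
  fetch_text = fake_fetch
)
articles
```

Passing the fetcher in as an argument keeps the loop separate from the browser code, so you can test the loop first and plug in the Selenium part later.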
For more tutorials, you can check here again: <>
Let us then use the quanteda package to do some text processing in R (FINALLY :))
Check here for Quanteda
quanteda, Quantitative Analysis of Textual Data, is an R package for managing and analyzing textual data developed by Kenneth Benoit, Kohei Watanabe, and other contributors.
The package is designed for R users needing to apply natural language processing to texts, from documents to final analysis. Its capabilities match or exceed those provided in many end-user software applications, many of which are expensive and not open source. The package is therefore of great benefit to researchers, students, and other analysts with fewer financial resources. While using quanteda requires R programming knowledge, its API is designed to enable powerful, efficient analysis with a minimum of steps. By emphasizing consistent design, furthermore, quanteda lowers the barriers to learning and using NLP and quantitative text analysis even for proficient R programmers.
You are also encouraged to install several recommended packages, including readtext, spacyr, and quanteda.corpora.
In this part, however, we use some New York Times articles to run the analysis. If you manage to obtain all the protest articles, you can use those as well. If not, you can use the small sample of the NYT dataset. It has title_doca, text, and title_proquest. The title_doca variable allows you to merge the NYT articles with the doca data.
Note that you can download the raw doca dataset from this link: Then you can merge the doca data with the NYT articles. Ideally, you can treat the doca dataset as your training dataset and train some models to predict protest-related outcomes.
Let us build a doca nyt corpus
quanteda has a corpus constructor command, corpus(), which works directly on:
- a vector of character objects, for instance one that you have already loaded into the workspace using other tools;
- a VCorpus corpus object from the tm package;
- a data.frame containing a text column and any other document-level metadata.
doca_nyt_corpus <- corpus(doca_nyt) # build a new corpus from the texts
#summary(doca_nyt_corpus)
How a quanteda corpus works
A corpus is designed to be a “library” of original documents that have been converted to plain, UTF-8 encoded text, and stored along with meta-data at the corpus level and at the document-level. We have a special name for document-level meta-data: docvars. These are variables or features that describe attributes of each document.
A corpus is designed to be a more or less static container of texts with respect to processing and analysis. This means that the texts in corpus are not designed to be changed internally through (for example) cleaning or pre-processing steps, such as stemming or removing punctuation. Rather, texts can be extracted from the corpus as part of processing, and assigned to new objects, but the idea is that the corpus will remain as an original reference copy so that other analyses – for instance those in which stems and punctuation were required, such as analyzing a reading ease index – can be performed on the same corpus.
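To extract texts from a corpus, we use an extractor called texts(). (In quanteda 3.0 and later, texts() is deprecated in favor of as.character().)
texts(doca_nyt_corpus)[2]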
## text2
## "Homosexuals at Harvard Protesting Navy Hiring\n‘New York Times (1923-Current file); Mar 21, 1983: ProQuest Historical Newspapers: The New York Times\nps. Al2\n\nHomosexuals at Harvard\nProtesting Navy Hiring\n\nCAMBRIDGE, Mass., March 20 (AP)\n— A drive by homosexual students at\nHarvard University to hold a campus\nforum about Navy hiring practices\nthreatens the university with the loss of\n$3 million in Defense Department\nfunds, a university official said today.\n\nThe Navy has refused to attend sucha\nforum, required under university regulations when 500 or more students sign a\npetition demanding it, according to\nArchie C. Epps 3d, the dean of students.\n\nThe Harvard-Radcliffe Gay and Lesbian Students Association has collected\n400 signatures on petitions requesting a\nforum with the Navy and is expected to\nhave the required number soon, said\nGeorge A. Broadwell, the chairman.\n\nThe 1973 Military Procurement Act\nstates ‘‘no funds appropriated for the\nDepartment of Defense may be used for\nany institution of higher learning if the\nSecretary of Defense or his designee\ndetermines that recruiting personnel of\nany of the armed forces are barred by\npolicy from the institution’s premises.’’\n\nReproduced with permission of the copyright owner. Further reproduction prohibited without permission."
Exploring corpus texts: The kwic function (keywords-in-context) performs a search for a word and allows us to view the contexts in which it occurs:
## Homosexuals at Harvard |
## . Al2 Homosexuals at Harvard |
## Join in Lobby at Washington |
## Join in Lobby at Washington |
## UPSETS FACULTY 108 Top Men |
## too.. The nine |
## call the police. The |
## least three separate groups of |
## Consul Plays Host to Irish |
## Consul Plays Host to frish |
## the dispute between Catholics and |
## between the Catholics and the |
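The kwic call itself is not shown above, so here is a minimal sketch on a made-up three-document corpus. The pattern "protest*" is an assumption, chosen because glob patterns match any word starting with "protest", which would explain hits on both "Protesting" and "Protestants". In recent quanteda versions, kwic() expects a tokens object, not a corpus.

```r
library(quanteda)

# Three made-up documents
toy_corpus <- corpus(c(
  d1 = "Students are protesting Navy hiring at Harvard.",
  d2 = "The dispute between Catholics and Protestants continued.",
  d3 = "A rally was held outside city hall."
))

# "protest*" is a glob pattern (the default value type); matching is case-insensitive
hits <- kwic(tokens(toy_corpus), pattern = "protest*", window = 3)
hits
```

The window argument controls how many tokens of context are shown on each side of the keyword. A broad glob pattern can pull in unrelated words, so it is worth reading through the hits before counting them.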
Tokenize texts: To simply tokenize a text, quanteda provides a powerful command called tokens(). This produces an intermediate object, consisting of a list of tokens in the form of character vectors, where each element of the list corresponds to an input document.
tokens(texts(doca_nyt_corpus)[2],remove_numbers = TRUE, remove_punct = TRUE, remove_separators = TRUE)
## Tokens consisting of 1 document.
## text2 :
## [1] "Homosexuals" "at" "Harvard" "Protesting" "Navy"
## [6] "Hiring" "New" "York" "Times" "1923-Current"
## [11] "file" "Mar"
## [ ... and 178 more ]
Constructing a document-feature matrix
Tokenizing texts is an intermediate option, and most users will want to skip straight to constructing a document-feature matrix. For this, we have a Swiss-army knife function, called dfm(), which performs tokenization and tabulates the extracted features into a matrix of documents by features. Unlike the conservative approach taken by tokens(), the dfm() function applies certain options by default, such as tolower() – a separate function for lower-casing texts – and removes punctuation.
# make a dfm
my_dfm <- dfm(doca_nyt_corpus, remove = stopwords("english"), stem = TRUE, remove_punct = TRUE)
my_dfm[, 1:5]
## Document-feature matrix of: 2,000 documents, 5 features (80.8% sparse) and 2 docvars.
## features
## docs dinkin lead call stop street
## text1 8 3 8 6 10
## text2 0 0 0 0 0
## text3 0 0 0 0 0
## text4 0 0 3 0 0
## text5 0 1 0 1 6
## text6 0 0 2 0 2
## [ reached max_ndoc ... 1,994 more documents ]
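Note that in newer quanteda releases (3.0 and later, as far as we know), passing remove and stem directly to dfm() is deprecated. The same preprocessing can be written as explicit steps on a tokens object. Below is a minimal sketch on a made-up two-document corpus; toy_corpus, toks, and toy_dfm are illustrative names, not objects from this lab.

```r
library(quanteda)

# Two made-up documents
toy_corpus <- corpus(c(
  text1 = "The students protested outside the school.",
  text2 = "Police stopped the protesting students on the street."
))

toks <- tokens(toy_corpus, remove_punct = TRUE)    # tokenize and drop punctuation
toks <- tokens_remove(toks, stopwords("english"))  # drop English stop words
toks <- tokens_wordstem(toks)                      # stem, e.g. "protested" -> "protest"
toy_dfm <- dfm(toks)                               # tabulate; dfm() lower-cases by default

toy_dfm
topfeatures(toy_dfm, 3)
```

Keeping the steps separate also makes it easy to inspect the tokens after each stage.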
Viewing the document-feature matrix: The dfm can be inspected in the Environment pane in RStudio, or by calling R’s View function. Calling textplot_wordcloud() on a dfm will display a word cloud.
topfeatures(my_dfm, 20) # 20 top words
## said new | time york mr permiss school
## 10883 8439 6963 6711 6660 5553 4398 3716
## student state citi group polic black without protest
## 3523 3372 2978 2916 2836 2728 2674 2620
## one copyright owner year
## 2574 2485 2462 2407
textplot_wordcloud(my_dfm, min_count = 6, random_order = FALSE,
                   rotation = .25,
                   color = RColorBrewer::brewer.pal(8,"Dark2"))
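Term similarities
# note: the stem of "demonstration" is "demonstr"; "demonst" appears to match only a rare mis-scanned token
sim <- textstat_simil(my_dfm, my_dfm[, c("protest", "demonst", "student")],
                      method = "cosine", margin = "features")
lapply(as.list(sim), head, 10)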
## $protest
## permiss reproduct proquest reproduc 1923-current demonstr
## 0.5171692 0.5134348 0.5124899 0.5110776 0.5059638 0.5029151
## copyright york prohibit without
## 0.5025317 0.4985626 0.4913091 0.4880232
## $demonst
## sonspeci fre2 spcech ascend disacre safeauard nublic cerruti
## 0.4082483 0.4082483 0.4082483 0.4082483 0.4082483 0.4082483 0.4082483 0.4082483
## 16-block riotscar
## 0.4082483 0.4082483
## $student
## campus univers colleg faculti class demand dent administr
## 0.7033191 0.6401076 0.5833331 0.4972651 0.4505656 0.4388754 0.3952023 0.3910930
## without columbia
## 0.3889639 0.3881382
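With margin = "features", each term is treated as a vector of its counts across documents, and cosine similarity measures how closely two such vectors point in the same direction. Here is a small base R sketch on a made-up count matrix; the helper cosine() is written here for illustration and is not a quanteda function.

```r
# Toy document-feature counts: rows are documents, columns are features (made up)
m <- matrix(
  c(2, 1, 0,
    4, 2, 0,
    0, 0, 3),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("d1", "d2", "d3"), c("student", "campus", "police"))
)

# cosine similarity: dot product divided by the product of the vector lengths
cosine <- function(x, y) sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))

sim_student_campus <- cosine(m[, "student"], m[, "campus"])
sim_student_police <- cosine(m[, "student"], m[, "police"])
c(campus = sim_student_campus, police = sim_student_police)
```

Here "student" and "campus" always occur together in the same proportion, so their similarity is 1, while "student" and "police" never share a document, so their similarity is 0. This is also why boilerplate terms like "copyright" score high against "protest" above: they appear in nearly every article.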