Lab7 - Topic Modeling in Social Science Using R
Learning Objectives
This tutorial aims to introduce basic ways to preprocess textual data and implement topic modeling using R. We will cover:
How to read, clean, and transform text data
How to preprocess data (tokenization, stop-word removal, lemmatization, stemming) and represent words in R
How to get basic statistics from texts using lexicon methods
How to implement LDA and STM in R.
In the previous tutorial, we covered the basics of objectives 1, 2, and 3, so this tutorial mainly focuses on objective 4.
Note that some of the code in this lab tutorial comes from Lab 6.
Intro to Preprocessing Textual Data with R
We need to load some packages for use
require(pacman)
## Loading required package: pacman
p_load(tidyverse,tidytext,quanteda,knitr,stopwords)
We need to construct a document-term matrix for further quantitative textual analysis.
In a previous lecture, we briefly discussed how text is represented in NLP. In Text as Data, the authors summarize the "bag of words" (BoW) approach. It is just one way to quantify what a document is about: how important is a word in a document, or in the entire corpus (collection of documents)?
One measure of a word's importance is its term frequency (tf), the number of times it occurs in a document. Some words occur very frequently in a document yet carry little meaning; in English these are often stopwords such as "the", "is", "of", and "and". Depending on your research question, you may want to remove them before analysis, although for other scholars those very words might be of interest.
Another way is to look at a term’s inverse document frequency (idf), which decreases the weight for commonly used words and increases the weight for words that are not used very much in a collection of documents.
This can be combined with term frequency to calculate a term’s tf-idf (the two quantities multiplied together), the frequency of a term adjusted for how rarely it is used.
The idea of tf-idf is to find the important words for the content of each document by decreasing the weight for commonly used words and increasing the weight for words that are not used very much in a collection or corpus of documents.
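As a quick illustration, here is a minimal sketch using tidytext's bind_tf_idf(); the tiny word-count table toy_counts is made up just for this example.
library(dplyr)
library(tidytext)
# hypothetical toy word counts: two documents, a few words each
toy_counts <- tibble::tibble(
  doc  = c("d1", "d1", "d1", "d2", "d2", "d2"),
  word = c("the", "protest", "march", "the", "school", "protest"),
  n    = c(10, 3, 2, 8, 4, 1)
)
# tf-idf = term frequency x inverse document frequency
toy_counts %>% bind_tf_idf(term = word, document = doc, n = n)
Words that appear in every document (like "the") get an idf of 0 and thus a tf-idf of 0, while words concentrated in a single document get a higher weight.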
Let us use the quanteda package to prepare a dtm in R.
Check here for quanteda: https://quanteda.io/articles/pkgdown/quickstart.html
quanteda, Quantitative Analysis of Textual Data, is an R package for managing and analyzing textual data developed by Kenneth Benoit, Kohei Watanabe, and other contributors.
The package is designed for R users needing to apply natural language processing to texts, from documents to final analysis. Its capabilities match or exceed those provided in many end-user software applications, many of which are expensive and not open source. The package is therefore of great benefit to researchers, students, and other analysts with fewer financial resources. While using quanteda requires R programming knowledge, its API is designed to enable powerful, efficient analysis with a minimum of steps. By emphasizing consistent design, furthermore, quanteda lowers the barriers to learning and using NLP and quantitative text analysis even for proficient R programmers.
You are also encouraged to install several recommended packages, including readtext, spacyr, and quanteda.corpora.
We use some New York Times articles to run the analysis. The nyt dataset has title_doca, text, and title_proquest. The title_doca variable allows you to merge nyt articles with the doca data.
Note that you can download the raw doca dataset from this link: https://web.stanford.edu/group/collectiveaction/cgi-bin/drupal/. Then you can merge the doca data with the nyt articles. Ideally you can treat the doca dataset as your TRAINING dataset and train models to predict protest-related outcomes.
load("./doca_nyt.rdata")
Let us build a doca nyt corpus
Quanteda has a corpus constructor command corpus(), which can take:
- a vector of character objects, for instance one you have already loaded into the workspace using other tools;
- a VCorpus corpus object from the tm package;
- a data.frame containing a text column and any other document-level metadata.
doca_nyt_corpus <- corpus(doca_nyt) # build a new corpus from the texts
#summary(doca_nyt_corpus)
How a quanteda corpus works
A corpus is designed to be a “library” of original documents that have been converted to plain, UTF-8 encoded text, and stored along with meta-data at the corpus level and at the document-level. We have a special name for document-level meta-data: docvars. These are variables or features that describe attributes of each document.
A corpus is designed to be a more or less static container of texts with respect to processing and analysis. This means that the texts in corpus are not designed to be changed internally through (for example) cleaning or pre-processing steps, such as stemming or removing punctuation. Rather, texts can be extracted from the corpus as part of processing, and assigned to new objects, but the idea is that the corpus will remain as an original reference copy so that other analyses – for instance those in which stems and punctuation were required, such as analyzing a reading ease index – can be performed on the same corpus.
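For example, here is a brief sketch of inspecting the corpus without altering it; ndoc() and docvars() are standard quanteda helpers, and the exact docvar columns depend on what doca_nyt contained (e.g., title_doca and title_proquest).
ndoc(doca_nyt_corpus)          # number of documents in the corpus
head(docvars(doca_nyt_corpus)) # document-level metadata (docvars)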
To extract texts from a corpus, we use an extractor, called texts().
texts(doca_nyt_corpus)[2]
## text2
## "Homosexuals at Harvard Protesting Navy Hiring\n‘New York Times (1923-Current file); Mar 21, 1983: ProQuest Historical Newspapers: The New York Times\nps. Al2\n\nHomosexuals at Harvard\nProtesting Navy Hiring\n\nCAMBRIDGE, Mass., March 20 (AP)\n— A drive by homosexual students at\nHarvard University to hold a campus\nforum about Navy hiring practices\nthreatens the university with the loss of\n$3 million in Defense Department\nfunds, a university official said today.\n\nThe Navy has refused to attend sucha\nforum, required under university regulations when 500 or more students sign a\npetition demanding it, according to\nArchie C. Epps 3d, the dean of students.\n\nThe Harvard-Radcliffe Gay and Lesbian Students Association has collected\n400 signatures on petitions requesting a\nforum with the Navy and is expected to\nhave the required number soon, said\nGeorge A. Broadwell, the chairman.\n\nThe 1973 Military Procurement Act\nstates ‘‘no funds appropriated for the\nDepartment of Defense may be used for\nany institution of higher learning if the\nSecretary of Defense or his designee\ndetermines that recruiting personnel of\nany of the armed forces are barred by\npolicy from the institution’s premises.’’\n\nReproduced with permission of the copyright owner. Further reproduction prohibited without permission."
Tokenize texts: To simply tokenize a text, quanteda provides a powerful command called tokens(). This produces an intermediate object, consisting of a list of tokens in the form of character vectors, where each element of the list corresponds to an input document.
tokens(texts(doca_nyt_corpus)[2],remove_numbers = TRUE, remove_punct = TRUE, remove_separators = TRUE)
## Tokens consisting of 1 document.
## text2 :
## [1] "Homosexuals" "at" "Harvard" "Protesting" "Navy"
## [6] "Hiring" "New" "York" "Times" "1923-Current"
## [11] "file" "Mar"
## [ ... and 178 more ]
Constructing a document-feature matrix
Tokenizing texts is an intermediate option, and most users will want to skip straight to constructing a document-feature matrix. For this, we have a Swiss-army knife function, called dfm(), which performs tokenization and tabulates the extracted features into a matrix of documents by features. Unlike the conservative approach taken by tokens(), the dfm() function applies certain options by default, such as tolower() – a separate function for lower-casing texts – and removes punctuation.
# make a dfm
my_dfm <- dfm(doca_nyt_corpus,
              tolower = TRUE,
              remove = stopwords("english"),
              stem = TRUE,
              remove_punct = TRUE,
              remove_numbers = TRUE,
              remove_symbols = TRUE,
              verbose = TRUE)
## Creating a dfm from a corpus input...
## ...lowercasing
## ...found 2,000 documents, 55,076 features
## ...removed 168 features
## ...stemming types (English)
## ...complete, elapsed time: 2.51 seconds.
## Finished constructing a 2,000 x 41,441 sparse dfm.
my_dfm[, 1:5]
## Document-feature matrix of: 2,000 documents, 5 features (80.8% sparse) and 2 docvars.
## features
## docs dinkin lead call stop street
## text1 8 3 8 6 10
## text2 0 0 0 0 0
## text3 0 0 0 0 0
## text4 0 0 3 0 0
## text5 0 1 0 1 6
## text6 0 0 2 0 2
## [ reached max_ndoc ... 1,994 more documents ]
Viewing the document-feature matrix: The dfm can be inspected in the Environment pane in RStudio, or by calling R's View() function. You can also plot a word cloud of the dfm, as shown below with textplot_wordcloud().
topfeatures(my_dfm, 20) # 20 top words
## said new time york mr permiss school student
## 10883 8439 6711 6660 5553 4398 3716 3523
## state citi group polic black without protest one
## 3372 2978 2916 2836 2728 2674 2620 2574
## copyright owner year prohibit
## 2485 2462 2407 2390
set.seed(100)
textplot_wordcloud(my_dfm, min_count = 6, random_order = FALSE,
rotation = .25,
color = RColorBrewer::brewer.pal(8,"Dark2"))
Latent Dirichlet Allocation
We will start with simple Latent Dirichlet Allocation with Gibbs sampling. This part is adapted from Ethen Liu's intuitive demo of LDA. For a more detailed intuitive or technical introduction, please check our lecture slides or Blei's article. You can also check here for Gibbs sampling.
Latent Dirichlet Allocation (LDA) is a probabilistic topic modeling method that allows us to discover and annotate themes in texts. The key assumptions are as follows (see Mohr and Bogdanov 2013).
Each document (text) within a corpus is viewed as a bag-of-words produced according to a mixture of themes that the author of the text intended to discuss. Each theme (or topic) is a distribution over all observed words in the corpus, such that words that are strongly associated with the document’s dominant topics have a higher chance of being selected and placed in the document bag. Thus, the goal of topic modeling is to find the parameters of the LDA process that has likely generated the corpus.
Based on this week’s reading (Blei 2012), we know that the topic distribution for each document is
\[ \theta \sim Dirichlet(\alpha) \]
Where \(Dirichlet(\alpha)\) denotes the Dirichlet distribution for parameter \(\alpha\).
The word distribution for each topic is also modeled by a Dirichlet distribution, with a different parameter \(\eta\).
\[ \beta \sim Dirichlet(\eta) \]
Our goal is to estimate \(\theta\) and \(\beta\) using the observed words in documents. That is, we are trying to learn which words are important for which topic and which topics are important for a particular document, respectively.
Note that the Dirichlet distribution is a probability distribution over probability vectors, governed by the parameter \(\alpha\), which controls its concentration (sometimes called the concentration or scaling parameter). When \(\alpha\) approaches 0, each document concentrates on only a few topics; a higher value means topics are more evenly distributed across documents. The same logic applies to \(\eta\) for the topic-word distributions \(\beta\).
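To build intuition for how the concentration parameter works, here is a small base-R sketch (the helper rdirichlet_one is written just for this illustration) that draws one topic-proportion vector by normalizing Gamma draws, which is equivalent to sampling from a symmetric Dirichlet.
# draw one K-dimensional vector from a symmetric Dirichlet(alpha) by normalizing Gamma(alpha, 1) draws
rdirichlet_one <- function(alpha, K) {
  g <- rgamma(K, shape = alpha, rate = 1)
  g / sum(g)
}
set.seed(2022)
round(rdirichlet_one(alpha = 0.1, K = 5), 3) # small alpha: mass concentrated on a few topics
round(rdirichlet_one(alpha = 10,  K = 5), 3) # large alpha: topics more evenly weighted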
We will use Gibbs sampling to compute the conditional probability specified in Blei’s article (eq 2). Generally speaking, LDA is a generative model of word counts. We are interested in the conditional probability of hidden topic structure given the observed words in documents.
To simplify the demo, we will use 10 short strings to represent 10 documents (note that recent research shows that document length and the number of documents do influence results, so be careful about this). We deliberately take 5 sentences describing Chinese food and 5 sentences describing American football from Wikipedia.
Usually, before running a topic model, we need to normalize our texts as shown in the lecture (tidying text, removing stop words, white space, etc.). We often use the tidytext, tm, or quanteda packages in R to preprocess texts, but for now let us stick to basic string operations. I strongly suggest you take some time to read the quanteda tutorial.
raw_docs <- c(
  "Chinese cuisine is an important part of Chinese culture, which includes cuisine originating from the diverse regions of China, as well as from Overseas Chinese who have settled in other parts of the world.",
  "The preference for seasoning and cooking techniques of Chinese provinces depend on differences in historical background and ethnic groups.",
  "Chinese society greatly valued gastronomy, and developed an extensive study of the subject based on its traditional medical beliefs.",
  "There are a variety of styles of cooking in China, but Chinese chefs have classified eight regional cuisines according to their distinct tastes and local characteristics. ",
  "Based on the raw materials and ingredients used, the method of preparation and cultural differences, a variety of foods with different flavors and textures are prepared in different regions of the country. ",
  "American football, referred to as football in the United States and Canada and also known as gridiron,is a team sport played by two teams of eleven players on a rectangular field with goalposts at each end",
  "American football evolved in the United States, originating from the sports of soccer and rugby. The first American football match was played on November 6, 1869, between two college teams, Rutgers and Princeton, using rules based on the rules of soccer at the time.",
  "American football is the most popular sport in the United States. The most popular forms of the game are professional and college football, with the other major levels being high school and youth football. ",
  "In football, the winner is the team that has scored more points at the end of the game. There are multiple ways to score in a football game. ",
  "Football games last for a total of 60 minutes in professional and college play and are divided into two halves of 30 minutes and four quarters of 15 minutes."
)
# lower case and remove punctuation
raw_docs <- stringr::str_replace_all(tolower(raw_docs), "[:punct:]", "")

# remove stop words
stopwords_regex = paste(stopwords::stopwords('en'), collapse = '\\b|\\b')
stopwords_regex = paste0('\\b', stopwords_regex, '\\b')
raw_docs <- stringr::str_replace_all(raw_docs, stopwords_regex, '')

# remove the most frequent words: chinese, american, football
raw_docs <- stringr::str_replace_all(raw_docs, "chinese|american|football", '')
raw_docs[[1]]
## [1] " cuisine important part culture includes cuisine originating diverse regions china well overseas settled parts world"
# let us squish our text, removing extra spaces
raw_docs <- stringr::str_squish(raw_docs)

# segment each word, similar to tokenization
docs <- strsplit(raw_docs, split = " ")
docs[[1]]
## [1] "cuisine" "important" "part" "culture" "includes"
## [6] "cuisine" "originating" "diverse" "regions" "china"
## [11] "well" "overseas" "settled" "parts" "world"
# get a vocabulary of unique words in our corpus
vocab <- unique( unlist(docs) )

# represent words using numeric ids
# use the base match function, match(x, table):
# if x[i] is found to equal table[j], then the i-th position of the return value is j, for the smallest possible j
for( i in 1:length(docs) ) {
  docs[[i]] <- match( docs[[i]], vocab )
}
docs
## [[1]]
## [1] 1 2 3 4 5 1 6 7 8 9 10 11 12 13 14
##
## [[2]]
## [1] 15 16 17 18 19 20 21 22 23 24 25
##
## [[3]]
## [1] 26 27 28 29 30 31 32 33 34 35 36 37
##
## [[4]]
## [1] 38 39 17 9 40 41 42 43 44 45 46 47 48 49
##
## [[5]]
## [1] 34 50 51 52 53 54 55 56 21 38 57 58 59 60 61 58 8 62
##
## [[6]]
## [1] 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
##
## [[7]]
## [1] 81 64 65 6 82 83 84 85 86 72 87 88 89 73 90 74 91 92 93 94 34 94 83 95
##
## [[8]]
## [1] 96 71 64 65 96 97 98 99 90 100 101 102 103 104
##
## [[9]]
## [1] 105 70 106 107 80 98 108 109 110 98
##
## [[10]]
## [1] 111 112 113 114 115 99 90 116 117 73 118 119 115 120 121 122 115
In LDA, we have to specify the number of clusters (i.e., topics), usually denoted by K, before fitting. In this case, let us use 2.
Recall that Blei's article describes the generative process of LDA, which has several major steps.
knitr::include_graphics('figure1.png')
Here, let us first go through each document and randomly assign each word in the document to one of the K topics. This is the topic assignment process, shown on the right-hand side of Figure 1 in Blei's article.
Then we create a word-topic matrix, which is the count of each word being assigned to each topic. And a document-topic matrix, which is the number of words assigned to each topic for each document.
# number of topics (clusters)
K <- 2

# initialize count matrices
# @wt : word-topic count matrix
wt <- matrix( 0, nrow = K, ncol = length(vocab) )
colnames(wt) <- vocab

# @ta : topic assignment list
ta <- lapply( docs, function(x) rep( 0, length(x) ) )
names(ta) <- paste0( "doc", 1:length(docs) )

# @dt : document-topic count matrix, the number of words assigned to each topic for each document
dt <- matrix( 0, length(docs), K )
set.seed(2022)
for( d in 1:length(docs) ) {
  # randomly assign a topic to each word w
  for( w in 1:length( docs[[d]] ) ) {
    ta[[d]][w] <- sample(1:K, 1)

    # extract the topic index and word id, then update the corresponding cell in the word-topic count matrix
    ti <- ta[[d]][w]
    wi <- docs[[d]][w]
    wt[ti, wi] <- wt[ti, wi] + 1
    # josh's comment - the initial value of wt[ti, wi] is 0; because we assign a word to that topic, the count increases to 1
  }

  # count words in document d assigned to each topic t
  # josh's comment - dt is a container for document-topic counts
  for( t in 1:K ) {
    dt[d, t] <- sum( ta[[d]] == t )
  }
}
# randomly assigned topic to each word
print(ta)
## $doc1
## [1] 2 1 2 1 1 2 2 1 2 2 1 1 2 2 1
##
## $doc2
## [1] 1 2 1 2 2 1 1 1 2 1 2
##
## $doc3
## [1] 2 1 1 2 1 2 2 2 1 1 2 1
##
## $doc4
## [1] 1 2 1 2 2 2 1 2 1 2 2 2 1 2
##
## $doc5
## [1] 1 1 2 1 1 1 1 2 1 2 1 1 2 2 1 1 2 1
##
## $doc6
## [1] 2 1 2 2 2 1 2 1 2 2 2 2 1 1 1 2 1 2
##
## $doc7
## [1] 2 1 2 2 2 1 2 1 2 2 1 1 2 2 1 2 1 1 2 2 2 1 1 1
##
## $doc8
## [1] 1 2 2 1 1 1 1 1 1 2 1 2 1 2
##
## $doc9
## [1] 2 2 2 1 1 2 1 2 2 1
##
## $doc10
## [1] 2 2 2 1 1 1 2 1 1 1 2 1 1 1 1 1 2
print(wt)
## cuisine important part culture includes originating diverse regions china
## [1,] 0 1 0 1 1 0 1 0 0
## [2,] 2 0 1 0 0 2 0 2 2
## well overseas settled parts world preference seasoning cooking techniques
## [1,] 1 1 0 0 1 1 0 2 0
## [2,] 0 0 1 1 0 0 1 0 1
## provinces depend differences historical background ethnic groups society
## [1,] 0 1 2 1 0 1 0 0
## [2,] 1 0 0 0 1 0 1 1
## greatly valued gastronomy developed extensive study subject based
## [1,] 1 1 0 1 0 0 0 2
## [2,] 0 0 1 0 1 1 1 1
## traditional medical beliefs variety styles chefs classified eight regional
## [1,] 1 0 1 1 0 0 0 1 0
## [2,] 0 1 0 1 1 1 1 0 1
## cuisines according distinct tastes local characteristics raw materials
## [1,] 1 0 0 0 1 0 1 0
## [2,] 0 1 1 1 0 1 0 1
## ingredients used method preparation cultural foods different flavors
## [1,] 1 1 1 1 0 1 2 0
## [2,] 0 0 0 0 1 0 0 1
## textures prepared country referred united states canada also known
## [1,] 0 1 1 0 2 1 0 0 1
## [2,] 1 0 0 1 1 2 1 1 0
## gridironis team sport played two teams eleven players rectangular field
## [1,] 0 1 0 0 1 0 1 1 1 0
## [2,] 1 1 2 2 2 2 0 0 0 1
## goalposts end evolved sports soccer rugby first match november 6 1869
## [1,] 1 1 0 0 2 0 1 0 1 1 0
## [2,] 0 1 1 1 0 1 0 1 0 0 1
## college rutgers princeton using rules time popular forms game professional
## [1,] 2 1 1 0 1 1 2 1 2 2
## [2,] 1 0 0 1 1 0 0 0 1 0
## major levels high school youth winner scored points multiple ways score
## [1,] 0 1 0 1 0 0 0 1 1 0 0
## [2,] 1 0 1 0 1 1 1 0 0 1 1
## games last total 60 minutes play divided halves 30 four quarters 15
## [1,] 0 0 0 1 2 1 1 0 1 1 1 1
## [2,] 1 1 1 0 1 0 0 1 0 0 0 0
print(dt)
## [,1] [,2]
## [1,] 7 8
## [2,] 6 5
## [3,] 6 6
## [4,] 5 9
## [5,] 12 6
## [6,] 7 11
## [7,] 11 13
## [8,] 9 5
## [9,] 4 6
## [10,] 11 6
Notice that this random assignment already gives you both topic representations of all the documents and word distributions of all the topics, but they are bad ones! We need to improve and optimize them.
There are a couple of ways to do this, but we focus on the Gibbs sampling method, which performs the following steps for a user-specified number of iterations:
For each document d, go through each word w and reassign a new topic to w, drawing topic t with probability equal to "the probability of word w given topic t" \(\times\) "the probability of topic t given document d", denoted by the following expression:
\[ P( z_i = j \text{ }| \text{ } z_{-i}, w_i, d_i ) \propto \frac{ C^{WT}_{w_ij} + \eta }{ \sum^W_{ w = 1 }C^{WT}_{wj} + W\eta } \times \frac{ C^{DT}_{d_ij} + \alpha }{ \sum^T_{ t = 1 }C^{DT}_{d_it} + T\alpha } \]
This formula may look confusing, so let us walk through it bit by bit.
Starting from the left side of the equal sign:
- \(P(z_i = j)\) : The probability that token i is assigned to topic j.
- \(z_{-i}\) : Represents topic assignments of all other tokens.
- \(w_i\) : Word (index) of the \(i_{th}\) token.
- \(d_i\) : Document containing the \(i_{th}\) token.
For the right side of the equal sign:
- \(C^{WT}\) : Word-topic matrix, the wt matrix we generated.
- \(\sum^W_{ w = 1 }C^{WT}_{wj}\) : Total number of tokens (words) in each topic.
- \(C^{DT}\) : Document-topic matrix, the dt matrix we generated.
- \(\sum^T_{ t = 1 }C^{DT}_{d_it}\) : Total number of tokens (words) in document i.
- \(\eta\) : Parameter that sets the topic distribution for the words, the higher the more spread out the words will be across the specified number of topics (K).
- \(\alpha\) : Parameter that sets the topic distribution for the documents, the higher the more spread out the documents will be across the specified number of topics (K).
- \(W\) : Total number of words in the set of documents.
- \(T\) : Number of topics, equivalent of the K we defined earlier.
# parameters
alpha <- 1
eta <- 1

# initial topic assigned to the first word of the first document
# and its corresponding word id
t0 <- ta[[1]][1]
wid <- docs[[1]][1]

# z_-i means that we do not include token w in the word-topic and document-topic count matrices when sampling
# for token w; we only keep the topic assignments of all other tokens of document 1
dt[1, t0] <- dt[1, t0] - 1
wt[t0, wid] <- wt[t0, wid] - 1

# calculate the left side and right side of the equation
left <- ( wt[, wid] + eta ) / ( rowSums(wt) + length(vocab) * eta )
right <- ( dt[1, ] + alpha ) / ( sum( dt[1, ] ) + K * alpha )

# draw a new topic for the first word in the first document
# the optional prob argument gives a vector of weights for sampling the elements; the weights need not sum to one,
# but they should be non-negative and not all zero
t1 <- sample(1:K, 1, prob = left * right)
t1
## [1] 1
After the first iteration, the topic for the first word in the first document is updated to 1. Just remember that after drawing the new topic, we need to update the topic assignment list with the newly sampled topic for token w, and re-increment the word-topic and document-topic count matrices with that new topic.
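For concreteness, here is a minimal sketch of that update step, using the t1, wid, ta, dt, and wt objects created above.
# store the newly sampled topic for the first token of the first document
ta[[1]][1] <- t1
# re-increment the document-topic and word-topic counts with the new topic
dt[1, t1]   <- dt[1, t1] + 1
wt[t1, wid] <- wt[t1, wid] + 1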
We will use Ethen Liu's user-written function LDA1 as a demo to run some iterations. It takes the following parameters:
- docs: documents that have been converted to token (word) ids.
- vocab: unique tokens (words) across the whole document collection.
- K: number of topic groups.
- alpha and eta: distribution parameters, as explained earlier.
- iterations: number of Gibbs sampling iterations used to train the model.
It returns a list containing the final word-topic count matrix wt and the document-topic count matrix dt.
# define parameters
K <- 2
alpha <- 1
eta <- 0.001
iterations <- 1000

source("LDA_functions.R")
set.seed(2022)
lda1 <- LDA1( docs = docs, vocab = vocab,
              K = K, alpha = alpha, eta = eta, iterations = iterations )
lda1
## $wt
## cuisine important part culture includes originating diverse regions china
## [1,] 0 0 0 1 0 2 0 0 0
## [2,] 2 1 1 0 1 0 1 2 2
## well overseas settled parts world preference seasoning cooking techniques
## [1,] 0 0 0 0 0 0 0 0 0
## [2,] 1 1 1 1 1 1 1 2 1
## provinces depend differences historical background ethnic groups society
## [1,] 1 0 0 1 0 1 0 0
## [2,] 0 1 2 0 1 0 1 1
## greatly valued gastronomy developed extensive study subject based
## [1,] 1 1 1 1 1 1 1 3
## [2,] 0 0 0 0 0 0 0 0
## traditional medical beliefs variety styles chefs classified eight regional
## [1,] 1 1 1 0 0 0 0 0 0
## [2,] 0 0 0 2 1 1 1 1 1
## cuisines according distinct tastes local characteristics raw materials
## [1,] 0 0 0 0 0 0 0 0
## [2,] 1 1 1 1 1 1 1 1
## ingredients used method preparation cultural foods different flavors
## [1,] 0 0 0 0 0 0 0 0
## [2,] 1 1 1 1 1 1 2 1
## textures prepared country referred united states canada also known
## [1,] 0 0 0 1 3 3 1 1 1
## [2,] 1 1 1 0 0 0 0 0 0
## gridironis team sport played two teams eleven players rectangular field
## [1,] 1 2 2 2 3 2 1 1 1 1
## [2,] 0 0 0 0 0 0 0 0 0 0
## goalposts end evolved sports soccer rugby first match november 6 1869
## [1,] 1 2 0 1 0 0 0 0 1 1 1
## [2,] 0 0 1 0 2 1 1 1 0 0 0
## college rutgers princeton using rules time popular forms game professional
## [1,] 3 0 0 1 2 0 2 1 3 2
## [2,] 0 1 1 0 0 1 0 0 0 0
## major levels high school youth winner scored points multiple ways score
## [1,] 0 1 0 1 1 1 1 1 1 1 1
## [2,] 1 0 1 0 0 0 0 0 0 0 0
## games last total 60 minutes play divided halves 30 four quarters 15
## [1,] 0 1 0 1 3 1 1 1 1 1 1 0
## [2,] 1 0 1 0 0 0 0 0 0 0 0 1
##
## $dt
## [,1] [,2]
## [1,] 2 13
## [2,] 3 8
## [3,] 11 1
## [4,] 0 14
## [5,] 1 17
## [6,] 18 0
## [7,] 15 9
## [8,] 12 2
## [9,] 10 0
## [10,] 14 3
After we’re done with learning the topics for 1000 iterations, we can use the count matrices to obtain the word-topic distribution and document-topic distribution.
To compute the probability of word given topic:
\[\beta_{ij} = \frac{C^{WT}_{ij} + \eta}{\sum^W_{ k = 1 }C^{WT}_{kj} + W\eta}\]
Where \(\beta_{ij}\) is the probability of word i for topic j.
# topic probability of every word
( beta <- ( lda1$wt + eta ) / ( rowSums(lda1$wt) + length(vocab) * eta ) )
## cuisine important part culture includes
## [1,] 1.161143e-05 1.161143e-05 1.161143e-05 1.162305e-02 1.161143e-05
## [2,] 2.981139e-02 1.491314e-02 1.491314e-02 1.489824e-05 1.491314e-02
## originating diverse regions china well
## [1,] 2.323448e-02 1.161143e-05 1.161143e-05 1.161143e-05 1.161143e-05
## [2,] 1.489824e-05 1.491314e-02 2.981139e-02 2.981139e-02 1.491314e-02
## overseas settled parts world preference
## [1,] 1.161143e-05 1.161143e-05 1.161143e-05 1.161143e-05 1.161143e-05
## [2,] 1.491314e-02 1.491314e-02 1.491314e-02 1.491314e-02 1.491314e-02
## seasoning cooking techniques provinces depend
## [1,] 1.161143e-05 1.161143e-05 1.161143e-05 1.162305e-02 1.161143e-05
## [2,] 1.491314e-02 2.981139e-02 1.491314e-02 1.489824e-05 1.491314e-02
## differences historical background ethnic groups
## [1,] 1.161143e-05 1.162305e-02 1.161143e-05 1.162305e-02 1.161143e-05
## [2,] 2.981139e-02 1.489824e-05 1.491314e-02 1.489824e-05 1.491314e-02
## society greatly valued gastronomy developed
## [1,] 1.161143e-05 1.162305e-02 1.162305e-02 1.162305e-02 1.162305e-02
## [2,] 1.491314e-02 1.489824e-05 1.489824e-05 1.489824e-05 1.489824e-05
## extensive study subject based traditional
## [1,] 1.162305e-02 1.162305e-02 1.162305e-02 3.484592e-02 1.162305e-02
## [2,] 1.489824e-05 1.489824e-05 1.489824e-05 1.489824e-05 1.489824e-05
## medical beliefs variety styles chefs
## [1,] 1.162305e-02 1.162305e-02 1.161143e-05 1.161143e-05 1.161143e-05
## [2,] 1.489824e-05 1.489824e-05 2.981139e-02 1.491314e-02 1.491314e-02
## classified eight regional cuisines according
## [1,] 1.161143e-05 1.161143e-05 1.161143e-05 1.161143e-05 1.161143e-05
## [2,] 1.491314e-02 1.491314e-02 1.491314e-02 1.491314e-02 1.491314e-02
## distinct tastes local characteristics raw
## [1,] 1.161143e-05 1.161143e-05 1.161143e-05 1.161143e-05 1.161143e-05
## [2,] 1.491314e-02 1.491314e-02 1.491314e-02 1.491314e-02 1.491314e-02
## materials ingredients used method preparation
## [1,] 1.161143e-05 1.161143e-05 1.161143e-05 1.161143e-05 1.161143e-05
## [2,] 1.491314e-02 1.491314e-02 1.491314e-02 1.491314e-02 1.491314e-02
## cultural foods different flavors textures
## [1,] 1.161143e-05 1.161143e-05 1.161143e-05 1.161143e-05 1.161143e-05
## [2,] 1.491314e-02 1.491314e-02 2.981139e-02 1.491314e-02 1.491314e-02
## prepared country referred united states
## [1,] 1.161143e-05 1.161143e-05 1.162305e-02 3.484592e-02 3.484592e-02
## [2,] 1.491314e-02 1.491314e-02 1.489824e-05 1.489824e-05 1.489824e-05
## canada also known gridironis team
## [1,] 1.162305e-02 1.162305e-02 1.162305e-02 1.162305e-02 2.323448e-02
## [2,] 1.489824e-05 1.489824e-05 1.489824e-05 1.489824e-05 1.489824e-05
## sport played two teams eleven
## [1,] 2.323448e-02 2.323448e-02 3.484592e-02 2.323448e-02 1.162305e-02
## [2,] 1.489824e-05 1.489824e-05 1.489824e-05 1.489824e-05 1.489824e-05
## players rectangular field goalposts end
## [1,] 1.162305e-02 1.162305e-02 1.162305e-02 1.162305e-02 2.323448e-02
## [2,] 1.489824e-05 1.489824e-05 1.489824e-05 1.489824e-05 1.489824e-05
## evolved sports soccer rugby first
## [1,] 1.161143e-05 1.162305e-02 1.161143e-05 1.161143e-05 1.161143e-05
## [2,] 1.491314e-02 1.489824e-05 2.981139e-02 1.491314e-02 1.491314e-02
## match november 6 1869 college
## [1,] 1.161143e-05 1.162305e-02 1.162305e-02 1.162305e-02 3.484592e-02
## [2,] 1.491314e-02 1.489824e-05 1.489824e-05 1.489824e-05 1.489824e-05
## rutgers princeton using rules time
## [1,] 1.161143e-05 1.161143e-05 1.162305e-02 2.323448e-02 1.161143e-05
## [2,] 1.491314e-02 1.491314e-02 1.489824e-05 1.489824e-05 1.491314e-02
## popular forms game professional major
## [1,] 2.323448e-02 1.162305e-02 3.484592e-02 2.323448e-02 1.161143e-05
## [2,] 1.489824e-05 1.489824e-05 1.489824e-05 1.489824e-05 1.491314e-02
## levels high school youth winner
## [1,] 1.162305e-02 1.161143e-05 1.162305e-02 1.162305e-02 1.162305e-02
## [2,] 1.489824e-05 1.491314e-02 1.489824e-05 1.489824e-05 1.489824e-05
## scored points multiple ways score
## [1,] 1.162305e-02 1.162305e-02 1.162305e-02 1.162305e-02 1.162305e-02
## [2,] 1.489824e-05 1.489824e-05 1.489824e-05 1.489824e-05 1.489824e-05
## games last total 60 minutes
## [1,] 1.161143e-05 1.162305e-02 1.161143e-05 1.162305e-02 3.484592e-02
## [2,] 1.491314e-02 1.489824e-05 1.491314e-02 1.489824e-05 1.489824e-05
## play divided halves 30 four
## [1,] 1.162305e-02 1.162305e-02 1.162305e-02 1.162305e-02 1.162305e-02
## [2,] 1.489824e-05 1.489824e-05 1.489824e-05 1.489824e-05 1.489824e-05
## quarters 15
## [1,] 1.162305e-02 1.161143e-05
## [2,] 1.489824e-05 1.491314e-02
\[\theta_{dj} = \frac{C^{DT}_{dj} + \alpha}{\sum^T_{ k = 1 }C^{DT}_{dk} + T\alpha}\]
Where \(\theta_{dj}\) is the proportion of topic j in document d.
# topic probability of every document
( theta <- ( lda1$dt + alpha ) / ( rowSums(lda1$dt) + K * alpha ) )
## [,1] [,2]
## [1,] 0.1764706 0.82352941
## [2,] 0.3076923 0.69230769
## [3,] 0.8571429 0.14285714
## [4,] 0.0625000 0.93750000
## [5,] 0.1000000 0.90000000
## [6,] 0.9500000 0.05000000
## [7,] 0.6153846 0.38461538
## [8,] 0.8125000 0.18750000
## [9,] 0.9166667 0.08333333
## [10,] 0.7894737 0.21052632
Recall that LDA assumes that each document is a mixture of all topics. Thus, after computing the probability that each document belongs to each topic (and likewise for words and topics), we can use this information to see which topic each document belongs to and which words are most likely associated with each topic. For more details on Gibbs sampling, see Griffiths and Steyvers (2004), Finding Scientific Topics.
# topic assigned to each document: the one with the highest probability
topic <- apply(theta, 1, which.max)

# most probable words under each topic
# sort the probabilities and keep the user-specified number n
Terms <- function(beta, n) {
  term <- matrix(0, n, K)
  for( p in 1:nrow(beta) ) {
    term[, p] <- names( sort( beta[p, ], decreasing = TRUE )[1:n] )
  }
  return(term)
}
term <- Terms(beta = beta, n = 2)
We specified that we wanted to see the top 2 terms associated with each topic. The following section prints out the original raw documents, grouped into the 2 topics we specified, along with the words most likely to go with each topic.
list( original_text = raw_docs[topic == 1], words = term[, 1] )
## $original_text
## [1] "society greatly valued gastronomy developed extensive study subject based traditional medical beliefs"
## [2] "referred united states canada also known gridironis team sport played two teams eleven players rectangular field goalposts end"
## [3] "evolved united states originating sports soccer rugby first match played november 6 1869 two college teams rutgers princeton using rules based rules soccer time"
## [4] "popular sport united states popular forms game professional college major levels high school youth"
## [5] "winner team scored points end game multiple ways score game"
## [6] "games last total 60 minutes professional college play divided two halves 30 minutes four quarters 15 minutes"
##
## $words
## [1] "based" "united"
list( original_text = raw_docs[topic == 2], words = term[, 2] )
## $original_text
## [1] "cuisine important part culture includes cuisine originating diverse regions china well overseas settled parts world"
## [2] "preference seasoning cooking techniques provinces depend differences historical background ethnic groups"
## [3] "variety styles cooking china chefs classified eight regional cuisines according distinct tastes local characteristics"
## [4] "based raw materials ingredients used method preparation cultural differences variety foods different flavors textures prepared different regions country"
##
## $words
## [1] "cuisine" "regions"
The output tells us that the first topic seems to be discussing something about the United States, while the second is about food. It is still messy and not that intuitive, but at least it is a good starting point.
Now let us move on to using the R library topicmodels to fit an LDA.
Since the starting point of Gibbs sampling is chosen randomly, it makes sense to discard the first few iterations (the burn-in period), because they most likely do not reflect the properties of the target distribution. Another parameter is thin, the number of iterations omitted between retained samples during training; it serves to reduce correlations between successive samples.
We'll use the LDA function from the topicmodels library to run the Gibbs sampling method on the same set of raw documents and print out the result for you to compare. Note that the library has a default value of 50 / K for \(\alpha\) and 0.1 for \(\eta\).
# load packages; if not installed, use install.packages("topicmodels")
library(tm)
library(topicmodels)

# @burnin : number of omitted Gibbs iterations at the beginning
# @thin : number of omitted in-between Gibbs iterations
docs1 <- Corpus( VectorSource(raw_docs) )
dtm <- DocumentTermMatrix(docs1)
# josh's comment - the input of LDA is a document-term matrix. You can use tm::DocumentTermMatrix to create it,
# but you can also build one with the tidytext or quanteda packages.
lda <- LDA( dtm, k = 2, method = "Gibbs",
            control = list(seed = 2022, burnin = 500, thin = 100, iter = 4000) )
list( original_text = raw_docs[ topics(lda) == 1 ], words = terms(lda, 3)[, 1] )
## $original_text
## [1] "preference seasoning cooking techniques provinces depend differences historical background ethnic groups"
## [2] "variety styles cooking china chefs classified eight regional cuisines according distinct tastes local characteristics"
## [3] "referred united states canada also known gridironis team sport played two teams eleven players rectangular field goalposts end"
## [4] "winner team scored points end game multiple ways score game"
## [5] "games last total 60 minutes professional college play divided two halves 30 minutes four quarters 15 minutes"
##
## $words
## [1] "game" "minutes" "cuisine"
list( original_text = raw_docs[ topics(lda) == 2 ], words = terms(lda, 3)[, 2] )
## $original_text
## [1] "cuisine important part culture includes cuisine originating diverse regions china well overseas settled parts world"
## [2] "society greatly valued gastronomy developed extensive study subject based traditional medical beliefs"
## [3] "based raw materials ingredients used method preparation cultural differences variety foods different flavors textures prepared different regions country"
## [4] "evolved united states originating sports soccer rugby first match played november 6 1869 two college teams rutgers princeton using rules based rules soccer time"
## [5] "popular sport united states popular forms game professional college major levels high school youth"
##
## $words
## [1] "based" "states" "two"
Notice that after training the model for 4000 iterations and using different \(\alpha\) and \(\eta\) values, we obtained a different document clustering and different words most likely associated with each topic. Since the goal here is to perform a clustering (unsupervised) method to unveil unknown patterns, the solutions will most likely differ, as there is no single correct answer. We should try a range of different values of K to find the optimal topic grouping of the set of documents and see which result matches our intuition more.
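One rough way to compare candidate values of K (a sketch, not a definitive model-selection procedure) is to refit the topicmodels LDA for several K on the same dtm and compare log-likelihoods; since more topics almost always fit the training data better, interpretability and held-out performance should also guide the final choice.
# fit LDA with a few different numbers of topics and compare fit
ks   <- c(2, 3, 4, 5)
fits <- lapply(ks, function(k)
  LDA(dtm, k = k, method = "Gibbs",
      control = list(seed = 2022, burnin = 500, thin = 100, iter = 4000)))
data.frame(K = ks, logLik = sapply(fits, logLik))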
Structural Topic Model
In this part we rely heavily on the stm tutorial by Molly Roberts, Brandon Stewart, and Dustin Tingley, and an application by [Julia Silge](https://juliasilge.com/blog/sherlock-holmes-stm/). We will go through their tutorial and show you how to fit an STM in R using the stm library.
Let us install stm first.
#library(devtools)
#install_github("bstewart/stm",dependencies=TRUE)
library(stm)
We follow the stm tutorial (see also http://reports-archive.adm.cs.cmu.edu/anon/ml2010/CMU-ML-10-101.pdf), but we use our nyt dataset.
We need to merge it with the doca data to retrieve some metadata, such as the publishing year of each article.
library(haven)
doca <- read_dta("./final_data_v10.dta", encoding = "latin1")

# because doca is event-level data, we want article-level data
doca_article <- doca %>%
  transmute(title = tolower(title),
            rptyy = as.numeric(rptyy),
            rptmm,
            section_a = ifelse(section %in% c("a", "A"), 1, 0),
            page, paragrph) %>%
  distinct(title, .keep_all = TRUE)

data <- doca_nyt %>%
  mutate(title_doca = tolower(title_doca)) %>%
  left_join(doca_article, by = c("title_doca" = "title"))

rm(doca_article)
rm(doca_nyt)
rm(doca)
Before we run topic models like LDA, we need to preprocess the data. STM provides several functions that automatically do stemming, stopword removal, low-frequency word removal, etc. for you.
Of course, you can also directly feed a previously created dtm to the stm function (see the sketch below), but let us use stm's processor first.
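For reference, here is a brief sketch of that alternative route using the my_dfm object built earlier: quanteda's convert() can translate a dfm into the documents/vocab/meta format that stm expects.
# convert the quanteda dfm from the earlier section into stm's input format
stm_input <- quanteda::convert(my_dfm, to = "stm")
# stm_input$documents, stm_input$vocab, and stm_input$meta can then be passed to stm()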
Here is the graph of stm processors:
knitr::include_graphics('figure2.png')
Let us use the textProcessor to preprocess texts. Here is the function:
textProcessor(documents, metadata = NULL, lowercase = TRUE, removestopwords = TRUE, removenumbers = TRUE, removepunctuation = TRUE, ucp = FALSE, stem = TRUE, wordLengths = c(3, Inf), sparselevel = 1, language = “en”, verbose = TRUE, onlycharacter = FALSE, striphtml = FALSE, customstopwords = NULL, custompunctuation = NULL, v1 = FALSE)
# Preprocessing
# stemming/stopword removal, etc.
# Josh's comment - if you don't know the details of a function, you can use ? to check its documentation: ?textProcessor
processed <- textProcessor(data$text, metadata = data)
## Building corpus...
## Converting to Lower Case...
## Removing punctuation...
## Removing stopwords...
## Removing numbers...
## Stemming...
## Creating Output...
Let us use prepDocuments to perform several corpus manipulations, including removing words and renumbering word indices. Here is the function:
prepDocuments(documents, vocab, meta = NULL, lower.thresh = 1, upper.thresh = Inf, subsample = NULL, verbose = TRUE)
# before running prepDocuments, you can use the plotRemoved function to check an appropriate threshold for removing words or documents
# take a look at how many words and documents would be removed with different lower.thresh values
# (if you see "could not find function 'plotRemoved'", check that the stm package is loaded)
plotRemoved(processed$documents, lower.thresh = seq(1, 200, by = 100))

# structure and index for usage in the stm model. Verify no missingness.
# You can remove low-frequency words using the 'lower.thresh' option. See ?prepDocuments for more info.
out <- prepDocuments(processed$documents, processed$vocab, processed$meta, lower.thresh = 1)
## Removing 34336 of 52272 terms (34336 of 433059 tokens) due to frequency
## Your corpus now has 2000 documents, 17936 terms and 398723 tokens.
# the output will have objects meta, documents, and vocab
docs <- out$documents
vocab <- out$vocab
meta <- out$meta
Now, let us use the stm function to fit an STM model.
The function takes sparse representation of a document-term matrix, an integer number of topics, and covariates and returns fitted model parameters. Covariates can be used in the prior for topic prevalence, in the prior for topical content or both.
stm(documents, vocab, K, prevalence = NULL, content = NULL, data = NULL, init.type = c(“Spectral”, “LDA”, “Random”, “Custom”), seed = NULL, max.em.its = 500, emtol = 1e-05, verbose = TRUE, reportevery = 5, LDAbeta = TRUE, interactions = TRUE, ngroups = 1, model = NULL, gamma.prior = c(“Pooled”, “L1”), sigma.prior = 0, kappa.prior = c(“L1”, “Jeffreys”), control = list())
# run an stm model using the 'out' data with 20 topics. We ask how the prevalence of topics varies across
# documents' metadata, including publishing year. The option s(rptyy) applies a spline to the year variable.
# We also include section_a, indicating whether the article was published in section A.
# max.em.its should be at least 100; we use 100 here just as a demo.
stmFit <- stm(out$documents, out$vocab,
              K = 20,
              prevalence =~ section_a + s(rptyy),
              max.em.its = 100,
              data = out$meta,
              seed = 2022,
              verbose = FALSE)
Like LDA, stm also needs the number of topics or themes (K) to be specified before fitting. Fortunately, stm provides a function, selectModel, to help you select models with high likelihood values.
selectModel(documents, vocab, K, prevalence = NULL, content = NULL, data = NULL, max.em.its = 100, verbose = TRUE, init.type = “LDA”, emtol = 1e-05, seed = NULL, runs = 50, frexw = 0.7, net.max.em.its = 2, netverbose = FALSE, M = 10, N = NULL, to.disk = F, …)
# let STM help you compare a number of models side by side. It will keep the models that don't stink (i.e., that converge quickly)
stmSelect <- selectModel(out$documents,
                         out$vocab,
                         K = 20,
                         prevalence =~ s(rptyy) + section_a,
                         max.em.its = 20, # use 20 as a demo
                         data = meta,
                         runs = 20,
                         seed = 2022)
## Casting net
## 1 models in net
## 2 models in net
## 3 models in net
## 4 models in net
## 5 models in net
## 6 models in net
## 7 models in net
## 8 models in net
## 9 models in net
## 10 models in net
## 11 models in net
## 12 models in net
## 13 models in net
## 14 models in net
## 15 models in net
## 16 models in net
## 17 models in net
## 18 models in net
## 19 models in net
## 20 models in net
## Running select models
## 1 select model run
## Beginning LDA Initialization
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 1 (approx. per word bound = -7.650)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 2 (approx. per word bound = -7.641, relative change = 1.055e-03)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 3 (approx. per word bound = -7.635, relative change = 8.683e-04)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 4 (approx. per word bound = -7.627, relative change = 9.730e-04)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 5 (approx. per word bound = -7.619, relative change = 1.057e-03)
## Topic 1: school, board, children, teacher, high
## Topic 2: said, black, year, program, educ
## Topic 3: said, church, right, civil, leader
## Topic 4: street, park, one, war, anoth
## Topic 5: said, state, servic, year, new
## Topic 6: said, charg, offic, kill, two
## Topic 7: said, york, new, hospit, guard
## Topic 8: negro, hous, white, build, picket
## Topic 9: peopl, group, like, said, year
## Topic 10: citi, plan, communiti, peopl, resid
## Topic 11: protest, demonstr, march, new, york
## Topic 12: said, union, member, committe, group
## Topic 13: polic, offic, black, arrest, said
## Topic 14: york, time, permiss, new, file
## Topic 15: court, case, law, judg, state
## Topic 16: student, univers, colleg, campus, class
## Topic 17: prison, order, fire, counti, time
## Topic 18: presid, project, nation, justic, washington
## Topic 19: state, new, york, support, time
## Topic 20: said, aid, new, depart, administr
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 6 (approx. per word bound = -7.611, relative change = 1.110e-03)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 7 (approx. per word bound = -7.602, relative change = 1.150e-03)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 8 (approx. per word bound = -7.593, relative change = 1.215e-03)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 9 (approx. per word bound = -7.583, relative change = 1.249e-03)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 10 (approx. per word bound = -7.574, relative change = 1.254e-03)
## Topic 1: school, board, children, teacher, parent
## Topic 2: black, said, year, program, educ
## Topic 3: said, church, right, civil, leader
## Topic 4: street, park, one, war, east
## Topic 5: said, state, servic, year, increas
## Topic 6: said, charg, offic, kill, trial
## Topic 7: said, new, york, hospit, year
## Topic 8: negro, white, hous, build, picket
## Topic 9: peopl, like, group, said, year
## Topic 10: citi, plan, communiti, peopl, resid
## Topic 11: protest, demonstr, march, new, york
## Topic 12: said, member, union, group, committe
## Topic 13: polic, offic, arrest, black, said
## Topic 14: york, time, new, permiss, file
## Topic 15: court, case, law, judg, state
## Topic 16: student, univers, colleg, campus, faculti
## Topic 17: prison, fire, order, jail, chief
## Topic 18: presid, project, nation, washington, also
## Topic 19: state, new, york, support, polit
## Topic 20: said, depart, aid, administr, new
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 11 (approx. per word bound = -7.565, relative change = 1.232e-03)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 12 (approx. per word bound = -7.556, relative change = 1.195e-03)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 13 (approx. per word bound = -7.547, relative change = 1.150e-03)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 14 (approx. per word bound = -7.539, relative change = 1.106e-03)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 15 (approx. per word bound = -7.530, relative change = 1.062e-03)
## Topic 1: school, board, children, teacher, high
## Topic 2: black, said, year, program, report
## Topic 3: said, church, right, civil, rev
## Topic 4: street, park, war, one, ralli
## Topic 5: said, state, servic, increas, year
## Topic 6: said, charg, offic, kill, trial
## Topic 7: said, hospit, new, year, york
## Topic 8: negro, white, hous, picket, build
## Topic 9: peopl, said, like, year, group
## Topic 10: citi, plan, communiti, peopl, resid
## Topic 11: protest, demonstr, march, new, york
## Topic 12: said, member, group, union, committe
## Topic 13: polic, offic, said, arrest, black
## Topic 14: york, time, new, permiss, file
## Topic 15: court, case, law, judg, state
## Topic 16: student, univers, colleg, campus, faculti
## Topic 17: prison, fire, order, jail, chief
## Topic 18: presid, nation, washington, project, hous
## Topic 19: state, new, york, support, polit
## Topic 20: said, depart, aid, meet, new
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 16 (approx. per word bound = -7.523, relative change = 1.027e-03)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 17 (approx. per word bound = -7.515, relative change = 9.936e-04)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 18 (approx. per word bound = -7.508, relative change = 9.545e-04)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 19 (approx. per word bound = -7.501, relative change = 9.135e-04)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Model Terminated Before Convergence Reached
## 2 select model run
## Beginning LDA Initialization
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 1 (approx. per word bound = -7.649)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 2 (approx. per word bound = -7.641, relative change = 1.017e-03)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 3 (approx. per word bound = -7.635, relative change = 8.320e-04)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 4 (approx. per word bound = -7.628, relative change = 9.426e-04)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 5 (approx. per word bound = -7.620, relative change = 1.041e-03)
## Topic 1: black, said, minor, member, issu
## Topic 2: negro, time, permiss, york, new
## Topic 3: said, peopl, hous, see, live
## Topic 4: protest, demonstr, student, demand, build
## Topic 5: student, univers, colleg, said, campus
## Topic 6: law, state, case, rule, year
## Topic 7: court, right, feder, judg, state
## Topic 8: school, board, educ, high, teacher
## Topic 9: said, new, get, one, town
## Topic 10: charg, prison, time, offic, investig
## Topic 11: citi, plan, said, program, new
## Topic 12: nation, unit, group, american, new
## Topic 13: said, communiti, resid, fire, citi
## Topic 14: said, york, new, permiss, union
## Topic 15: white, black, last, citi, mayor
## Topic 16: new, time, york, state, work
## Topic 17: york, new, time, group, presid
## Topic 18: said, year, say, use, last
## Topic 19: church, women, year, member, mani
## Topic 20: polic, street, arrest, march, offic
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 6 (approx. per word bound = -7.612, relative change = 1.113e-03)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 7 (approx. per word bound = -7.603, relative change = 1.169e-03)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 8 (approx. per word bound = -7.593, relative change = 1.211e-03)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 9 (approx. per word bound = -7.584, relative change = 1.214e-03)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 10 (approx. per word bound = -7.575, relative change = 1.188e-03)
## Topic 1: black, said, minor, member, issu
## Topic 2: negro, time, permiss, york, new
## Topic 3: said, peopl, hous, see, can
## Topic 4: protest, demonstr, demand, student, build
## Topic 5: student, univers, colleg, campus, said
## Topic 6: law, case, state, rule, year
## Topic 7: court, feder, right, judg, state
## Topic 8: school, board, educ, teacher, high
## Topic 9: said, new, get, one, town
## Topic 10: charg, prison, offic, time, investig
## Topic 11: citi, plan, said, program, new
## Topic 12: nation, group, unit, state, right
## Topic 13: said, communiti, resid, fire, brooklyn
## Topic 14: said, york, new, permiss, union
## Topic 15: white, black, citi, last, mayor
## Topic 16: new, time, york, state, work
## Topic 17: york, new, time, group, presid
## Topic 18: say, use, year, said, last
## Topic 19: church, women, year, member, mani
## Topic 20: polic, arrest, street, march, offic
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 11 (approx. per word bound = -7.566, relative change = 1.165e-03)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 12 (approx. per word bound = -7.558, relative change = 1.139e-03)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 13 (approx. per word bound = -7.549, relative change = 1.101e-03)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 14 (approx. per word bound = -7.541, relative change = 1.062e-03)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 15 (approx. per word bound = -7.534, relative change = 1.026e-03)
## Topic 1: black, said, member, minor, group
## Topic 2: time, negro, new, york, permiss
## Topic 3: said, peopl, hous, see, can
## Topic 4: protest, demonstr, demand, build, south
## Topic 5: student, univers, colleg, campus, said
## Topic 6: law, state, case, rule, year
## Topic 7: court, feder, judg, right, state
## Topic 8: school, board, educ, teacher, high
## Topic 9: said, new, get, one, town
## Topic 10: charg, prison, offic, time, investig
## Topic 11: citi, plan, said, program, new
## Topic 12: nation, group, unit, state, right
## Topic 13: said, communiti, resid, build, fire
## Topic 14: said, york, new, permiss, time
## Topic 15: white, black, citi, night, mayor
## Topic 16: new, time, york, state, indian
## Topic 17: york, new, time, group, presid
## Topic 18: use, say, area, last, year
## Topic 19: church, women, year, member, mani
## Topic 20: polic, arrest, street, march, said
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 16 (approx. per word bound = -7.526, relative change = 9.921e-04)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 17 (approx. per word bound = -7.519, relative change = 9.594e-04)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 18 (approx. per word bound = -7.512, relative change = 9.242e-04)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 19 (approx. per word bound = -7.505, relative change = 8.858e-04)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Model Terminated Before Convergence Reached
## 3 select model run
## Beginning LDA Initialization
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 1 (approx. per word bound = -7.650)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 2 (approx. per word bound = -7.642, relative change = 1.035e-03)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 3 (approx. per word bound = -7.635, relative change = 8.411e-04)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 4 (approx. per word bound = -7.628, relative change = 9.505e-04)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 5 (approx. per word bound = -7.620, relative change = 1.057e-03)
## Topic 1: church, men, york, time, one
## Topic 2: said, offic, two, report, time
## Topic 3: york, new, permiss, time, protest
## Topic 4: negro, white, picket, march, time
## Topic 5: said, one, last, hous, year
## Topic 6: say, peopl, aid, new, said
## Topic 7: demonstr, ralli, protest, march, war
## Topic 8: student, univers, campus, demand, faculti
## Topic 9: polic, arrest, street, charg, youth
## Topic 10: said, hous, home, year, high
## Topic 11: court, case, feder, judg, right
## Topic 12: state, law, rule, prison, abort
## Topic 13: school, board, educ, children, teacher
## Topic 14: resid, strike, offici, use, counti
## Topic 15: citi, build, street, park, avenu
## Topic 16: group, nation, organ, presid, new
## Topic 17: black, said, white, communiti, racial
## Topic 18: state, plan, new, year, increas
## Topic 19: colleg, said, women, program, depart
## Topic 20: said, union, compani, new, council
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 6 (approx. per word bound = -7.611, relative change = 1.156e-03)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 7 (approx. per word bound = -7.602, relative change = 1.251e-03)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 8 (approx. per word bound = -7.592, relative change = 1.327e-03)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 9 (approx. per word bound = -7.581, relative change = 1.371e-03)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 10 (approx. per word bound = -7.571, relative change = 1.382e-03)
## Topic 1: church, men, york, peopl, chicago
## Topic 2: said, offic, report, two, fire
## Topic 3: york, new, time, permiss, protest
## Topic 4: negro, white, picket, today, march
## Topic 5: said, one, year, last, live
## Topic 6: say, peopl, aid, said, like
## Topic 7: demonstr, ralli, protest, march, war
## Topic 8: student, univers, campus, faculti, demand
## Topic 9: polic, arrest, street, youth, charg
## Topic 10: said, hous, home, high, year
## Topic 11: court, case, feder, judg, right
## Topic 12: state, law, rule, prison, abort
## Topic 13: school, board, educ, children, teacher
## Topic 14: resid, strike, counti, offici, use
## Topic 15: citi, build, street, park, avenu
## Topic 16: group, nation, organ, presid, right
## Topic 17: black, white, said, communiti, racial
## Topic 18: state, plan, new, year, increas
## Topic 19: colleg, said, women, program, depart
## Topic 20: said, union, compani, new, polici
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 11 (approx. per word bound = -7.560, relative change = 1.371e-03)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 12 (approx. per word bound = -7.550, relative change = 1.363e-03)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 13 (approx. per word bound = -7.540, relative change = 1.352e-03)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 14 (approx. per word bound = -7.530, relative change = 1.289e-03)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 15 (approx. per word bound = -7.521, relative change = 1.192e-03)
## Topic 1: church, men, chicago, peopl, jewish
## Topic 2: said, offic, report, fire, incid
## Topic 3: york, new, time, permiss, protest
## Topic 4: negro, white, picket, today, march
## Topic 5: said, one, year, last, live
## Topic 6: say, peopl, aid, said, like
## Topic 7: demonstr, ralli, protest, march, war
## Topic 8: student, univers, colleg, campus, faculti
## Topic 9: polic, arrest, street, youth, charg
## Topic 10: said, hous, home, year, want
## Topic 11: court, case, feder, judg, right
## Topic 12: state, law, rule, prison, abort
## Topic 13: school, board, teacher, educ, parent
## Topic 14: resid, strike, counti, plant, nuclear
## Topic 15: citi, build, street, park, avenu
## Topic 16: group, nation, organ, presid, right
## Topic 17: black, white, said, communiti, racial
## Topic 18: state, plan, new, increas, year
## Topic 19: said, women, program, depart, member
## Topic 20: said, union, compani, new, polici
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 16 (approx. per word bound = -7.513, relative change = 1.104e-03)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 17 (approx. per word bound = -7.505, relative change = 1.041e-03)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 18 (approx. per word bound = -7.498, relative change = 9.986e-04)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 19 (approx. per word bound = -7.490, relative change = 9.628e-04)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Model Terminated Before Convergence Reached
## 4 select model run
## Beginning LDA Initialization
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 1 (approx. per word bound = -7.651)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 2 (approx. per word bound = -7.643, relative change = 1.016e-03)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 3 (approx. per word bound = -7.636, relative change = 8.522e-04)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 4 (approx. per word bound = -7.629, relative change = 9.681e-04)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 5 (approx. per word bound = -7.621, relative change = 1.056e-03)
## Topic 1: church, york, call, group, time
## Topic 2: court, case, right, decis, justic
## Topic 3: say, one, member, like, can
## Topic 4: will, said, plan, new, govern
## Topic 5: said, state, women, rule, program
## Topic 6: school, board, educ, high, children
## Topic 7: compani, committe, organ, public, leader
## Topic 8: group, law, said, york, support
## Topic 9: street, said, avenu, youth, yesterday
## Topic 10: student, univers, colleg, campus, black
## Topic 11: judg, said, counti, charg, time
## Topic 12: will, meet, citi, teacher, mayor
## Topic 13: new, york, nation, right, time
## Topic 14: said, home, area, new, time
## Topic 15: citi, build, prison, offic, new
## Topic 16: polic, black, white, offic, said
## Topic 17: said, peopl, hous, year, resid
## Topic 18: state, new, said, abort, unit
## Topic 19: time, permiss, york, negro, new
## Topic 20: demonstr, protest, arrest, march, today
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 6 (approx. per word bound = -7.612, relative change = 1.145e-03)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 7 (approx. per word bound = -7.603, relative change = 1.197e-03)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 8 (approx. per word bound = -7.594, relative change = 1.196e-03)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 9 (approx. per word bound = -7.585, relative change = 1.171e-03)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 10 (approx. per word bound = -7.577, relative change = 1.147e-03)
## Topic 1: church, call, york, group, ask
## Topic 2: court, case, right, decis, justic
## Topic 3: say, one, member, like, can
## Topic 4: will, new, plan, govern, said
## Topic 5: said, state, women, rule, program
## Topic 6: school, board, educ, children, high
## Topic 7: committe, compani, public, organ, action
## Topic 8: group, law, said, support, american
## Topic 9: street, said, avenu, youth, yesterday
## Topic 10: student, univers, colleg, campus, black
## Topic 11: judg, said, counti, charg, time
## Topic 12: citi, will, meet, teacher, plan
## Topic 13: new, york, nation, right, time
## Topic 14: said, home, area, new, year
## Topic 15: citi, build, prison, offic, strike
## Topic 16: polic, black, white, offic, said
## Topic 17: said, peopl, hous, year, resid
## Topic 18: state, new, abort, said, unit
## Topic 19: time, york, permiss, new, negro
## Topic 20: demonstr, protest, arrest, march, today
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 11 (approx. per word bound = -7.568, relative change = 1.124e-03)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 12 (approx. per word bound = -7.560, relative change = 1.104e-03)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 13 (approx. per word bound = -7.551, relative change = 1.092e-03)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 14 (approx. per word bound = -7.543, relative change = 1.082e-03)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 15 (approx. per word bound = -7.535, relative change = 1.063e-03)
## Topic 1: church, call, york, council, ask
## Topic 2: court, case, right, decis, law
## Topic 3: say, one, member, like, can
## Topic 4: will, new, govern, plan, use
## Topic 5: state, said, women, program, suit
## Topic 6: school, board, educ, parent, children
## Topic 7: committe, public, compani, organ, leader
## Topic 8: group, said, support, american, york
## Topic 9: street, said, avenu, youth, yesterday
## Topic 10: student, univers, colleg, campus, black
## Topic 11: judg, said, counti, charg, time
## Topic 12: citi, will, meet, plan, mayor
## Topic 13: new, nation, right, york, unit
## Topic 14: said, home, area, new, year
## Topic 15: citi, build, prison, offic, strike
## Topic 16: polic, black, white, said, offic
## Topic 17: said, peopl, hous, year, resid
## Topic 18: state, new, abort, said, unit
## Topic 19: time, york, permiss, new, negro
## Topic 20: demonstr, protest, arrest, march, today
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 16 (approx. per word bound = -7.527, relative change = 1.036e-03)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 17 (approx. per word bound = -7.520, relative change = 1.007e-03)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 18 (approx. per word bound = -7.512, relative change = 9.793e-04)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 19 (approx. per word bound = -7.505, relative change = 9.524e-04)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Model Terminated Before Convergence Reached
#plot the different models that make the cut along exclusivity and semantic coherence of their topics
plotModels(stmSelect)
#the 3rd one looks best, so choose it and give it the name stmFit
stmFit <- stmSelect$runout[[3]] #choose the third model
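If you prefer numbers to eyeballing the plot, you can also compare the retained runs on semantic coherence and exclusivity directly. This is a minimal sketch using stm's semanticCoherence() and exclusivity() helpers; it assumes the out object (from prepDocuments) used elsewhere in this lab is still in memory.
#numeric comparison of the retained runs (sketch): mean coherence and exclusivity per run
sapply(stmSelect$runout, function(m) {
  c(coherence   = mean(semanticCoherence(m, out$documents)),
    exclusivity = mean(exclusivity(m)))
})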
Now it is time to interpret the stm model.
###LIST OF TOP WORDS for topics 1, 7, & 10
labelTopics(stmFit, c(1, 7, 10))
## Topic 1 Top Words:
## Highest Prob: church, chicago, said, jewish, cathol, peopl, religi
## FREX: church, congreg, jewish, pastor, synagogu, rabbi, lutheran
## Lift: schindler, carv, gahagen, schism, passov, theologian, judaism
## Score: church, jewish, chicago, jew, rabbi, christian, religi
## Topic 7 Top Words:
## Highest Prob: demonstr, ralli, protest, march, war, peac, member
## FREX: ralli, vietnam, parad, antiwar, peac, slogan, premier
## Lift: counterpicket, —assail, ‘red, “face, crossbon, demonstrator’, dgugla
## Score: demonstr, ralli, protest, war, vietnam, peac, march
## Topic 10 Top Words:
## Highest Prob: said, hous, home, week, want, citi, year
## FREX: don’t, know, queen, home, mrs, poor, nurs
## Lift: anymor, “town, ‘‘’m, polo, yank, suburbia, sinkler
## Score: hous, home, said, mayor, mrs, don’t, want
Let us draw a word cloud for one topic, though I am not suggesting you do this in your published research.
###WORDCLOUD for a specified TOPIC
cloud(stmFit, topic=7)
Let us find the texts that are most representative of a particular topic using the findThoughts() function:
It outputs the most representative documents for a particular topic. Use it to get a better sense of what documents with high topical content actually say.
findThoughts(model, texts = NULL, topics = NULL, n = 3, thresh = NULL, where = NULL, meta = NULL)
#object 'thoughts1' contains 2 documents about topic 1. 'texts=data$shortdoc' gives you just the first 250 characters of each article
data <- data %>%
  mutate(shortdoc = text %>%
           str_replace_all("\\n", "") %>%
           str_extract("^.{250}"))
thoughts1 <- findThoughts(stmFit,
                          texts = data$shortdoc,
                          n = 2,
                          topics = 1)$docs[[1]]
#will show you the output
plotQuote(thoughts1, width=40, main="Topic 1")
Let us find more documents for additional topics.
#how about more documents for more of these topics?
thoughts7 <- findThoughts(stmFit,
                          texts = data$shortdoc,
                          n = 2,
                          topics = 7)$docs[[1]]
thoughts10 <- findThoughts(stmFit,
                           texts = data$shortdoc,
                           n = 2,
                           topics = 10)$docs[[1]]
thoughts4 <- findThoughts(stmFit,
                          texts = data$shortdoc,
                          n = 2,
                          topics = 4)$docs[[1]]
#And in a 2X2 table? We like 2X2 tables! --- Note: this command will force all remaining plots into a 2X2 table format
par(mfrow = c(2, 2),mar=c(.5,.5,1,.5))
plotQuote(thoughts1, width=50, main="Topic 1")
plotQuote(thoughts4, width=50, main="Topic 4")
plotQuote(thoughts7, width=50, main="Topic 7")
plotQuote(thoughts10, width=50, main="Topic 10")
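One housekeeping note: the par(mfrow = c(2, 2)) setting persists for later base-graphics plots, so you may want to reset it once you are done with the 2X2 layout. A small sketch restoring the default single panel and margins:
#reset the plotting layout to a single panel with default margins
par(mfrow = c(1, 1), mar = c(5, 4, 4, 2) + 0.1)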
Let us see PROPORTION OF EACH TOPIC in the entire CORPUS.
## Just insert your STM output
plot.STM(stmFit, type="summary", n=5,xlim=c(0,.4))
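If you want the numbers behind this summary plot, the same expected topic proportions can be computed from the fitted model, since each row of stmFit$theta holds one document's topic proportions. This is a quick sketch rather than part of the original lab code:
#expected proportion of each topic in the corpus, sorted from largest to smallest
sort(colMeans(stmFit$theta), decreasing = TRUE)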
Let us see how topics are correlated…
##see a GRAPHICAL NETWORK DISPLAY of how closely related topics are to one another (i.e., how likely they are to appear in the same document); requires the 'igraph' package
mod.out.corr <- topicCorr(stmFit)
plot.topicCorr(mod.out.corr)
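If you want the underlying numbers rather than the network graph, topicCorr() also returns the estimated between-topic correlation matrix. The sketch below peeks at the first few topics and assumes the cor component documented by stm:
#inspect the estimated correlations among the first five topics
round(mod.out.corr$cor[1:5, 1:5], 2)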
Let us see topical content by covariates.
##VISUALIZE DIFFERENCES BETWEEN TWO DIFFERENT TOPICS using the type="perspectives" option
plot.STM(stmFit,type="perspectives", topics=c(9, 10))
Let us see how the prevalence of topics varies across documents based on document covariates.
###See CORRELATIONS BETWEEN METADATA & TOPIC PREVALENCE in documents
###First, we must estimate the effect of the metadata covariates on topic prevalence in a document, so that we have something to plot
#since we're preparing these covariates by estimating their effects, we call the estimated-effects object 'prep'
#we're estimating effects across all 20 topics (1:20), using 'section_a' and a spline of 'rptyy', with the topic model stmFit
#the metadata file we call meta. uncertainty = "Global" tells it to account for the model's uncertainty. Note: when estimating the effect of one covariate, the others are held at their mean
prep <- estimateEffect(1:20 ~ section_a + s(rptyy), stmFit, meta = meta, uncertainty = "Global")
## Warning: Using formula(x) is deprecated when x is a character vector of length > 1.
## Consider formula(paste(x, collapse = " ")) instead.
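Before plotting, it can be helpful to inspect these estimated effects numerically; calling summary() on the estimateEffect object prints a regression table per topic. A quick check for two of the topics:
#regression tables for the covariate effects on topics 7 and 9
summary(prep, topics = c(7, 9))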
###See how PREVALENCE of TOPICS DIFFERS across VALUES of a CATEGORICAL COVARIATE
plot.estimateEffect(prep, covariate = "section_a", topics = seq(1, 20, 1),
                    #topic model = stmFit; method = "difference" contrasts two covariate values
                    model = stmFit, method = "difference",
                    #we contrast the two values of the binary covariate and label the x-axis accordingly; the same works for any two values of a non-binary covariate
                    cov.value1 = "0", cov.value2 = "1",
                    xlab = "Not section a ... section a",
                    main = "Effect of publishing in section a",
                    #with labeltype = "custom" you would normally also supply custom.labels
                    xlim = c(-.1, .1), labeltype = "custom")
#See how PREVALENCE of TOPICS DIFFERS across VALUES of a CONTINUOUS COVARIATE
#plotting prep on the year variable 'rptyy', a continuous covariate, with a continuous plot, focusing on topic 9
plot.estimateEffect(prep, "rptyy", method = "continuous", topics = 9,
                    printlegend = FALSE, xaxt = "n", xlab = "Time (1965-1995)")
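Because xaxt = "n" suppresses the default x-axis, you can add year tick marks yourself. A small sketch, assuming rptyy is stored as a four-digit year running from 1965 to 1995:
#add x-axis tick marks every five years
yearseq <- seq(1965, 1995, by = 5)
axis(1, at = yearseq, labels = yearseq)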
Let us see how words of the topics are emphasized differently across documents according to document covariates
#### Instead of looking at how prevalent a topic is in a class of documents categorized by meta-data covariate...
#### ... let's see how the words of the topic are emphasized differently in documents of each category of the covariate
##First, we estimate a new stm. It's the same as the old one, including the prevalence option, but we add a content option
stmContent <- stm(out$documents, out$vocab, K = 20,
                  prevalence =~ section_a + s(rptyy), content =~ section_a,
                  max.em.its = 20, data = out$meta, seed = 2022)
## Beginning Spectral Initialization
## Calculating the gram matrix...
## Using only 10000 most frequent terms during initialization...
## Finding anchor words...
## ....................
## Recovering initialization...
## ....................................................................................................
## Initialization complete.
## ....................................................................................................
## Completed E-Step (1 seconds).
## ....................................................................................................
## Completed M-Step (41 seconds).
## Completing Iteration 1 (approx. per word bound = -7.833)
## ....................................................................................................
## Completed E-Step (1 seconds).
## ....................................................................................................
## Completed M-Step (42 seconds).
## Completing Iteration 2 (approx. per word bound = -7.489, relative change = 4.392e-02)
## ....................................................................................................
## Completed E-Step (1 seconds).
## ....................................................................................................
## Completed M-Step (40 seconds).
## Completing Iteration 3 (approx. per word bound = -7.440, relative change = 6.553e-03)
## ....................................................................................................
## Completed E-Step (1 seconds).
## ....................................................................................................
## Completed M-Step (39 seconds).
## Completing Iteration 4 (approx. per word bound = -7.423, relative change = 2.311e-03)
## ....................................................................................................
## Completed E-Step (1 seconds).
## ....................................................................................................
## Completed M-Step (41 seconds).
## Completing Iteration 5 (approx. per word bound = -7.414, relative change = 1.213e-03)
## Topic 1: perri, boycott, alabama, -curreni, wilkinson
## Topic 2: --court, jog, -male, single-sex, hew
## Topic 3: feinberg, implant, weinstein, claimant, anchorag
## Topic 4: superintendent’, homework, schooi, “school, miniatur
## Topic 5: groton, spock, anti-apartheid, livermor, anti-abort
## Topic 6: cato, loop, plaincloth, windshield, hustl
## Topic 7: mta, tribe, todd, vallon, valuat
## Topic 8: metz, coeduc, hellman, depalma, campus’
## Topic 9: pro-choic, cellular, ‘director, extravag, mismanag
## Topic 10: trucker, diesel, windowless, lockhe, updat
## Topic 11: peanut, ithaca, divestitur, -america, cornel
## Topic 12: lasker, rickey, inmat, premium, excrement
## Topic 13: recant, omit, dalton, resurrect, wisdom
## Topic 14: fisherman, dignitari, russia, passov, grope
## Topic 15: tariff, monongahela, energet, contracept, deregul
## Topic 16: greenhousespeci, rehnquist, constern, irrevers, greenhous
## Topic 17: forsyth, gamarekian, vatican, conference’, canton
## Topic 18: chief’, chattanooga, overtown, henderson, amic
## Topic 19: arsonist, deton, ablaz, viola, string
## Topic 20: ingredi, fda, cancer, weaken, scientif
## Aspect 1: sald, harvey, assail, spectal, greenwich
## Aspect 2: convey, wouid, abortionist, profil, kim
## ....................................................................................................
## Completed E-Step (0 seconds).
## ....................................................................................................
## Completed M-Step (41 seconds).
## Completing Iteration 6 (approx. per word bound = -7.408, relative change = 7.794e-04)
## ....................................................................................................
## Completed E-Step (0 seconds).
## ....................................................................................................
## Completed M-Step (38 seconds).
## Completing Iteration 7 (approx. per word bound = -7.404, relative change = 5.094e-04)
## ....................................................................................................
## Completed E-Step (1 seconds).
## ....................................................................................................
## Completed M-Step (38 seconds).
## Completing Iteration 8 (approx. per word bound = -7.401, relative change = 3.520e-04)
## ....................................................................................................
## Completed E-Step (0 seconds).
## ....................................................................................................
## Completed M-Step (38 seconds).
## Completing Iteration 9 (approx. per word bound = -7.400, relative change = 2.433e-04)
## ....................................................................................................
## Completed E-Step (0 seconds).
## ....................................................................................................
## Completed M-Step (42 seconds).
## Completing Iteration 10 (approx. per word bound = -7.398, relative change = 1.845e-04)
## Topic 1: wilkinson, perri, alabama, naacp, boycott
## Topic 2: nondiscriminatori, -male, jog, local’, hew
## Topic 3: feinberg, implant, herbicid, weinstein, claimant
## Topic 4: superintendent’, schooi, homework, miniatur, doc
## Topic 5: livermor, groton, spock, anti-abort, blockad
## Topic 6: hustl, windshield, cato, loop, afir
## Topic 7: mta, single-famili, tribe’, streamlin, todd
## Topic 8: metz, faculty’, trustees’, hellman, newest
## Topic 9: rescue’, pro-choic, ‘director, dens, nude
## Topic 10: trucker, diesel, windowless, strikers’, freight
## Topic 11: ithaca, carnegi, peanut, -america, winston-salem
## Topic 12: inmat, lasker, rickey, inmates’, ventil
## Topic 13: recant, dalton, omit, resurrect, clemenc
## Topic 14: dignitari, butterfield, fisherman, russia, grope
## Topic 15: tariff, monongahela, clairton, energet, officiai
## Topic 16: greenhousespeci, rehnquist, greenhous, restat, ste
## Topic 17: forsyth, royal, gamarekian, klan’, ratif
## Topic 18: chief’, chattanooga, patrolmen’, overtown, boardwalk
## Topic 19: hijack, arsonist, deton, ablaz, viola
## Topic 20: pharmaceut, fda, ingredi, aerospac, cancer
## Aspect 1: greenwich, spectal, “don’t, sald, retail
## Aspect 2: convey, wouid, abortionist, profil, rhetor
## ....................................................................................................
## Completed E-Step (0 seconds).
## ....................................................................................................
## Completed M-Step (39 seconds).
## Completing Iteration 11 (approx. per word bound = -7.397, relative change = 1.959e-04)
## ....................................................................................................
## Completed E-Step (1 seconds).
## ....................................................................................................
## Completed M-Step (39 seconds).
## Completing Iteration 12 (approx. per word bound = -7.396, relative change = 1.120e-04)
## ....................................................................................................
## Completed E-Step (0 seconds).
## ....................................................................................................
## Completed M-Step (40 seconds).
## Completing Iteration 13 (approx. per word bound = -7.395, relative change = 1.035e-04)
## ....................................................................................................
## Completed E-Step (0 seconds).
## ....................................................................................................
## Completed M-Step (40 seconds).
## Completing Iteration 14 (approx. per word bound = -7.395, relative change = 7.604e-05)
## ....................................................................................................
## Completed E-Step (0 seconds).
## ....................................................................................................
## Completed M-Step (40 seconds).
## Completing Iteration 15 (approx. per word bound = -7.394, relative change = 6.286e-05)
## Topic 1: wilkinson, perri, naacp, alabama, south”
## Topic 2: nondiscriminatori, --court, -male, jog, local’
## Topic 3: feinberg, implant, herbicid, weinstein, claimant
## Topic 4: superintendent’, schooi, homework, doc, miniatur
## Topic 5: groton, livermor, blockad, spock, anti-abort
## Topic 6: windshield, hustl, cato, accost, loop
## Topic 7: mta, tribe’, single-famili, valuat, todd
## Topic 8: metz, faculty’, coeduc, hellman, newest
## Topic 9: rescue’, mismanag, museum’, ‘director, cellular
## Topic 10: trucker, diesel, strikers’, windowless, peru
## Topic 11: ithaca, rabbit, peanut, -america, carnegi
## Topic 12: inmat, lasker, riker, inmates’, rickey
## Topic 13: recant, resurrect, singular, clemenc, wisdom
## Topic 14: dignitari, solemn, fisherman, butterfield, russia
## Topic 15: tariff, lutheran, monongahela, solberg, clairton
## Topic 16: greenhousespeci, rehnquist, restat, ste, irrevers
## Topic 17: forsyth, gamarekian, klan’, ratif, conference’
## Topic 18: chief’, chattanooga, overtown, patrolmen’, manslaught
## Topic 19: hijack, arsonist, viola, deton, ablaz
## Topic 20: aerosol, pharmaceut, fda, ingredi, aerospac
## Aspect 1: greenwich, “don’t, spectal, say, albert
## Aspect 2: convey, wouid, abortionist, profil, rhetor
## ....................................................................................................
## Completed E-Step (0 seconds).
## ....................................................................................................
## Completed M-Step (38 seconds).
## Completing Iteration 16 (approx. per word bound = -7.394, relative change = 4.695e-05)
## ....................................................................................................
## Completed E-Step (0 seconds).
## ....................................................................................................
## Completed M-Step (38 seconds).
## Completing Iteration 17 (approx. per word bound = -7.394, relative change = 4.760e-05)
## ....................................................................................................
## Completed E-Step (0 seconds).
## ....................................................................................................
## Completed M-Step (38 seconds).
## Completing Iteration 18 (approx. per word bound = -7.393, relative change = 3.990e-05)
## ....................................................................................................
## Completed E-Step (0 seconds).
## ....................................................................................................
## Completed M-Step (38 seconds).
## Completing Iteration 19 (approx. per word bound = -7.393, relative change = 6.235e-05)
## ....................................................................................................
## Completed E-Step (0 seconds).
## ....................................................................................................
## Completed M-Step (38 seconds).
## Model Terminated Before Convergence Reached
##Next, we plot using the type="perspectives" option of the plot.STM function
plot.STM(stmContent,type="perspectives", topics=3)
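For a content-covariate model you can also print verbose labels that contrast how each level of the covariate uses a topic's vocabulary. stm provides sageLabels() for this; the call below is a quick sketch rather than part of the original lab code.
#verbose topic and topic-by-covariate labels for the content model
sageLabels(stmContent, n = 5)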
Now let us use supplementary packages to visualize stm outputs.
stmprinter: Print multiple stm model dashboards to a pdf file for inspection. Beautiful automated reports from multiple stm runs.
stminsights: A Shiny Application for Inspecting Structural Topic Models. A shiny GUI with beautiful graphics.
themetagenomics: Exploring Thematic Structure and Predicted Functionality of 16S rRNA Amplicon Data. STM for rRNA data.
tidystm: Extract (tidy) effects from estimateEffect. Makes it easy to make ggplot2 graphics for STM.
stmgui: Shiny Application for Creating STM Models. This is a Shiny GUI for running basic STM models.
stmBrowser: An R Package for the Structural Topic Model Browser. This D3 visualization allows users to interactively explore the relationships between topics and the covariates estimated from the stm package in R.
stmCorrViz: A Tool for Structural Topic Model Visualizations. This package uses D3 to generate an interactive hierarchical topic explorer.
p_load(stmprinter, stminsights, themetagenomics, tidystm, stmgui,
       stmBrowser, stmCorrViz)
#devtools::install_github("mikajoh/stmprinter")
#devtools::install_github("mikajoh/tidystm", dependencies = TRUE)
#devtools::install_github("mroberts/stmBrowser", dependencies = TRUE)
Let us now build an interactive visualization of the topic model. For stmBrowser, check here for more details: https://github.com/mroberts/stmBrowser (its major function is stmBrowser()). Below we use stmCorrViz, which generates an interactive hierarchical topic explorer.
stmCorrViz::stmCorrViz(stmFit,
                       file_out = "./lab7-viz.html",
                       documents_raw = data$shortdoc)
## Inspecting valid thresholds via grid search. Progress: (text progress bar from 0% to 100% omitted)
You can open the generated lab7-viz.html file in a browser to explore the interactive visualization.