Sentiment Analysis
Learning Objectives
The primary goals of this tutorial include:
Introducing basic lexicon-based sentiment analysis in R
Introducing conventional ML based SA in R
Introducing DL based SA in R
Lexicon-based SA
We first need to load some packages.
if (!requireNamespace("pacman")) install.packages('pacman')
## Loading required namespace: pacman
library(pacman)
p_load(tidyverse, sentimentr, lexicon, syuzhet, parallel)
Load sample data on earnings call transcripts
The R tidyverse package provides a series of useful data wrangling tools. You can check it out here: https://www.tidyverse.org/.
We use earnings call transcripts as an illustrative example. We mentioned that the QJE paper on firm-level political risk computes its measure from earnings call data. You can get their final dataset via https://www.firmlevelrisk.com/
We use some of these earnings call excerpts. The text has already been parsed and is structured by speaker-session. You can get the sample data via https://yongjunzhang.com/files/css/ect_sample.RData
# load data
# load(url("https://yongjunzhang.com/files/css/ect_sample.RData"))
# you can also download the data and then load it
load("ect_sample.RData")
ect_sample <- ect_sample %>%
  filter(!is.na(tic))
# Show a table for visual check
knitr::kable(ect_sample[1:10,], caption = "Earnings call parsed data")
SA_ID | speakerName | speakerTitle | text | tic |
---|---|---|---|---|
3125276 | Peter R. Huntsman | President_Chief Executive Officer & Director | Thank you. | HUN |
3499636 | Bill Griffiths | Chairman_President_CEO | Thank you. | NX |
3403385 | Ewout Steenbergen | Executive Vice President and Chief Financial Officer | Ryan, in general terms there is not a lot we can say at this point in time. We go through our normal procedures as we do every year during the third quarter. We do our internal studies, we supplement that with external studies, we get external and internal actuarial advice on that. Our auditors are involved, we look at economic assumptions, we look at actuarial assumptions, we look at policy holder behavioral assumptions and so on. And then we come to conclusions. At this point in time we had not drawn any conclusions and there is no further commentary we can give. Obviously, we will come back to you with a clearer explanation with all the details during our third quarter earnings call. | VOYA |
3434086 | Boyd Hoback | President & Chief Executive Officer | Thank you, Karen. Thank you all for joining us again. Appreciate your time. We’re real excited obviously about the progress we’re making on Bad Daddy’s and continued progress on Good Times. We look forward to talking to you next quarter. Thank you. | GTIM |
3399295 | Elizabeth Smith | Chairman and Chief Executive Officer | The first one is it will have full functionality. Release one will have the click-through seating, where am I on the waitlist, is my table ready and the ability to pay at the table. Released two, which will come upon after that we will have some even cooler things that I don’t want to get into for competitive purposes on that, so that should be terrific. In terms of the loyalty program, as you know, we have been really far ahead on this. That is really one of the advantages of our portfolio. We really like what we see with our Dine Rewards program. Having four brands allows us to have a loyalty program that people don’t get tired with because they can dine across the four brands. We’ve expanded into Georgia this year, out of its five test markets. And that is going to be a piece of the go-forward strategy when we complete that task. But so far we really like what we are seeing. And any app that we would come out with in the future would dovetail really nicely with the loyalty program. | BLMN |
3185806 | Catherine Long | Chief Financial Officer_Treasurer & Executive VP | Good morning. | STOR |
3845316 | Ralph G. D’Ambrosio | Chief Financial Officer & Senior Vice President | Sure. So we expect the book-to-bill ratio in Electronic Systems and Communication Systems to be slightly above 1.0 for this year, and in Aerospace just under 1.0. Now, there’s a lot of opportunities, some that Mike talked about, that we’re pursuing in Aerospace. And to the extent that we can capitalize on them, that book-to-bill ratio in Aerospace Systems should move to 1.0. But from what we’re seeing today, I have it just under 1.0 for 2016. | LLL |
2417295 | Mark Dankberg | Chairman of the Board_Chief Executive Officer | We are at the very front end of the commercial air part. So it is beginning to be more meaningful but not yet. But you are probably not far off. And the R&D thing, Shawn? | VSAT |
3930666 | Michael Dennis Casey | Carter’s_Inc. | Thanks, Rick. | CRI |
3610666 | Jerry Moyes | Founder_Chief Executive Officer | We agree that opportunity exists for margin expansion in all four of our operating segments. In our trucking segment, truck load dedicated in swift refrigerated. We believe we can improve margins even in a soft break environment as we hyper focus like never before on utilization, improved safety and increased driver retention. In our intermodal segment margins should improve as we continue to work to fill the existing container fleet and meet our turn objectives of two turns a month. Reaching these levels of turns will significantly improve our chassis and dray costs and as a result create opportunities to expand our intermodal margins. Additionally we have nearly completed the build out of our dray operational infrastructure, which is a critical component to improving and sustaining our desired intermodal margins. | KNX |
R syuzhet Package
Let us use the syuzhet package as the baseline. Matthew Jockers created the syuzhet package, which utilizes dictionary lookups for the Bing, NRC, and Afinn methods as well as a custom dictionary. He also provides a wrapper for the Stanford coreNLP tools, which use much more sophisticated analysis.
You can check here for a tutorial for syuzhet: https://cran.r-project.org/web/packages/syuzhet/vignettes/syuzhet-vignette.html
We are going to use these functions:
- get_sentences(): implements the openNLP sentence tokenizer (tokenizes your texts at the sentence level)
- get_tokens(): tokenizes by words instead of sentences
- get_sentiment(): takes two parameters, a character vector (of sentences or words) and a "method." The method you select determines which of the available sentiment extraction methods to employ. In the example that follows below, the "syuzhet" (default) method is called. Other methods include "bing", "afinn", "nrc", and "stanford" (YZ comment: i.e., using different dictionary lookups)
- get_nrc_sentiment(): implements Saif Mohammad's NRC Emotion lexicon. The NRC emotion lexicon is a list of words and their associations with eight emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) and two sentiments (negative and positive) (see http://www.purl.org/net/NRCemotionlexicon)
- get_percentage_values(): divides a text into an equal number of "chunks" and then calculates the mean sentiment valence for each
- simple_plot(): takes a sentiment vector and applies three smoothing methods (a moving average, loess, and discrete cosine transformation)
Some Notes:
afinn is an English word list developed by Finn Årup Nielsen. Word scores range from minus five (negative) to plus five (positive). You can download the original list via http://www2.imm.dtu.dk/pubdb/pubs/6010-full.html;
the bing opinion lexicon can be accessed via https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html;
nrc is a list of English words and their associations with eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) and two sentiments (negative and positive), and it can be accessed via https://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm
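If you want to peek at what these lexicons actually contain, syuzhet ships them and exposes them via get_sentiment_dictionary; a quick hedged sketch (the function is part of syuzhet):
# inspect the first entries of two bundled lexicons
head(get_sentiment_dictionary("afinn"))
head(get_sentiment_dictionary("bing"))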
# Let us tokenize our texts first
my_example_text <- ect_sample$text
s_v <- get_sentences(my_example_text)
head(s_v)
## [1] "Thank you."
## [2] "Thank you."
## [3] "Ryan, in general terms there is not a lot we can say at this point in time."
## [4] "We go through our normal procedures as we do every year during the third quarter."
## [5] "We do our internal studies, we supplement that with external studies, we get external and internal actuarial advice on that."
## [6] "Our auditors are involved, we look at economic assumptions, we look at actuarial assumptions, we look at policy holder behavioral assumptions and so on."
<- get_sentiment(s_v, method="syuzhet")
syuzhet_vector head(syuzhet_vector)
## [1] 0.50 0.50 0.40 0.00 0.00 -0.25
Let us try different methods (use different lexicons)
<- get_sentiment(s_v, method="bing")
bing_vector head(bing_vector)
## [1] 1 1 0 0 0 0
<- get_sentiment(s_v, method="afinn")
afinn_vector head(afinn_vector)
## [1] 2 2 0 0 0 0
<- get_sentiment(s_v, method="nrc", lang = "english")
nrc_vector head(nrc_vector)
## [1] 0 0 1 0 0 0
Let us compare the signs of the scores across methods
rbind(
sign(head(syuzhet_vector)),
sign(head(bing_vector)),
sign(head(afinn_vector)),
sign(head(nrc_vector))
)
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 1 1 1 0 0 -1
## [2,] 1 1 0 0 0 0
## [3,] 1 1 0 0 0 0
## [4,] 0 0 1 0 0 0
Let us plot the result
plot(
syuzhet_vector, type="h",
main="Example Plot Trajectory",
xlab = "Narrative Time",
ylab= "Emotional Valence"
)
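The chunking and smoothing helpers listed earlier can turn this noisy trajectory into an overall shape; a brief sketch (both are syuzhet functions; the bin count of 10 is an arbitrary choice for illustration):
# mean valence per chunk of the text
pct <- get_percentage_values(syuzhet_vector, bins = 10)
plot(pct, type = "l", main = "Mean Valence per Chunk")
# moving average, loess, and DCT smoothers in one figure
simple_plot(syuzhet_vector)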
Let us look at the get_nrc_sentiment results
nrc_data <- get_nrc_sentiment(s_v)
## Warning: `spread_()` was deprecated in tidyr 1.2.0.
## Please use `spread()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
angry_items <- which(nrc_data$anger > 0)
s_v[angry_items][1:10]
## [1] "We're real excited obviously about the progress we're making on Bad Daddy's and continued progress on Good Times."
## [2] "But we're not really beating our self up or anything about where we stand today."
## [3] "We've deployed a fair amount of money in buybacks because we think it's a good buy and based on what our outlook is for cash flow and acquisition pipeline and timing of deals and where we want to be from a leverage standpoint, I don't think it's changed at all."
## [4] "But, you got – the folks off the bench as Jack mentioned that hurt our gross margins in the first quarter but folks are on project and we changed our model a little bit in that – we are going to do a little bit plus bench moving forward for Q2 and Q3 and probably in the Q4 than we have up until now and use the market a little bit more to make that sales force practice and I'm convinced that would be fine."
## [5] "We are way early in the process and I don't want to confuse anyone or scare anyone, right, that we are going to change the counseling because that's not the plan."
## [6] "I expect that to be a bit of synergy there that given our SG&A model versus most of our peers but nothing on the side in recruiting side."
## [7] "We also had strong product expansion with the on-demand service going live, NSX service, hybrid networking."
## [8] "Well, I think that something that management is going to decide depending on what type of pickup and demand we see and the type of margins we're going to see."
## [9] "I think that this is an area that we feel we could generate cash and that won't be an incremental expenses."
## [10] "And then the other thing that plays into it of course, with being on a cash basis for more than half of our revenue is the timing of reimbursement one quarter versus the next."
joy_items <- which(nrc_data$joy > 0)
s_v[joy_items][1:10]
## [1] "We're real excited obviously about the progress we're making on Bad Daddy's and continued progress on Good Times."
## [2] "Release one will have the click-through seating, where am I on the waitlist, is my table ready and the ability to pay at the table."
## [3] "Good morning."
## [4] "We believe we can improve margins even in a soft break environment as we hyper focus like never before on utilization, improved safety and increased driver retention."
## [5] "In our intermodal segment margins should improve as we continue to work to fill the existing container fleet and meet our turn objectives of two turns a month."
## [6] "Reaching these levels of turns will significantly improve our chassis and dray costs and as a result create opportunities to expand our intermodal margins."
## [7] "Good, good to hear your voice."
## [8] "We're happy with our offering."
## [9] "So big focus, top priority, and we're making good progress."
## [10] "With respect to pricing of those types of assets, I don't really have any comment on that."
valence <- (nrc_data[, 9] * -1) + nrc_data[, 10]
valence[1:10]
## [1] 0 0 1 0 0 0 0 0 0 0
barplot(
sort(colSums(prop.table(nrc_data[, 1:8]))),
horiz = TRUE,
cex.names = 0.7,
las = 1,
main = "Emotions in Sample Earnings Call Transcripts", xlab="Percentage"
)
You can also customize your lexicons
<- "I love when I see something beautiful. I hate it when ugly feelings creep into my head."
my_text <- get_sentences(my_text)
char_v <- "custom"
method <- data.frame(word=c("love", "hate", "beautiful", "ugly"), value=c(1,-1,1, -1))
custom_lexicon <- get_sentiment(char_v, method = method, lexicon = custom_lexicon)
my_custom_values my_custom_values
## [1] 2 -2
Let Us Move to the sentimentr Package
YZ's comment: because the syuzhet package does not account for negation and other valence shifters, I typically use sentimentr as the default package for sentiment analysis in R. Note that sentimentr reuses some functions from syuzhet, for instance get_sentences().
Another R package that deals with valence shifters is vader: https://cran.r-project.org/web/packages/vader/vader.pdf. You can use its get_vader() function.
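For a quick sense of vader's interface, here is a minimal sketch (assuming the vader package is installed; get_vader returns word-level scores plus compound, pos, neu, and neg summaries):
p_load(vader)
get_vader("I really don't like bad dogs")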
This tutorial is based on Tyler Rinker's amazing one: https://github.com/trinker/sentimentr
-> sentimentr attempts to take into account valence shifters (i.e., negators, amplifiers (intensifiers), de-amplifiers (downtoners), and adversative conjunctions) while maintaining speed. Simply put, sentimentr is an augmented dictionary lookup. The next paragraphs address why that matters.
-> A negator flips the sign of a polarized word (e.g., “I do not like it.”). See lexicon::hash_valence_shifters[y==1] for examples. An amplifier (intensifier) increases the impact of a polarized word (e.g., “I really like it.”). See lexicon::hash_valence_shifters[y==2] for examples. A de-amplifier (downtoner) reduces the impact of a polarized word (e.g., “I hardly like it.”). See lexicon::hash_valence_shifters[y==3] for examples. An adversative conjunction overrules the previous clause containing a polarized word (e.g., “I like it but it’s not worth it.”). See lexicon::hash_valence_shifters[y==4] for examples.
-> Valence shifters affect the polarized words. In the case of negators and adversative conjunctions, the entire sentiment of the clause may be reversed or overruled. So if valence shifters occur fairly frequently, a simple dictionary lookup may not model the sentiment appropriately. You may be wondering how frequently these valence shifters co-occur with polarized words, potentially changing, or even reversing and overruling, the clause's sentiment. Rinker's README (linked above) reports the rate of sentence-level co-occurrence of valence shifters with polarized words across a few types of texts; a quick comparison follows below.
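To see why this matters, here is a quick hedged comparison on a single negated sentence (exact scores depend on the dictionary versions you have installed):
# a plain lexicon lookup scores "like" as positive, ignoring the negator
get_sentiment("I do not like it.", method = "bing")
# sentimentr's negator handling flips the sign
sentiment(get_sentences("I do not like it."))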
Functions
There are two main functions in sentimentr (the top two rows of the table below), along with several helper functions:
Function | Description |
---|---|
sentiment | Sentiment at the sentence level |
sentiment_by | Aggregated sentiment by group(s) |
profanity | Profanity at the sentence level |
profanity_by | Aggregated profanity by group(s) |
emotion | Emotion at the sentence level |
emotion_by | Aggregated emotion by group(s) |
uncombine | Extract sentence level sentiment from sentiment_by |
get_sentences | Regex based string to sentence parser (or get sentences from sentiment/sentiment_by) |
replace_emoji | Replace emojis with word equivalent |
replace_emoticon | Replace emoticons with word equivalent |
replace_grade | Replace grades (e.g., "A+") with word equivalent |
replace_internet_slang | Replace internet slang with word equivalent |
replace_rating | Replace ratings (e.g., "10 out of 10", "3 stars") with word equivalent |
as_key | Coerce a data.frame lexicon to a polarity hash key |
is_key | Check if an object is a hash key |
update_key | Add/remove terms to/from a hash key |
highlight | Highlight positive/negative sentences as an HTML document |
general_rescale | Generalized rescaling function to rescale sentiment scoring |
sentiment_attributes | Extract the sentiment based attributes from a text |
validate_sentiment | Validate sentiment score sign against known results |
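As one example, the key helpers near the bottom of the table let you build and validate a custom polarity dictionary; a small sketch (the words and weights are invented for illustration):
# coerce a data.frame lexicon to a polarity hash key
my_key <- as_key(data.frame(words = c("cool", "lame"), polarity = c(1, -1)))
is_key(my_key)
# use the custom key in place of the default dictionary
sentiment(get_sentences("That is so cool."), polarity_dt = my_key)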
The Equation
The equation below describes the augmented dictionary method of sentimentr that may give better results than a simple lookup dictionary approach that does not consider valence shifters. The equation used by the algorithm to assign value to the polarity of each sentence first utilizes a sentiment dictionary (e.g., Jockers (2017)) to tag polarized words. Each paragraph (\(p_i = \{s_1, s_2, ..., s_n\}\)) composed of sentences is broken into element sentences (\(s_{i,j} = \{w_1, w_2, ..., w_n\}\)), where \(w\) are the words within sentences. Each sentence (\(s_j\)) is broken into an ordered bag of words. Punctuation is removed with the exception of pause punctuations (commas, colons, semicolons), which are considered a word within the sentence. I will denote pause words as \(cw\) (comma words) for convenience. We can represent these words in an i,j,k notation as \(w_{i,j,k}\). For example, \(w_{3,2,5}\) would be the fifth word of the second sentence of the third paragraph. While I use the term paragraph, this merely represents a complete turn of talk. For example, it may be a cell-level response in a questionnaire composed of sentences.
The words in each sentence (\(w_{i,j,k}\)) are searched and compared to a dictionary of polarized words (e.g., a combined and augmented version of Jockers's (2017) [originally exported by the syuzhet package] & Rinker's augmented Hu & Liu (2004) dictionaries in the lexicon package). Positive (\(w_{i,j,k}^{+}\)) and negative (\(w_{i,j,k}^{-}\)) words are tagged with a \(+1\) and \(-1\) respectively (or another positive/negative weighting if the user provides the sentiment dictionary). I will denote polarized words as \(pw\) for convenience. These will form a polar cluster (\(c_{i,j,l}\)), which is a subset of the sentence (\(c_{i,j,l} \subseteq s_{i,j}\)).
The polarized context cluster (\(c_{i,j,l}\)) of words is pulled from around the polarized word (\(pw\)) and defaults to 4 words before and two words after \(pw\) to be considered as valence shifters. The cluster can be represented as (\(c_{i,j,l} = \{pw_{i,j,k - nb}, ..., pw_{i,j,k}, ..., pw_{i,j,k + na}\}\)), where \(nb\) & \(na\) are the parameters n.before and n.after set by the user. The words in this polarized context cluster are tagged as neutral (\(w_{i,j,k}^{0}\)), negator (\(w_{i,j,k}^{n}\)), amplifier [intensifier] (\(w_{i,j,k}^{a}\)), or de-amplifier [downtoner] (\(w_{i,j,k}^{d}\)). Neutral words hold no value in the equation but do affect word count (\(n\)). Each polarized word is then weighted (\(w\)) based on the weights from the polarity_dt argument and then further weighted by the function and number of the valence shifters directly surrounding the positive or negative word (\(pw\)). Pause (\(cw\)) locations (punctuation that denotes a pause, including commas, colons, and semicolons) are indexed and considered in calculating the upper and lower bounds of the polarized context cluster. This is because these marks indicate a change in thought, and words prior are not necessarily connected with words after these punctuation marks. The lower bound of the polarized context cluster is constrained to \(\max \{pw_{i,j,k - nb}, 1, \max \{cw_{i,j,k} < pw_{i,j,k}\}\}\) and the upper bound is constrained to \(\min \{pw_{i,j,k + na}, w_{i,jn}, \min \{cw_{i,j,k} > pw_{i,j,k}\}\}\), where \(w_{i,jn}\) is the number of words in the sentence.
The core value in the cluster, the polarized word, is acted upon by valence shifters. Amplifiers increase the polarity by 1.8 (.8 is the default weight (\(z\))). Amplifiers (\(w_{i,j,k}^{a}\)) become de-amplifiers if the context cluster contains an odd number of negators (\(w_{i,j,k}^{n}\)). De-amplifiers work to decrease the polarity. Negation (\(w_{i,j,k}^{n}\)) acts on amplifiers/de-amplifiers as discussed but also flips the sign of the polarized word. Negation is determined by raising \(-1\) to the power of the number of negators (\(w_{i,j,k}^{n}\)) plus \(2\). Simply, this is a result of the belief that two negatives equal a positive, 3 negatives a negative, and so on.
The adversative conjunctions (i.e., 'but', 'however', and 'although') also weight the context cluster. An adversative conjunction before the polarized word (\(w_{adversative\,conjunction}, ..., w_{i, j, k}^{p}\)) up-weights the cluster by \(1 + z_2 * \{|w_{adversative\,conjunction}|, ..., w_{i, j, k}^{p}\}\) (.85 is the default weight (\(z_2\)), where \(|w_{adversative\,conjunction}|\) is the number of adversative conjunctions before the polarized word). An adversative conjunction after the polarized word down-weights the cluster by \(1 + \{w_{i, j, k}^{p}, ..., |w_{adversative\,conjunction}| * -1\} * z_2\). This corresponds to the belief that an adversative conjunction makes the next clause of greater value while lowering the value placed on the prior clause.
The researcher may provide a weight (\(z\)) to be utilized with amplifiers/de-amplifiers (default is \(.8\); the de-amplifier weight is constrained to a lower bound of \(-1\)). Last, these weighted context clusters (\(c_{i,j,l}\)) are summed (\(c'_{i,j}\)) and divided by the square root of the word count (\(\sqrt{w_{i,jn}}\)), yielding an unbounded polarity score (\(\delta_{i,j}\)) for each sentence:
\[\delta_{i,j} = \frac{c'_{i,j}}{\sqrt{w_{i,jn}}}\]
Where:
\[c'_{i,j} = \sum{((1 + w_{amp} + w_{deamp})\cdot w_{i,j,k}^{p}(-1)^{2 + w_{neg}})}\]
\[w_{amp} = \sum{(w_{neg}\cdot (z \cdot w_{i,j,k}^{a}))}\]
\[w_{deamp} = \max(w_{deamp'}, -1)\]
\[w_{deamp'} = \sum{(z(- w_{neg}\cdot w_{i,j,k}^{a} + w_{i,j,k}^{d}))}\]
\[w_{b} = 1 + z_2 * w_{b'}\]
\[w_{b'} = \sum{(|w_{adversative\,conjunction}|, ..., w_{i, j, k}^{p}, w_{i, j, k}^{p}, ..., |w_{adversative\,conjunction}| * -1)}\]
\[w_{neg} = \left(\sum{w_{i,j,k}^{n}}\right) \bmod 2\]
To get the mean of all sentences (\(s_{i,j}\)) within a paragraph/turn of talk (\(p_{i}\)), simply take the average sentiment score, \(p_{i,\delta_{i,j}} = \frac{1}{n} \sum{\delta_{i,j}}\), or use an available weighted average (the default average_weighted_mixed_sentiment, which up-weights the negative values in a vector while also down-weighting the zeros, or average_downweighted_zero, which simply down-weights the zero polarity scores).
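As a worked illustration (assuming, purely for the arithmetic, that "like" carries a dictionary weight of \(0.5\)): in "I really like it", the amplifier "really" up-weights the polarized word by the default \(z = .8\), giving a cluster value of \((1 + .8) \cdot 0.5 = 0.9\); with \(n = 4\) words, the sentence score is \(\delta = 0.9/\sqrt{4} = 0.45\). Adding a negator ("I really do not like it") makes \(w_{neg} = 1\), which flips the sign via \((-1)^{2 + 1}\) and turns the amplifier into a de-amplifier.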
Preferred Workflow
Here is a basic sentiment demo. Notice that the first thing you should do is split your text data into sentences (a process called sentence boundary disambiguation) via the get_sentences function. This can be handled within sentiment (i.e., you can pass a raw character vector), but it slows the function down and should be done once rather than every time the function is called. Additionally, a warning will be thrown if a larger raw character vector is passed. The preferred workflow is to split the text into sentences with get_sentences before any sentiment analysis is done.
mytext <- c(
  'do you like it? But I hate really bad dogs',
  'I am the best friend.',
  'Do you really like it? I\'m not a fan'
)
mytext <- get_sentences(mytext)
sentiment(mytext)
## element_id sentence_id word_count sentiment
## 1: 1 1 4 0.2500000
## 2: 2 1 6 -1.8677359
## 3: 3 1 5 0.5813777
## 4: 4 1 5 0.4024922
## 5: 5 1 4 0.0000000
To aggregate by element (column cell or vector element), use sentiment_by with by = NULL.
mytext <- c(
  'do you like it? But I hate really bad dogs',
  'I am the best friend.',
  'Do you really like it? I\'m not a fan'
)
mytext <- get_sentences(mytext)
sentiment_by(mytext)
## element_id word_count sd ave_sentiment
## 1: 1 4 NA 0.2500000
## 2: 2 6 NA -1.8677359
## 3: 3 5 NA 0.5813777
## 4: 4 5 NA 0.4024922
## 5: 5 4 NA 0.0000000
To aggregate by grouping variables, use sentiment_by with the by argument.
out <- with(
  ect_sample,
  sentiment_by(
    get_sentences(text),
    list(tic)
  )
)
## Warning: Each time `sentiment_by` is run it has to do sentence boundary disambiguation when a
## raw `character` vector is passed to `text.var`. This may be costly of time and
## memory. It is highly recommended that the user first runs the raw `character`
## vector through the `get_sentences` function.
head(out)
## tic word_count sd ave_sentiment
## 1: <NA> 126209 0.2727835 0.2363563
## 2: A 35 NA 0.8062772
## 3: AACC 28 0.1961161 0.1513462
## 4: AAI 15 NA 0.2065591
## 5: AAL 11 NA 0.0000000
## 6: AAON 66 NA -0.1169369
Plotting the Aggregated Sentiment
# let us plot 20 random companies
out_20 <- sample_n(out, 20)
plot(out_20)
You Can Use a Customized Dictionary as Well
Because earnings call transcripts come from the financial domain, we use a finance-specific dictionary (Loughran-McDonald).
out <- with(
  ect_sample,
  sentiment_by(
    get_sentences(text),
    polarity_dt = lexicon::hash_sentiment_loughran_mcdonald,
    list(tic)
  )
)
## Warning: Each time `sentiment_by` is run it has to do sentence boundary disambiguation when a
## raw `character` vector is passed to `text.var`. This may be costly of time and
## memory. It is highly recommended that the user first runs the raw `character`
## vector through the `get_sentences` function.
plot(sample_n(out,30))
Plotting at the Sentence Level
The plot method for the class sentiment uses syuzhet's get_transformed_values combined with ggplot2 to make a reasonable, smoothed plot for the duration of the text based on percentage, allowing for comparison between plots of different texts. This plot gives the overall shape of the text's sentiment. See syuzhet::get_transformed_values for more details.
plot(uncombine(sample_n(out,30)))
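If you want the same smoothed shape outside the plot method, you can call syuzhet directly; a hedged sketch applying the DCT smoother to the earlier syuzhet_vector (low_pass_size = 5 is syuzhet's default):
dct_values <- get_dct_transform(syuzhet_vector, low_pass_size = 5, scale_range = TRUE)
plot(dct_values, type = "l", main = "DCT-Smoothed Sentiment Shape")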
Conventional ML based SA
On top of lexicon-based methods, you can also use supervised machine learning to implement sentiment analysis if you have a training dataset or can find a pretrained model that fits your research purpose.
The basic idea is very straightforward. You need to do some feature engineering to obtain a list of features used to classify sentiments (e.g., positive, negative, or neutral). So you can treat this as a classification problem.
One option is to use n-grams as your feature matrix and then train a classifier using the quanteda package in R; you can also use the caret package, as we covered in a previous lab session. A sketch of building n-gram features follows below.
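For instance, here is a hedged sketch of building an n-gram feature matrix with quanteda (tokens_ngrams is quanteda's own helper; the two tiny documents are made up for illustration):
p_load(quanteda)
toks <- tokens(c(d1 = "not good at all", d2 = "very good indeed"), remove_punct = TRUE)
# unigrams and bigrams become the feature columns of the dfm
dfm(tokens_ngrams(toks, n = 1:2))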
We will use IMDb Movie Reviews as an illustrative example. The IMDb Movie Reviews dataset is a binary sentiment analysis dataset consisting of 50,000 reviews from the Internet Movie Database (IMDb) labeled as positive or negative. The dataset contains an even number of positive and negative reviews. Only highly polarizing reviews are considered. A negative review has a score ≤ 4 out of 10, and a positive review has a score ≥ 7 out of 10. No more than 30 reviews are included per movie. The dataset also contains additional unlabeled data. See https://paperswithcode.com/dataset/imdb-movie-reviews. We will use the toy dataset from quanteda's data_corpus_moviereviews. You can check this tutorial: https://tutorials.quanteda.io/machine-learning/nb/
p_load(quanteda, quanteda.textmodels, caret)
corp_movies <- data_corpus_moviereviews
summary(corp_movies, 10)
## Corpus consisting of 2000 documents, showing 10 documents:
##
## Text Types Tokens Sentences sentiment id1 id2
## cv000_29416.txt 354 841 9 neg cv000 29416
## cv001_19502.txt 156 278 1 neg cv001 19502
## cv002_17424.txt 276 553 3 neg cv002 17424
## cv003_12683.txt 313 555 2 neg cv003 12683
## cv004_12641.txt 380 841 2 neg cv004 12641
## cv005_29357.txt 328 747 1 neg cv005 29357
## cv006_17022.txt 331 641 5 neg cv006 17022
## cv007_4992.txt 325 673 6 neg cv007 4992
## cv008_29326.txt 441 794 10 neg cv008 29326
## cv009_29417.txt 401 965 23 neg cv009 29417
Let us take a look at the IMDb movie review data first
table(corp_movies$sentiment)
##
## neg pos
## 1000 1000
Okay, we have 1k positive and 1k negative movie reviews. Let us train a naive Bayes classifier on these reviews (you can always use other methods, e.g., xgboost or SVM). We will use words as features.
# generate 1500 numbers without replacement
set.seed(300)
id_train <- sample(1:2000, 1500, replace = FALSE)
head(id_train, 10)
## [1] 590 874 1602 985 1692 789 553 1980 1875 1705
# create docvar with ID
corp_movies$id_numeric <- 1:ndoc(corp_movies)

# tokenize texts
toks_movies <- tokens(corp_movies, remove_punct = TRUE, remove_numbers = TRUE) %>%
  tokens_remove(pattern = stopwords("en")) %>%
  tokens_wordstem()
dfmt_movie <- dfm(toks_movies)

# get training set
dfmat_training <- dfm_subset(dfmt_movie, id_numeric %in% id_train)

# get test set (documents not in id_train)
dfmat_test <- dfm_subset(dfmt_movie, !id_numeric %in% id_train)

# Next we train the naive Bayes classifier using textmodel_nb().
tmod_nb <- textmodel_nb(dfmat_training, dfmat_training$sentiment)
summary(tmod_nb)
##
## Call:
## textmodel_nb.dfm(x = dfmat_training, y = dfmat_training$sentiment)
##
## Class Priors:
## (showing first 2 elements)
## neg pos
## 0.5 0.5
##
## Estimated Feature Scores:
## plot two teen coupl go church parti
## neg 0.002579 0.002318 0.0002870 0.0007157 0.002663 8.719e-05 0.0002652
## pos 0.001507 0.002338 0.0001656 0.0005456 0.002348 8.768e-05 0.0002728
## drink drive get accid one guy die
## neg 1.199e-04 0.0003052 0.004487 9.446e-05 0.007389 0.0014459 0.0005486
## pos 9.418e-05 0.0002630 0.003783 1.851e-04 0.007355 0.0009937 0.0005488
## girlfriend continu see life nightmar deal watch
## neg 0.0003124 0.0003161 0.002558 0.001435 0.0001199 0.0004323 0.001642
## pos 0.0002338 0.0003215 0.003020 0.002497 0.0001202 0.0005196 0.001539
## movi sorta find critiqu mind-fuck generat touch
## neg 0.010118 1.090e-05 0.001453 9.446e-05 3.633e-06 0.0002652 0.0002289
## pos 0.007657 1.624e-05 0.001630 8.443e-05 3.247e-06 0.0002923 0.0004449
## cool idea
## neg 0.0003052 0.0008210
## pos 0.0002273 0.0005845
# Naive Bayes can only take features into consideration that occur both in the
# training set and the test set, but we can make the features identical using dfm_match()
dfmat_matched <- dfm_match(dfmat_test, features = featnames(dfmat_training))

# Let's inspect how well the classification worked.
actual_class <- dfmat_matched$sentiment
predicted_class <- predict(tmod_nb, newdata = dfmat_matched)
tab_class <- table(actual_class, predicted_class)
tab_class
## predicted_class
## actual_class neg pos
## neg 213 45
## pos 37 205
# We can use the function confusionMatrix() from the caret package to assess the performance of the classification.
confusionMatrix(tab_class, mode = "everything")
## Confusion Matrix and Statistics
##
## predicted_class
## actual_class neg pos
## neg 213 45
## pos 37 205
##
## Accuracy : 0.836
## 95% CI : (0.8006, 0.8674)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.672
##
## Mcnemar's Test P-Value : 0.4395
##
## Sensitivity : 0.8520
## Specificity : 0.8200
## Pos Pred Value : 0.8256
## Neg Pred Value : 0.8471
## Precision : 0.8256
## Recall : 0.8520
## F1 : 0.8386
## Prevalence : 0.5000
## Detection Rate : 0.4260
## Detection Prevalence : 0.5160
## Balanced Accuracy : 0.8360
##
## 'Positive' Class : neg
##
Of course, you can achieve this using just the caret package as well; a sketch follows below.
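A rough sketch of what that could look like (this assumes the naivebayes backend that caret's "naive_bayes" method wraps is installed; note that densifying a dfm is memory-hungry for large vocabularies):
# caret wants a data.frame/matrix of features plus a factor outcome
x_train <- as.data.frame(as.matrix(dfmat_training))
y_train <- factor(dfmat_training$sentiment)
nb_fit <- train(
  x = x_train, y = y_train, method = "naive_bayes",
  trControl = trainControl(method = "none"),
  tuneGrid = data.frame(laplace = 0, usekernel = FALSE, adjust = 1)
)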
Deep Learning based SA
The following part introduces a simple DNN. You can get the original post via https://tensorflow.rstudio.com/guide/keras/examples/imdb_fasttext/. You can also check another tutorial via https://burtmonroe.github.io/TextAsDataCourse/Tutorials/TADA-IntroToKerasAndTensflowInR.nb.html
library(keras)
library(purrr)
# Function Definitions ----------------------------------------------------

create_ngram_set <- function(input_list, ngram_value = 2){
  indices <- map(0:(length(input_list) - ngram_value), ~1:ngram_value + .x)
  indices %>%
    map_chr(~input_list[.x] %>% paste(collapse = "|")) %>%
    unique()
}

add_ngram <- function(sequences, token_indice, ngram_range = 2){
  ngrams <- map(
    sequences,
    create_ngram_set,
    ngram_value = ngram_range
  )

  seqs <- map2(sequences, ngrams, function(x, y){
    tokens <- token_indice$token[token_indice$ngrams %in% y]
    c(x, tokens)
  })

  seqs
}
# Parameters --------------------------------------------------------------

# ngram_range = 2 will add bi-grams features
ngram_range <- 2
max_features <- 20000
maxlen <- 400
batch_size <- 32
embedding_dims <- 50
epochs <- 5
# Data Preparation --------------------------------------------------------

# Load data
imdb_data <- dataset_imdb(num_words = max_features)

# Train sequences
print(length(imdb_data$train$x))
print(sprintf("Average train sequence length: %f", mean(map_int(imdb_data$train$x, length))))

# Test sequences
print(length(imdb_data$test$x))
print(sprintf("Average test sequence length: %f", mean(map_int(imdb_data$test$x, length))))
if(ngram_range > 1) {

  # Create set of unique n-gram from the training set.
  ngrams <- imdb_data$train$x %>%
    map(create_ngram_set) %>%
    unlist() %>%
    unique()

  # Dictionary mapping n-gram token to a unique integer
  # Integer values are greater than max_features in order
  # to avoid collision with existing features
  token_indice <- data.frame(
    ngrams = ngrams,
    token = 1:length(ngrams) + max_features,
    stringsAsFactors = FALSE
  )

  # max_features is the highest integer that could be found in the dataset
  max_features <- max(token_indice$token) + 1

  # Augmenting x_train and x_test with n-grams features
  imdb_data$train$x <- add_ngram(imdb_data$train$x, token_indice, ngram_range)
  imdb_data$test$x <- add_ngram(imdb_data$test$x, token_indice, ngram_range)
}
# Pad sequences
imdb_data$train$x <- pad_sequences(imdb_data$train$x, maxlen = maxlen)
imdb_data$test$x <- pad_sequences(imdb_data$test$x, maxlen = maxlen)
# Model Definition --------------------------------------------------------

model <- keras_model_sequential()

model %>%
  layer_embedding(
    input_dim = max_features, output_dim = embedding_dims,
    input_length = maxlen
  ) %>%
  layer_global_average_pooling_1d() %>%
  layer_dense(1, activation = "sigmoid")

model %>% compile(
  loss = "binary_crossentropy",
  optimizer = "adam",
  metrics = "accuracy"
)
# Fitting -----------------------------------------------------------------

model %>% fit(
  imdb_data$train$x, imdb_data$train$y,
  batch_size = batch_size,
  epochs = epochs,
  verbose = 2,
  validation_data = list(imdb_data$test$x, imdb_data$test$y)
)
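After fitting, here is a brief sketch of scoring the held-out data (standard keras-for-R calls; the 0.5 threshold on the sigmoid output is an assumption, not part of the original post):
# held-out loss and accuracy
model %>% evaluate(imdb_data$test$x, imdb_data$test$y, verbose = 0)
# hard class labels from the predicted probabilities
probs <- model %>% predict(imdb_data$test$x)
table(predicted = as.integer(probs > 0.5), actual = imdb_data$test$y)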