This vignette describes how time series can be derived from a topic model using the documents’ dates and, optionally, their sentiment. Please refer to the “Basic usage” vignette for an introduction to topic model estimation.
The example dataset shipped with the package already contains two docvars: .date and .sentiment. Because they use these exact names, sentopics treats them as internal dates and internal sentiment when creating a topic model. These values may be accessed or modified through the helper functions sentopics_date() and sentopics_sentiment().
library("xts")
library("data.table")
library("sentopics")
data("ECB_press_conferences_tokens")
head(docvars(ECB_press_conferences_tokens))
# .date doc_id title
# 1 1998-06-09 1 ECB Press conference: Introductory statement
# 2 1998-06-09 1 ECB Press conference: Introductory statement
# 3 1998-06-09 1 ECB Press conference: Introductory statement
# 4 1998-06-09 1 ECB Press conference: Introductory statement
# 5 1998-06-09 1 ECB Press conference: Introductory statement
# 6 1998-06-09 1 ECB Press conference: Introductory statement
# section_title
# 1 Willem F. Duisenberg, President of the European Central Bank, 9 June 1998
# 2 Willem F. Duisenberg, President of the European Central Bank, 9 June 1998
# 3 Willem F. Duisenberg, President of the European Central Bank, 9 June 1998
# 4 Willem F. Duisenberg, President of the European Central Bank, 9 June 1998
# 5 Willem F. Duisenberg, President of the European Central Bank, 9 June 1998
# 6 Willem F. Duisenberg, President of the European Central Bank, 9 June 1998
# .sentiment
# 1 -0.01470588
# 2 -0.02500000
# 3 0.00000000
# 4 0.00000000
# 5 0.00000000
# 6 0.00000000
set.seed(123)
lda <- LDA(ECB_press_conferences_tokens, K = 9, alpha = 1, beta = 0.001)
head(sentopics_date(lda))
# .id .date
# <char> <Date>
# 1: 1_1 1998-06-09
# 2: 1_2 1998-06-09
# 3: 1_3 1998-06-09
# 4: 1_4 1998-06-09
# 5: 1_5 1998-06-09
# 6: 1_6 1998-06-09
head(sentopics_sentiment(lda))
# .id .sentiment
# <char> <num>
# 1: 1_1 -0.01470588
# 2: 1_2 -0.02500000
# 3: 1_3 0.00000000
# 4: 1_4 0.00000000
# 5: 1_5 0.00000000
# 6: 1_6 0.00000000
For this example, the documents’ sentiment was computed using the sentometrics package. For further details on this sentiment computation, please refer to the script used in /data-raw/ on GitHub.
Now that the lda object contains dates and sentiment, we already have enough information to compute a sentiment index using sentiment_series(), which aggregates documents by period. By default, it returns an xts object.
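To make the idea of aggregating documents by period concrete, here is a minimal base R sketch on toy data. The data frame, its column names and its values are hypothetical and are not taken from the ECB dataset. The sketch only shows an equal-weighted monthly average; it does not reproduce any scaling or rolling-window step that sentiment_series() may apply.

```r
# Toy data: four hypothetical documents with a date and a sentiment score
docs <- data.frame(
  date      = as.Date(c("2020-01-05", "2020-01-20", "2020-02-10", "2020-02-25")),
  sentiment = c(0.10, -0.30, 0.20, 0.40)
)

# Map each date to the first day of its month, which identifies the period
docs$period <- as.Date(format(docs$date, "%Y-%m-01"))

# Equal-weighted average of document sentiment within each period
monthly <- aggregate(sentiment ~ period, data = docs, FUN = mean)
monthly
```

Each row of monthly holds one period and the mean sentiment of the documents dated within it.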
Estimating the topic model will allow enriching this sentiment series with topical content. The model should be estimated until it returns satisfactory topics. Labeling the topics facilitates the subsequent analysis.
lda <- fit(lda, 1000)
sentopics_labels(lda) <- list(
topic = c(
"Economic growth & Inflation", "Banking", "Payment services",
"European single market", "Monetary policy & Negative rate",
"Monetary policy & Price stability", "Others", "Banking supervision",
"Financial markets"
)
)
plot(lda)
The estimated topic model adds a layer of topical proportions to the existing documents. This appears clearly when using melt() on the model. Leveraging the topic and sentiment information at the document level, we can compute the share of sentiment that belongs to a given topic.
document_datas <- sentopics::melt(lda, include_docvars = TRUE)
head(document_datas)
# topic prob .date .id doc_id
# <fctr> <num> <Date> <char> <char>
# 1: Economic growth & Inflation 0.07692308 1998-06-09 1_1 1
# 2: Economic growth & Inflation 0.03125000 1998-06-09 1_2 1
# 3: Economic growth & Inflation 0.06666667 1998-06-09 1_3 1
# 4: Economic growth & Inflation 0.10526316 1998-06-09 1_4 1
# 5: Economic growth & Inflation 0.05000000 1998-06-09 1_5 1
# 6: Economic growth & Inflation 0.04347826 1998-06-09 1_6 1
# title
# <char>
# 1: ECB Press conference: Introductory statement
# 2: ECB Press conference: Introductory statement
# 3: ECB Press conference: Introductory statement
# 4: ECB Press conference: Introductory statement
# 5: ECB Press conference: Introductory statement
# 6: ECB Press conference: Introductory statement
# section_title
# <char>
# 1: Willem F. Duisenberg, President of the European Central Bank, 9 June 1998
# 2: Willem F. Duisenberg, President of the European Central Bank, 9 June 1998
# 3: Willem F. Duisenberg, President of the European Central Bank, 9 June 1998
# 4: Willem F. Duisenberg, President of the European Central Bank, 9 June 1998
# 5: Willem F. Duisenberg, President of the European Central Bank, 9 June 1998
# 6: Willem F. Duisenberg, President of the European Central Bank, 9 June 1998
# .sentiment .sentiment_scaled
# <num> <num>
# 1: -0.01470588 -2.8778373
# 2: -0.02500000 -4.6258752
# 3: 0.00000000 -0.3806404
# 4: 0.00000000 -0.3806404
# 5: 0.00000000 -0.3806404
# 6: 0.00000000 -0.3806404
head(document_datas[, list(.date, topic, share_of_sentiment = prob * .sentiment), keyby = ".id"])
# Key: <.id>
# .id .date topic share_of_sentiment
# <char> <Date> <fctr> <num>
# 1: 100_1 2006-06-08 Economic growth & Inflation 0.002545455
# 2: 100_1 2006-06-08 Banking 0.005090909
# 3: 100_1 2006-06-08 Payment services 0.002545455
# 4: 100_1 2006-06-08 European single market 0.002545455
# 5: 100_1 2006-06-08 Monetary policy & Negative rate 0.033090909
# 6: 100_1 2006-06-08 Monetary policy & Price stability 0.002545455
Using this share of sentiment and the documents’ dates, one may compute two additional outputs: a breakdown of the sentiment time series and a time series of the sentiment expressed by each topic. The difference between the two outputs lies in how documents are aggregated. The breakdown averages the documents’ shares of sentiment with equal weighting, whereas computing the sentiment expressed by a topic requires weighting documents by their attention to that topic. These two aggregations are implemented through the sentiment_breakdown() and sentiment_topics() functions.
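The distinction between the two aggregations can be illustrated with a small base R sketch for a single period. The toy data, the column names and the exact weighting formula are assumptions made for illustration. In particular, dividing by the summed topic proportions is one natural reading of "weighting documents by their attention", and it is not necessarily the package's internal implementation.

```r
# Toy long-format data: two hypothetical documents in one period, two topics.
# `prob` is a document's topic proportion; `sentiment` is its sentiment score.
toy <- data.frame(
  id        = c("d1", "d1", "d2", "d2"),
  topic     = c("A", "B", "A", "B"),
  prob      = c(0.9, 0.1, 0.2, 0.8),
  sentiment = c(1.0, 1.0, -1.0, -1.0)
)

# Share of each document's sentiment attributed to each topic
toy$share <- toy$prob * toy$sentiment

n_docs <- length(unique(toy$id))

# Breakdown: every document counts equally, so divide by the number of documents
breakdown <- tapply(toy$share, toy$topic, sum) / n_docs

# Topic sentiment (assumed formula): weight documents by their attention to the topic
topic_sentiment <- tapply(toy$share, toy$topic, sum) /
  tapply(toy$prob, toy$topic, sum)

breakdown
topic_sentiment
```

In this toy case the breakdown values are 0.35 for topic A and -0.35 for topic B, which sum to the average document sentiment of 0. The topic-weighted values differ because each document contributes in proportion to how much it discusses the topic.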
head(na.omit(sentiment_breakdown(lda, period = "month", rolling_window = 6)))
# sentiment Economic growth & Inflation Banking Payment services
# 1998-11-01 -1.0298569 -0.1151266 -0.07707704 -0.12772970
# 1998-12-01 -1.0735310 -0.1210672 -0.11188915 -0.13411588
# 1999-01-01 -0.9010599 -0.1172294 -0.09564143 -0.08154807
# 1999-02-01 -1.1255928 -0.1479729 -0.11047193 -0.09931550
# 1999-03-01 -1.2070330 -0.1985380 -0.12178704 -0.08551730
# 1999-04-01 -1.4144708 -0.2335188 -0.14502420 -0.10177635
# European single market Monetary policy & Negative rate
# 1998-11-01 -0.07785681 -0.2273722
# 1998-12-01 -0.08116835 -0.1873457
# 1999-01-01 -0.07046946 -0.1351671
# 1999-02-01 -0.09471892 -0.1828060
# 1999-03-01 -0.11260427 -0.1895354
# 1999-04-01 -0.11947086 -0.2441200
# Monetary policy & Price stability Others Banking supervision
# 1998-11-01 -0.06613374 -0.06733011 -0.1180358
# 1998-12-01 -0.06897499 -0.06746480 -0.1268454
# 1999-01-01 -0.04998243 -0.06777561 -0.1113562
# 1999-02-01 -0.06554454 -0.08750756 -0.1362076
# 1999-03-01 -0.06821770 -0.09840632 -0.1276992
# 1999-04-01 -0.09015224 -0.11476663 -0.1471136
# Financial markets
# 1998-11-01 -0.1531949
# 1998-12-01 -0.1746596
# 1999-01-01 -0.1718902
# 1999-02-01 -0.2010478
# 1999-03-01 -0.2047278
# 1999-04-01 -0.2185282
head(na.omit(sentiment_topics(lda, period = "month", rolling_window = 6)))
# Economic growth & Inflation Banking Payment services
# 1998-11-01 -1.456447 -1.177686 -1.4257697
# 1998-12-01 -1.552844 -1.441269 -1.5192832
# 1999-01-01 -1.469062 -1.204640 -0.9547131
# 1999-02-01 -1.768688 -1.208604 -1.2344803
# 1999-03-01 -2.100009 -1.098580 -1.2065969
# 1999-04-01 -2.405278 -1.266237 -1.4085298
# European single market Monetary policy & Negative rate
# 1998-11-01 -1.149206 -0.6988520
# 1998-12-01 -1.156467 -0.6212429
# 1999-01-01 -1.055477 -0.4440184
# 1999-02-01 -1.259723 -0.7466193
# 1999-03-01 -1.303717 -0.9473358
# 1999-04-01 -1.364087 -1.2799911
# Monetary policy & Price stability Others Banking supervision
# 1998-11-01 -0.8472344 -0.6396175 -1.632095
# 1998-12-01 -0.8272929 -0.6404772 -1.796555
# 1999-01-01 -0.6291579 -0.6114367 -1.571457
# 1999-02-01 -0.7423391 -0.8948703 -1.734508
# 1999-03-01 -0.7329459 -1.0225091 -1.388107
# 1999-04-01 -0.9294124 -1.2413070 -1.598685
# Financial markets
# 1998-11-01 -1.554975
# 1998-12-01 -1.873536
# 1999-01-01 -1.870179
# 1999-02-01 -2.071598
# 1999-03-01 -1.995909
# 1999-04-01 -2.080132
Furthermore, these functions have built-in plotting counterparts that are directly accessible using the plot_ prefix.