Topic Model Evaluation
--

Topic models are easy to train, but do they generate useful topics? In this post, we discuss how to calculate several diagnostic metrics that Mallet uses to assess topic quality and conduct a principal component analysis (PCA) to determine which underlying features are most important. Since many of the evaluation metrics are highly correlated, PCA is an appropriate analytical approach. PCA is a statistical technique used to re-express highly correlated multivariate data in uncorrelated components that capture independent pieces of information represented in the larger data.
To accomplish this, we use Mallet to generate 50 topics for a corpus of over 264K posts found on publicly available Facebook pages related to COVID-19 and 50 topics for a corpus of roughly 11 million COVID-19-related Twitter posts. We used hashtag pooling to generate topics for the Twitter corpus. We use Python to calculate diagnostic measures from a Mallet topic-term frequency output file. A subset of example topics is referenced in the following sections for illustration purposes, but all 100 topics were used for the PCA.
Based on our interpretation of the PCA results, we believe LDA topics are distinguished by two primary factors: 1) topic frequency, and 2) top-term specificity. Furthermore, on average, we found that common topics with specific top terms score significantly better on coherence than uncommon topics with unspecific top terms. However, we also found that many poor topics score relatively high on coherence. In other words, our results suggest that topics capturing the specific, main themes of a corpus should be easier to interpret, but the interpretability of a topic doesn't imply the topic is specific or central to a corpus.
Calculating LDA Evaluation Metrics
The following section provides an overview of how to calculate and interpret various evaluation metrics. Reproducible code is provided at the following GitHub repo.
Tokens per topic
This metric measures the number of words assigned to each topic and can be used to assess topic reliability. The general rule of thumb is to look for topics with less extreme token counts. Low token counts indicate a topic appears infrequently in the corpus; hence, there may be too few observations to derive effective word distributions. In contrast, high token counts indicate a topic appears very frequently in the corpus; hence, the topic may be too general.
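As a minimal sketch, assuming the Mallet output has already been parsed into a hypothetical topics-by-vocabulary count matrix, the token count per topic is just a row sum:

```python
import numpy as np

# Hypothetical topic-word count matrix (rows = topics, columns = vocabulary terms),
# e.g., parsed from a Mallet topic-term frequency output file.
topic_word_counts = np.array([
    [  3,   2,   1,   1,   0],   # a rare topic
    [120,  80,  40,  10,   5],   # a mid-frequency topic
    [900, 750, 600, 450, 300],   # a very frequent topic
])

# Tokens per topic is simply the row sum of the count matrix.
tokens_per_topic = topic_word_counts.sum(axis=1)
print(tokens_per_topic)  # [   7  255 3000]
```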

Below are a few examples of topics with low, medium, and high token counts in the Facebook corpus:

As expected, the topics with low token counts are less reliable (e.g., is the second topic about Africa, Australia, or Canada?); the topics with high token counts are too general (e.g., pandemic, safety, health, world, and country are not focused on a specific topic); and the topics with medium token counts appear to focus our attention on specific areas of interest (i.e., business and school closings).
Average Word Length
In theory, longer words should carry more specific meaning. Therefore, the average word length of the top-n terms per topic is used as a proxy for topic specificity.
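A minimal sketch of the calculation, using hypothetical top terms:

```python
def average_word_length(top_terms):
    """Mean character length of a topic's top-n terms."""
    return sum(len(term) for term in top_terms) / len(top_terms)

# Hypothetical top terms for a more specific and a more general topic
print(average_word_length(["international", "restriction", "quarantine", "travel"]))  # 10.0
print(average_word_length(["god", "pray", "faith", "hope"]))                           # 4.0
```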

Below are two examples of topics that have relatively large and small average word lengths. As expected, the topic with a larger average word length (international travel restrictions) is more specific than the topic with a smaller average word length (religion).

Relative Entropy (Uniform Distribution)
Relative entropy, also referred to as Kullback-Leibler (KL) divergence, measures the distance between a topic distribution (P) and an approximation of it (Q). In this example, we use relative entropy to measure the amount of information lost when a uniform distribution is used in place of the Mallet-derived topic distribution. Larger values indicate more specificity.
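Below is a minimal sketch of the calculation. It uses base-2 logs and assumes strictly positive (smoothed) word probabilities; the exact log base is a convention, not something specified in the post.

```python
import numpy as np

def kl_from_uniform(word_probs):
    """KL divergence between a topic's word distribution P and a uniform
    distribution Q over the same terms. Larger values = more specificity."""
    p = np.asarray(word_probs, dtype=float)
    p = p / p.sum()                      # normalize to a proper distribution
    q = np.full_like(p, 1.0 / p.size)    # uniform reference distribution
    return float(np.sum(p * np.log2(p / q)))

print(kl_from_uniform([0.60, 0.20, 0.10, 0.05, 0.05]))  # peaked (specific) topic -> larger
print(kl_from_uniform([0.22, 0.20, 0.20, 0.19, 0.19]))  # flat (general) topic -> near zero
```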

Below are two examples of topics that have relatively large and small uniform entropy values. As expected, the topic with a larger value addresses a specific topic of virus testing; whereas, the topic with a smaller value addresses a more general topic of news podcasts — in other words, we don’t know what the podcasts are about.

Relative Entropy (Corpus)
This metric is identical to our previous entropy calculation except that we measure the KL divergence between the topic distribution and the overall distribution of words in the corpus. Larger values indicate a topic is distinct; smaller values indicate a topic is similar to the corpus distribution. Furthermore, this metric correlates (negatively) with the number of tokens per topic, since frequent topics tend to resemble the overall corpus distribution.
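A sketch of the same calculation against the corpus-wide word distribution, again assuming hypothetical count vectors:

```python
import numpy as np

def kl_from_corpus(topic_word_counts, corpus_word_counts):
    """KL divergence between a topic's word distribution and the corpus-wide
    word distribution. Larger values = a more distinct topic."""
    p = np.asarray(topic_word_counts, dtype=float)
    q = np.asarray(corpus_word_counts, dtype=float)
    p, q = p / p.sum(), q / q.sum()
    mask = p > 0                         # terms with zero mass in the topic contribute nothing
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

corpus = [500, 400, 300, 200, 100]
print(kl_from_corpus([50, 40, 30, 20, 10], corpus))  # mirrors the corpus -> 0.0
print(kl_from_corpus([1, 1, 1, 5, 100], corpus))     # concentrated on a corpus-rare term -> larger
```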

Below are two examples of topics that have relatively large and small corpus entropy values. As expected, the topic with a larger value addresses the specific topic of California-specific news, whereas the topic with a smaller value addresses the more general topic of a national response; its top terms look much like the corpus-wide vocabulary.

Effective Number of Words
The idea behind this measure is to count terms while weighting each count by its relative strength. In other words, how many words does it effectively take to account for a topic's probability mass? We sum the squared probabilities of the words in a topic and then take the inverse of that sum. Smaller numbers indicate more specificity, and this measure correlates (negatively) with uniform entropy.
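Below is a sketch using the standard inverse-Simpson ("effective number of parties") formula, which is our reading of this diagnostic; whether all words or only the top n are included is an implementation detail we treat as an assumption here.

```python
import numpy as np

def effective_number_of_words(word_probs):
    """Inverse Simpson index of a topic's word distribution: roughly how many
    words it effectively takes to account for the topic's probability mass."""
    p = np.asarray(word_probs, dtype=float)
    p = p / p.sum()
    return float(1.0 / np.sum(p ** 2))

print(effective_number_of_words([0.70, 0.10, 0.10, 0.05, 0.05]))  # concentrated -> ~1.9
print(effective_number_of_words([0.20, 0.20, 0.20, 0.20, 0.20]))  # evenly spread -> 5.0
```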

Below are two examples of topics with contrasting values for the effective number of words. As expected, the virus-testing topic is the more specific (more concentrated) of the two, whereas the news-podcast topic is more general.

Exclusivity
This metric measures the extent to which the top words for a topic do not appear as top words in other topics; that is, it measures how "exclusive" the top words are to a given topic. We calculate exclusivity as the average, over each top word, of the probability of that word in the topic divided by the sum of the probabilities of that word across all topics. High exclusivity correlates (negatively) with token count and indicates more specific, less frequent topics.
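A minimal sketch of this calculation over a hypothetical topic-word probability matrix:

```python
import numpy as np

def exclusivity(phi, topic, top_n=10):
    """Average, over a topic's top-n words, of P(word | topic) divided by the
    sum of P(word | topic') across all topics. `phi` is topics x vocabulary."""
    phi = np.asarray(phi, dtype=float)
    top_words = np.argsort(phi[topic])[::-1][:top_n]
    return float(np.mean(phi[topic, top_words] / phi[:, top_words].sum(axis=0)))

# Toy phi matrix: 3 topics over a 6-word vocabulary
phi = np.array([
    [0.40, 0.30, 0.10, 0.10, 0.05, 0.05],
    [0.05, 0.05, 0.40, 0.30, 0.10, 0.10],
    [0.20, 0.20, 0.20, 0.15, 0.15, 0.10],
])
print(exclusivity(phi, topic=0, top_n=3))  # ~0.45
```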

Below are two examples of topics that have relatively large and small exclusivity values. As expected, the topic with a larger value (California-specific news) is more exclusive than the topic with a smaller value (safety).

Coherence
Coherence measures whether the words in a topic tend to co-occur. To calculate coherence, we sum, over pairs of a topic's top words, the log probability that a document containing at least one instance of the higher-ranked word also contains at least one instance of the lower-ranked word. To avoid taking the log of zero, we add a "beta" topic-word smoothing parameter. Large negative values indicate words that do not co-occur often; values closer to zero indicate that words co-occur more often. Because we did not have access to the document-term matrix, we used the gensim CoherenceModel function.
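Below is a minimal sketch of calling gensim's CoherenceModel with the u_mass measure, which implements a similar document co-occurrence statistic. The tokenized documents and topic word lists are hypothetical, and the exact coherence variant and settings used for the post are an assumption here.

```python
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel

# Hypothetical tokenized documents and topic top-term lists
texts = [
    ["wuhan", "china", "virus", "outbreak", "city"],
    ["wuhan", "outbreak", "china", "lockdown"],
    ["vaccine", "gates", "pet", "dog", "people"],
]
topics = [
    ["wuhan", "china", "outbreak"],
    ["vaccine", "pet", "gates"],
]

dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# u_mass coherence is based on document co-occurrence of ranked word pairs
cm = CoherenceModel(topics=topics, corpus=corpus, dictionary=dictionary, coherence="u_mass")
print(cm.get_coherence_per_topic())  # one (negative) score per topic
print(cm.get_coherence())            # average across topics
```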

Below are two examples of topics that have relatively large and small coherence values. As expected, the topic with a larger negative value (vaccines for people and pets) is not very coherent — is this topic about pet vaccines? human vaccines? or Bill Gates? The topic with a smaller negative value (Wuhan outbreak) is very coherent — Wuhan is likely to co-occur with China, and articles about Wuhan are likely to mention the virus outbreak.

Principal Component Analysis
The following section provides an overview of how to conduct a PCA on the evaluation metrics. Since hashtags artificially lengthen terms by concatenating phrases, average word length was excluded from the PCA. Reproducible code and interactive score plots are available at the following GitHub page.
Data Standardization
First, we compare the evaluation metrics associated with each data set. The density plots below show clear differences between the diagnostic measures for Facebook and Twitter topics. For example, Facebook topics have better coherence scores, which implies the top terms of each topic co-occur more often in Facebook posts. Likewise, Facebook topics have higher tokens-per-topic counts, which indicates the top terms of each topic occur more frequently in the Facebook corpus.

Prior to performing PCA, we must standardize the data so the metrics share a common scale: each measure is scaled to a mean of zero and a standard deviation of one. The density plots below show the distributions of the standardized data.
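A minimal sketch of the z-scoring step on a hypothetical per-topic metrics frame (the values are illustrative, not from the study):

```python
import pandas as pd

# Hypothetical data frame of diagnostic metrics, one row per topic
metrics = pd.DataFrame({
    "tokens":          [3000, 1200, 255, 40, 7],
    "uniform_entropy": [0.8, 1.4, 2.1, 2.9, 3.4],
    "corpus_entropy":  [0.5, 1.1, 1.9, 2.6, 3.2],
    "eff_num_words":   [450, 300, 120, 40, 15],
    "exclusivity":     [0.10, 0.20, 0.40, 0.65, 0.85],
    "coherence":       [-2.5, -4.0, -6.0, -11.0, -14.0],
})

# Standardize every metric to mean 0 and standard deviation 1 (z-scores)
standardized = (metrics - metrics.mean()) / metrics.std()
print(standardized.round(2))
```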

Correlation Analysis
Correlation analysis of the evaluation metrics shows that we are dealing with highly correlated multivariate data. There is a strong negative correlation between token count and exclusivity; in other words, exclusive topics do not occur frequently in the corpus. We also see a negative correlation between average word length and the effective number of words, which implies that topics with many effective words tend to have shorter top terms. Corpus entropy has a strong positive correlation with exclusivity, whereas coherence has a strong negative correlation with both corpus entropy and exclusivity. In other words, the top terms of coherent topics tend to be words that are common in the corpus, and coherent topics tend to occur more frequently.
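With a per-topic metrics frame like the hypothetical one above, the correlation matrix is a one-liner:

```python
import pandas as pd

# Hypothetical diagnostic scores for a handful of topics (one row per topic)
metrics = pd.DataFrame({
    "tokens":      [3000, 2100, 255, 40, 7],
    "exclusivity": [0.08, 0.12, 0.40, 0.70, 0.85],
    "coherence":   [-2.5, -3.0, -6.0, -11.0, -14.0],
})

# Pairwise Pearson correlations; note the strongly negative tokens/exclusivity entry
print(metrics.corr().round(2))
```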

Eigen Analysis
A scree plot of the eigenvalues of the correlation matrix suggests we should retain two principal components (PCs). The general rule of thumb is to keep PCs that are “one less than the elbow” of the scree plot or PCs with an eigenvalue of 1 or greater.
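A sketch of the eigen-decomposition behind the scree plot, using random stand-in data in place of the real standardized metrics:

```python
import numpy as np

rng = np.random.default_rng(0)
standardized = rng.standard_normal((100, 6))   # stand-in for the z-scored metrics (topics x metrics)

# Eigenvalues of the correlation matrix, sorted largest-first for a scree plot
eigenvalues = np.sort(np.linalg.eigvalsh(np.corrcoef(standardized, rowvar=False)))[::-1]
print(eigenvalues.round(3))
print((eigenvalues / eigenvalues.sum()).round(3))       # proportion of variance per component
print((eigenvalues >= 1).sum(), "components with eigenvalue >= 1")
```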

Loading Analysis
The loading matrix below shows token count, corpus entropy, exclusivity, and coherence contribute the most to the 1st PC, which explains 44.6% of the variance based on the eigenanalysis (2.679/6 = 44.6%). Uniform entropy and the effective number of words contribute most to the 2nd PC, which explains 31.5% of the variance. Coherence contributes most to the 3rd PC, which explains 11.4% of the variance.
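A sketch of extracting loadings and explained variance with scikit-learn; the random frame is a stand-in for the real standardized diagnostics:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

cols = ["tokens", "uniform_entropy", "corpus_entropy", "eff_num_words", "exclusivity", "coherence"]
rng = np.random.default_rng(42)
standardized = pd.DataFrame(rng.standard_normal((100, 6)), columns=cols)  # stand-in z-scores

pca = PCA()
scores = pca.fit_transform(standardized)            # per-topic PC scores (used in the score plots)

# Loadings: eigenvectors scaled by the square root of their eigenvalues
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)
print(pd.DataFrame(loadings[:, :2], index=cols, columns=["PC1", "PC2"]).round(2))
print(pca.explained_variance_ratio_[:2].round(3))   # share of variance for the first two PCs
```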

Based on the loading plots below, the 1st PC appears to capture topic frequency. Exclusivity and corpus entropy are positioned on the far left and imply a topic does not appear often in the corpus; token count is positioned on the far right and implies a topic appears often in the corpus. The 2nd PC appears to capture top-term specificity. The effective number of words is positioned at the bottom and implies that a topic's top terms are also top terms in many other topics. Uniform entropy is located at the top and indicates that a topic's distribution (as well as its top terms) diverges substantially from a uniform distribution; that is, its probability mass is concentrated on specific terms.

Score Plot Analysis
We examine score plots of the PC values associated with each topic to validate our interpretation of the PCs. Interactive plots with pop-up text showing the top-10 words for each topic are available for download as an HTML document in the GitHub repo. The interactive plots facilitate a more subjective evaluation of specificity and allow us to apply human intuition about which topics are likely to be more prevalent in the corpora.
Sizing the score plot by token count supports our interpretation that the 1st PC captures topic frequency.
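A sketch of such a score plot with marker area scaled by token count; the PC scores and counts here are random stand-ins for the values computed above:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
scores = rng.standard_normal((100, 2))          # stand-in PC1/PC2 scores, one row per topic
tokens = rng.integers(100, 50_000, size=100)    # stand-in tokens-per-topic counts

# Scatter the topics in PC space, scaling marker area by token count
plt.scatter(scores[:, 0], scores[:, 1], s=200 * tokens / tokens.max(), alpha=0.5)
plt.xlabel("PC1 (topic frequency)")
plt.ylabel("PC2 (top-term specificity)")
plt.title("Score plot sized by tokens per topic")
plt.show()
```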

Likewise, sizing the points by the effective number of words supports our interpretation that the 2nd PC captures top-term specificity.

The score plot with points sized by coherence scores is less clear. The more coherent topics (i.e., the points with a smaller diameter) are concentrated more heavily in the top-right quadrant, and the less coherent topics are concentrated in the bottom left. However, there are many relatively small points scattered throughout all four quadrants. Given that coherence scores are based on top terms co-occurring in documents, this scattered pattern makes sense, as the following examples illustrate.

For example, the Twitter topic in the top left includes the following top terms: economy, people, urge, coronavirus, million, debt, package, needed, student, and stimulate. This topic is very coherent — terms like “stimulate” and “economy” or “student” and “debt” are common word pairings. Hence, it’s very plausible that posts about economic stimulus related to student debt could generate this topic. Likewise, it’s reasonable to think this topic would be relatively less central to a COVID-19 discussion focused on health risks, disease prevention, and government restrictions.
In contrast, the Facebook topic in the bottom right includes the following top terms: people, government, crisis, world, pandemic, time, coronavirus, political, country, and public. This topic is not very specific, but the top terms seem like words that would co-occur frequently. This topic also seems like it would be central to the COVID-19 discussion.
Box Plot of Coherence Scores by Quadrant
Plotting the normalized coherence scores by quadrant of the score plot shows that coherence is significantly better for common topics with specific top terms. However, for the Twitter topics, uncommon topics with unspecific terms overlap substantially with common topics with specific terms. Stated simply, coherence looks like a good measure most of the time, but not always.
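A sketch of the quadrant comparison, assuming (per the interpretation above) that PC1 increases with topic frequency and PC2 with top-term specificity; the PC scores and coherence values are random stand-ins:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
df = pd.DataFrame({
    "pc1": rng.standard_normal(100),            # stand-in per-topic PC scores
    "pc2": rng.standard_normal(100),
    "coherence_z": rng.standard_normal(100),    # stand-in normalized coherence scores
})

# Label each topic by its score-plot quadrant, then compare coherence across quadrants
df["quadrant"] = np.select(
    [(df.pc1 >= 0) & (df.pc2 >= 0), (df.pc1 < 0) & (df.pc2 >= 0),
     (df.pc1 < 0) & (df.pc2 < 0), (df.pc1 >= 0) & (df.pc2 < 0)],
    ["Q1: common/specific", "Q2: rare/specific", "Q3: rare/general", "Q4: common/general"],
    default="other",
)
df.boxplot(column="coherence_z", by="quadrant", rot=20)
plt.show()
```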

Conclusion
So, what is the best metric to evaluate topic quality? It depends.
If the goal is to find topics that are most representative of a corpus of documents, we believe the combination of high token count and high uniform entropy will identify relatively coherent topics. Moreover, we don’t think using coherence alone is prudent. Coherence doesn’t imply a topic is central to a discussion, and it doesn’t imply a topic has a specific focus.
In contrast, if the goal is to quickly surface unique insights that may not be readily apparent, even after reading many documents, then low token count and high uniform entropy may be better suited. The downside is that these topics may require more contextual cues and effort to understand how the top terms are related.