Automatic labeling of multinomial topic models

by Qiaozhu Mei, Xuehua Shen, ChengXiang Zhai
Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining KDD 07 (2007)

Abstract

Multinomial distributions over words are frequently used to model topics in text collections. A common, major challenge in applying all such topic models to any text mining problem is to label a multinomial topic model accurately so that a user can interpret the discovered topic. So far, such labels have been generated manually in a subjective way. In this paper, we propose probabilistic approaches to automatically labeling multinomial topic models in an objective way. We cast this labeling problem as an optimization problem involving minimizing Kullback-Leibler divergence between word distributions and maximizing mutual information between a label and a topic model. Experiments with a user study were conducted on two text data sets with different genres. The results show that the proposed labeling methods are quite effective at generating labels that are meaningful and useful for interpreting the discovered topic models. Our methods are general and can be applied to labeling topics learned through all kinds of topic models such as PLSA, LDA, and their variations.

Qiaozhu Mei, Xuehua Shen, Chengxiang Zhai
Department of Computer Science
University of Illinois at Urbana-Champaign
Urbana, IL 61801
{qmei2,xshen,czhai}@uiuc.edu
Categories and Subject Descriptors: H.3.3 [Information
Search and Retrieval]: Text Mining
General Terms: Algorithms
Keywords: Statistical topic models, multinomial distribution,
topic model labeling
1. INTRODUCTION
Statistical topic modeling has attracted much attention
recently in machine learning and text mining [11, 4, 28, 22,
9, 2, 16, 18, 14, 24] due to its broad applications, includ-
ing extracting scientific research topics [9, 2], temporal text
mining [17, 24], spatiotemporal text mining [16, 18], author-
topic analysis [22, 18], opinion extraction [28, 16], and infor-
mation retrieval [11, 27, 25]. Common to most of this work
is the idea of using a multinomial word distribution (also
called a unigram language model) to model a topic in text.
For example, the multinomial distribution shown on the left
side of Table 1 is a topic model extracted from a collection
of abstracts of database literature. This model gives high
probabilities to words such as “view”, “materialized”, and
“warehouse,” so it intuitively captures the topic
“materialized view.” In general, a different distribution can
be regarded as representing a different topic.

Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
KDD’07, August 12–15, 2007, San Jose, California, USA.
Copyright 2007 ACM 978-1-59593-609-7/07/0008 ...$5.00.

Many different topic models have been proposed, which
can extract interesting topics in the form of multinomial dis-
tributions automatically from text. Although the discovered
topic word distributions are often intuitively meaningful, a
major challenge shared by all such topic models is to accu-
rately interpret the meaning of each topic. Indeed, it is gen-
erally very difficult for a user to understand a topic merely
based on the multinomial distribution, especially when the
user is not familiar with the source collection. It would be
hard to answer questions such as “What is a topic model
about?” and “How is one distribution different from an-
other distribution of words?”.
Without an automatic way to interpret the semantics of
topics, existing work on statistical topic modeling generally
either selects the top words in the distribution as primitive
labels [11, 4, 9, 2], or generates more meaningful labels
manually in a subjective manner [17, 16, 18, 24]. However,
neither of these options is satisfactory. Consider the
following topic extracted from a collection of database literature:
Topic Model            Variant Labels
---------------------  ----------------------------------------------
views          0.10    Top Terms:   views, view, materialized,
view           0.10                 maintenance, warehouse, tables
materialized   0.05    Human:       materialized view, data warehouse
maintenance    0.05
warehouse      0.03    Single Term: view, maintenance
tables         0.02    Phrase:      data warehouse, view maintenance
summary        0.02    Sentence:    Materialized view selection and
updates        0.02                 maintenance using multi-query optimization
Table 1: Variant possible labels for a topic model
It is difficult for someone not familiar with the database
domain to infer the meaning of the topic model on the left
just from the top terms. Similar examples can be found in
scientific topics, where extracting top terms is not very
useful for interpreting the coherent meaning of a topic. For
example, a topic labeled with “insulin glucose mice diabetes
hormone”¹ may be a good topic in medical science, but makes
little sense to a general audience.
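The top-terms baseline discussed above can be sketched in a few lines of Python. This is only an illustration: the dictionary reuses the probabilities listed in Table 1, and the helper name `top_terms` is a hypothetical choice, not something defined in this paper.

```python
# Word probabilities copied from Table 1. Only the eight most probable
# words are listed there, so these values do not sum to 1.
topic = {
    "views": 0.10, "view": 0.10, "materialized": 0.05, "maintenance": 0.05,
    "warehouse": 0.03, "tables": 0.02, "summary": 0.02, "updates": 0.02,
}

def top_terms(topic_model, k):
    """Return the k highest-probability words of a topic model.

    Python's sort is stable, so words with equal probability keep the
    order in which they appear in the dictionary.
    """
    ranked = sorted(topic_model.items(), key=lambda item: -item[1])
    return [word for word, _ in ranked[:k]]

# A "primitive" label is simply the concatenation of the top terms.
primitive_label = ", ".join(top_terms(topic, 6))
print(primitive_label)
```

The resulting label is a bag of individually plausible words, which is exactly why it is hard to read for someone outside the domain.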
Manual labeling also has its own problems. Although
manually generated labels are usually more understandable
and better capture the semantics of a topic (see Table 1),
generating such labels requires a lot of human effort.
¹ www.cs.cmu.edu/~lemur/science/topics.html, Topic 26
A more serious problem with manual labeling is that the
labels generated are usually subjective and can easily be bi-
ased towards the user’s personal opinions. Moreover, relying
on human labeling also makes it hard to apply such topic
models to online tasks such as summarizing search results.
Thus it is highly desirable to automatically generate mean-
ingful labels for a topic word distribution so as to facilitate
interpretations of topics. However, to the best of our knowl-
edge, no existing method has been proposed to automati-
cally generate labels for a topic model or a multinomial dis-
tribution of words, other than using a few top words in the
distribution to label a topic. In this paper, we study this
fundamental problem which most statistical topic models
suffer from and propose probabilistic methods to automati-
cally label a topic.
What makes a good label for a topic? Presumably, a
good label should be understandable to the user, capture
the meaning of the topic, and distinguish the topic from
other topics. In general, there are many possible choices of
linguistic components as topic labels, such as single terms,
phrases, or sentences. However, as Table 1 suggests,
single terms are usually too general, and it may not
be easy for a user to interpret the combined meaning of the
terms. A sentence, on the other hand, may be too specific
to accurately capture the general meaning of
a topic. In between these two extremes, a phrase is coherent
and concise enough for a user to understand, while at
the same time broad enough to capture the overall
meaning of a topic. Indeed, when labeling topic models
manually, most people prefer phrases [17, 16, 18, 24]. In this
paper, we propose a probabilistic approach to automatically
labeling topic models with meaningful phrases.
Intuitively, in order to choose a label that captures the
meaning of a topic, we must be able to measure the “seman-
tic distance” between a phrase and a topic model, which is
challenging. We solve this problem by representing the
semantics of a candidate label with a word distribution and
casting this labeling problem as an optimization problem
involving minimizing the Kullback-Leibler divergence between
the topic word distribution and a candidate label word
distribution, which can further be shown to amount to
maximizing mutual information between a label and a topic model.
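To make the idea concrete, the following Python sketch ranks candidate labels by the Kullback-Leibler divergence D(θ‖l) = Σ_w p(w|θ) log(p(w|θ) / p(w|l)), using the standard definition of the divergence. Everything specific here is an assumption made for illustration: the direction of the divergence, the toy distributions, the `epsilon` floor for unseen words, and the names `kl_divergence` and `label_distributions`. It is not the estimation procedure developed later in the paper.

```python
import math

def kl_divergence(p, q, epsilon=1e-9):
    """D(p || q) = sum over w of p(w) * log(p(w) / q(w)), for p(w) > 0.

    q(w) is floored at epsilon so that words missing from q do not cause
    a division by zero. This crude smoothing is an assumption of the sketch.
    """
    return sum(
        pw * math.log(pw / max(q.get(w, 0.0), epsilon))
        for w, pw in p.items()
        if pw > 0
    )

# Hypothetical topic word distribution p(w | theta).
topic = {"view": 0.5, "materialized": 0.3, "warehouse": 0.2}

# Hypothetical word distributions p(w | l) for two candidate labels.
label_distributions = {
    "materialized view": {"view": 0.5, "materialized": 0.4, "warehouse": 0.1},
    "data warehouse": {"view": 0.1, "materialized": 0.1, "warehouse": 0.8},
}

# The best label is the one whose distribution is closest to the topic.
best_label = min(
    label_distributions,
    key=lambda label: kl_divergence(topic, label_distributions[label]),
)
print(best_label)
```

In this toy setting the label whose word distribution puts its mass on the same words as the topic obtains the smaller divergence and is therefore preferred.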
The proposed methods are evaluated using two text data
sets with different genres (i.e., literature and news). The
results of experiments with a user study show that the proposed
labeling methods are quite effective and can automatically
generate labels that are meaningful and useful for
interpreting the topic models.
Our methods are general and can be applied to labeling
a topic learned through all kinds of topic models such as
PLSA, LDA, and their variations. Indeed, they can be
applied as a post-processing step to any topic model, as long
as a topic is represented with a multinomial distribution
over words. Moreover, the use of our method is not limited
to labeling topic models; our method can also be used in
any text management tasks where a multinomial distribu-
tion over words can be estimated, such as labeling document
clusters and summarizing text. By switching the context
where candidate labels are extracted and where the seman-
tic distance between a label and a topic is measured, we can
use our method to generate labels that can capture the con-
tent variation of the topics over different contexts, allowing
us to interpret topic models from different views. Thus our
labeling methods also provide an alternative way of solving a
major task of contextual text mining [18].
The rest of the paper is organized as follows. In Sec-
tion 2, we formally define the problem of labeling multi-
nomial topic models. In Section 3, we propose our proba-
bilistic approaches to generating meaningful phrases as topic
labels. Variations of this general method are discussed in
Section 4, followed by empirical evaluation in Section 5, dis-
cussion of related work in Section 6, and our conclusions in
Section 7.
2. PROBLEM FORMULATION
Given a set of latent topics extracted from a text collec-
tion in the form of multinomial distributions, our goal is,
informally, to generate understandable semantic labels for
each topic. We now formally define the problem of topic
model labeling. We begin with a series of useful definitions.
Definition 1 (Topic Model) A topic model θ in a text
collection C is a probability distribution of words
$\{p(w|\theta)\}_{w \in V}$, where V is a vocabulary set. Clearly, we
have $\sum_{w \in V} p(w|\theta) = 1$.
Intuitively, a topic model can represent a semantically co-
herent topic in the sense that the high probability words
often collectively suggest some semantic theme. For example,
a topic about “SVM” may assign high probabilities to
words such as “support”, “vector” and “kernel.” It is
generally assumed that there are multiple such topic models
in a collection.
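As a minimal sketch of Definition 1, a topic model can be stored as a mapping from each vocabulary word to its probability. The vocabulary, the probabilities, and the helper name `is_topic_model` below are all hypothetical and serve only to illustrate the normalization constraint.

```python
import math

def is_topic_model(p, vocabulary):
    """Check Definition 1: p covers exactly the vocabulary V, every
    probability is non-negative, and the probabilities sum to 1."""
    return (
        set(p) == set(vocabulary)
        and all(prob >= 0 for prob in p.values())
        and math.isclose(sum(p.values()), 1.0)
    )

# Hypothetical vocabulary and an SVM-like topic over it.
vocabulary = ["vector", "kernel", "margin", "query", "index"]
svm_topic = {"vector": 0.40, "kernel": 0.35, "margin": 0.15,
             "query": 0.05, "index": 0.05}

print(is_topic_model(svm_topic, vocabulary))
```

The high-probability words ("vector", "kernel", "margin") jointly suggest the theme, while the remaining words receive only a small share of the probability mass.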
Definition 2 (Topic Label) A topic label, or a “label”,
l, for a topic model θ, is a sequence of words which is se-
mantically meaningful and covers the latent meaning of θ.
Words, phrases, and sentences are all valid labels under
this definition. In this paper, however, we only use phrases
as topic labels.
For the example above, a reasonable label may be “support
vector machine.”
Definition 3 (Relevance Score) The relevance score
of a label to a topic model, $s(l, \theta)$, measures the semantic
similarity between the label and the topic model. Given that
$l_1$ and $l_2$ are both meaningful candidate labels, $l_1$ is a better
label for θ than $l_2$ if $s(l_1, \theta) > s(l_2, \theta)$.
With these definitions, the problem of Topic Model La-
beling can be defined as follows:
Given a topic model θ extracted from a text collection,
the problem of single topic model labeling is to (1) identify
a set of candidate labels $L = \{l_1, \ldots, l_m\}$, and (2) design a
relevance scoring function $s(l_i, \theta)$. With L and s, we can
then select a subset of n labels with the highest relevance
scores $L_\theta = \{l_{\theta,1}, \ldots, l_{\theta,n}\}$ for θ.
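The selection step of this formulation can be sketched as follows, assuming the candidate set and a scoring function are already available. The function names `label_topic` and `overlap_score`, the toy topic, and the candidate phrases are hypothetical; in particular, `overlap_score` is a deliberately naive stand-in for s(l, θ) and is not the relevance score proposed in this paper.

```python
import heapq

def label_topic(theta, candidates, score, n):
    """Return the n candidate labels with the highest score(l, theta),
    ordered from most to least relevant."""
    return heapq.nlargest(n, candidates, key=lambda label: score(label, theta))

def overlap_score(label, theta):
    """Toy relevance score: total topic probability of the label's words."""
    return sum(theta.get(word, 0.0) for word in label.split())

# Hypothetical topic model and candidate label set L.
theta = {"view": 0.30, "materialized": 0.25, "maintenance": 0.15,
         "query": 0.15, "warehouse": 0.10, "data": 0.05}
candidates = ["data warehouse", "materialized view",
              "query optimization", "view maintenance"]

print(label_topic(theta, candidates, overlap_score, n=2))
```

Because the scoring function is passed in as an argument, any other relevance score with the same signature can be substituted without changing the selection logic.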
This definition can be generalized to label multiple topics.
Let $\Theta = \{\theta_1, \ldots, \theta_k\}$ be a set of k topic models, and
$L = \{l_1, \ldots, l_m\}$ be a set of candidate topic labels. The problem
of multiple topic model labeling is to select a subset of $n_i$
labels, $L_i = \{l_{i,1}, \ldots, l_{i,n_i}\}$, for each topic model $\theta_i$. In most
text mining tasks, we would need to label multiple topics.
In some scenarios, we have a set of well accepted candidate
labels (e.g., the Gene Ontology entries for biological topics).
However, in most cases, we do not have such a candidate
set. More generally, we assume that the set of candidate la-
bels can be extracted from a reference text collection, which
is related to the meaning of the topic models. For exam-
ple, if the topics to be labeled are research themes in data
mining, the reasonable labels could be extracted from the
