Topic Models Conditioned on Arbitrary Features with Dirichlet-multinomial Regression
Intelligence (2008)
- ISBN: 0974903949
Available from www.cs.umass.edu
Abstract
Although fully generative models have been successfully used to model the contents of text documents, they are often awkward to apply to combinations of text data and document metadata. In this paper we propose a Dirichlet-multinomial regression (DMR) topic model that includes a log-linear prior on document-topic distributions that is a function of observed features of the document, such as author, publication venue, references, and dates. We show that by selecting appropriate features, DMR topic models can meet or exceed the performance of several previously published topic models designed for specific data.
David Mimno
Computer Science Dept.
University of Massachusetts, Amherst
Amherst, MA 01003
Andrew McCallum
Computer Science Dept.
University of Massachusetts, Amherst
Amherst, MA 01003
1 Introduction
Bayesian multinomial mixture models such as latent Dirichlet allocation (LDA) [3] have become a popular method in text analysis due to their simplicity, their usefulness in reducing the dimensionality of the data, and their ability to produce interpretable and semantically coherent topics.
Text data is generally accompanied by metadata, including authors, publication venues, and dates. Many extensions have been proposed to the basic mixture-of-multinomials topic model to take this data into account. The goal of these extensions is generally two-fold. The first motivation is to learn better topics using the additional information. The second is to discover associations and patterns, such as learning a topical profile of a given author, or plotting a timeline of the rise and fall of a topic.
The simplest method of incorporating metadata in generative topic models is to generate both the words and the metadata simultaneously given hidden topic variables. In this type of model, each topic has a distribution over words as in the standard model, as well as a distribution over metadata values. Examples of such models include the authorship model of Erosheva, Fienberg and Lafferty [5], the Topics over Time (TOT) model of Wang and McCallum [15], the CorrLDA model of Blei and Jordan [1] and the named entity models of Newman, Chemudugunta and Smyth [11].
One of the most flexible members of this family is the supervised latent Dirichlet allocation (sLDA) model of Blei and McAuliffe [2]. sLDA generates metadata such as reviewer ratings by learning the parameters of a generalized linear model (GLM) with an appropriate link function and exponential family dispersion function, which are specified by the modeler, for each type of metadata. We show in Section 4.3 that the TOT model is an example of sLDA.
Figure 1: Graphical model representation of a "downstream" topic model, in which metadata m is generated conditioned on the topic assignment variables z of the document and each topic has some parametric distribution over metadata values.
Another approach involves first generating metadata elements and then generating topic variables conditioned on those elements. One example of this type of model is the author-topic model of Rosen-Zvi, Griffiths, Steyvers and Smyth [12]. In this model, words are generated by first selecting an author uniformly from an observed author list and then selecting a topic from a distribution over topics that is specific to that author. Given a topic, words are selected as before. This model assumes that each word is generated by
one and only one author. Similar models, in which a hidden variable selects one of several multinomials over topics, are presented by Mimno and McCallum [10] for discovering topical foci for individual authors and by Dietz, Bickel and Scheffer [4] for inferring the influence of individual references on citing papers.
Figure 2: An example of an "upstream" topic model (Author-Topic). The observed authors determine a uniform distribution over authors. Each word is generated by selecting an author, a, then selecting a topic from that author's topic distribution $\theta_a$, and finally selecting a word from that topic's word distribution.
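The word-generation steps described for the author-topic model can be sketched as follows. This is only an illustration under assumed toy inputs: the function name `sample_author_topic_word`, the author labels, and the probability tables `theta` and `phi` are hypothetical and do not come from the original paper.

```python
import random

def sample_author_topic_word(authors, theta, phi, rng):
    """Generate one word token: author (uniform), then topic, then word."""
    a = rng.choice(authors)  # uniform over the document's observed authors
    topics = list(range(len(theta[a])))
    t = rng.choices(topics, weights=theta[a])[0]  # topic from that author's distribution
    vocab = list(phi[t].keys())
    w = rng.choices(vocab, weights=[phi[t][v] for v in vocab])[0]  # word from that topic
    return a, t, w

# Hypothetical toy parameters: two authors, two topics, two words per topic.
rng = random.Random(0)
theta = {"author_a": [0.9, 0.1], "author_b": [0.2, 0.8]}
phi = [{"topic": 0.7, "model": 0.3}, {"field": 0.5, "random": 0.5}]
tokens = [sample_author_topic_word(["author_a", "author_b"], theta, phi, rng)
          for _ in range(5)]
```

Each returned triple records the single author responsible for the token, which reflects the assumption that every word is generated by one and only one author.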
Previous work in metadata-rich topic modeling has focused either on specially constructed models that cannot accommodate modalities of data beyond their original intention, or more complicated models such as exponential family harmoniums and sLDA, whose flexibility comes at the cost of increasingly intractable inference. In this paper, we propose a new method for modeling the influence of observed non-word features of documents, Dirichlet-multinomial regression (DMR) topic models. In contrast to previous methods, DMR topic models are able to incorporate arbitrary types of observed features with no additional work, yet inference remains relatively simple.
In Section 4 we present comparisons of several topic models designed for specific types of metadata to DMR models conditioned on features that emulate those models. We show that performance of DMR models is at least no worse than similar generative models, and can be considerably better. This gap grows as the richness of the features increases.
2 Modeling the influence of document metadata with Dirichlet-multinomial regression
For each document d, let $x_d$ be a vector containing values for each feature. For example, if the observed features are indicators for the presence of authors, then $x_d$ would include a 1 in the positions for each author listed on document d, and a 0 otherwise. In addition, to account for the mean value of each topic, we include an intercept term or default feature that is always equal to 1.
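As a small illustration of the author-indicator example, the sketch below builds $x_d$ for one document. The helper name `build_feature_vector`, the author labels, and the choice to store the intercept in position 0 are assumptions made for this example only.

```python
def build_feature_vector(doc_authors, author_index):
    """Return x_d: one intercept entry followed by author indicator entries."""
    x = [0.0] * (len(author_index) + 1)
    x[0] = 1.0  # intercept (default feature), always equal to 1
    for name in doc_authors:
        x[author_index[name] + 1] = 1.0  # 1 for each author listed on the document
    return x

# Hypothetical author inventory for a three-author corpus.
author_index = {"author_a": 0, "author_b": 1, "author_c": 2}
x_d = build_feature_vector(["author_a", "author_c"], author_index)
```

Here `x_d` has four entries: the intercept, then a 1 for `author_a`, a 0 for `author_b`, and a 1 for `author_c`.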
For each topic t, we also have a vector $\lambda_t$, with length the number of features.
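1. For each topic t,
   (a) Draw $\lambda_t \sim \mathcal{N}(0, \sigma^2 I)$
   (b) Draw $\phi_t \sim \mathcal{D}(\beta)$
2. For each document d,
   (a) For each topic t let $\alpha_{dt} = \exp(x_d^T \lambda_t)$.
   (b) Draw $\theta_d \sim \mathcal{D}(\alpha_d)$.
   (c) For each word i,
       i. Draw $z_i \sim \mathcal{M}(\theta_d)$.
       ii. Draw $w_i \sim \mathcal{M}(\phi_{z_i})$.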
The model therefore includes three fixed parameters: $\sigma^2$, the variance of the prior on parameter values; $\beta$, the Dirichlet prior on the topic-word distributions; and $|T|$, the number of topics.
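The generative process above can be simulated in a few lines. The following is a minimal sketch, not the authors' implementation: it assumes a symmetric scalar `beta`, draws Dirichlet samples by normalizing Gamma variates, and uses hypothetical names (`generate_corpus`, `sample_dirichlet`) and toy feature vectors.

```python
import math
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalizing independent Gamma(alpha_i, 1) draws."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [v / total for v in g]

def generate_corpus(X, doc_lengths, num_topics, vocab_size, sigma2, beta, rng):
    num_features = len(X[0])
    # Step 1: per-topic regression weights lambda_t and word distributions phi_t.
    lam = [[rng.gauss(0.0, math.sqrt(sigma2)) for _ in range(num_features)]
           for _ in range(num_topics)]
    phi = [sample_dirichlet([beta] * vocab_size, rng) for _ in range(num_topics)]
    docs = []
    for x_d, n_d in zip(X, doc_lengths):
        # Step 2(a): alpha_dt = exp(x_d^T lambda_t).
        alpha_d = [math.exp(sum(x * l for x, l in zip(x_d, lam_t))) for lam_t in lam]
        theta_d = sample_dirichlet(alpha_d, rng)  # Step 2(b)
        tokens = []
        for _ in range(n_d):  # Step 2(c)
            z = rng.choices(range(num_topics), weights=theta_d)[0]
            w = rng.choices(range(vocab_size), weights=phi[z])[0]
            tokens.append((z, w))
        docs.append(tokens)
    return lam, phi, docs

# Toy usage: two documents, an intercept plus two indicator features.
rng = random.Random(0)
X = [[1.0, 1.0, 0.0], [1.0, 0.0, 1.0]]
lam, phi, docs = generate_corpus(X, [8, 5], num_topics=3, vocab_size=6,
                                 sigma2=0.5, beta=0.5, rng=rng)
```

Each document is returned as a list of (topic, word id) pairs, so the topic assignments that would be hidden in real data remain visible for inspection.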
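Integrating over the multinomials $\theta$ and $\phi$, we can construct the complete likelihood:
$$
P(w, z, \lambda) = \prod_d \left[ \frac{\Gamma\left(\sum_t \exp(x_d^T \lambda_t)\right)}{\Gamma\left(\sum_t \exp(x_d^T \lambda_t) + n_d\right)} \prod_t \frac{\Gamma\left(\exp(x_d^T \lambda_t) + n_{t|d}\right)}{\Gamma\left(\exp(x_d^T \lambda_t)\right)} \right] \times \prod_{t,k} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{\lambda_{tk}^2}{2\sigma^2}\right). \tag{1}
$$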
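To make the terms of Equation 1 concrete, the sketch below evaluates its logarithm using `math.lgamma`. It covers only the factors written in Equation 1. The function name `dmr_log_likelihood` is hypothetical, and the code assumes that $n_d$ is the number of word tokens in document d and $n_{t|d}$ is the number of those tokens assigned to topic t, supplied here as `topic_counts[d][t]`.

```python
import math

def dmr_log_likelihood(X, topic_counts, lam, sigma2):
    """Log of Equation 1: Dirichlet-multinomial terms plus the Gaussian prior on lambda."""
    ll = 0.0
    for x_d, counts in zip(X, topic_counts):
        # alpha_dt = exp(x_d^T lambda_t) for every topic t.
        alpha = [math.exp(sum(x * l for x, l in zip(x_d, lam_t))) for lam_t in lam]
        n_d = sum(counts)
        ll += math.lgamma(sum(alpha)) - math.lgamma(sum(alpha) + n_d)
        for a, n in zip(alpha, counts):
            ll += math.lgamma(a + n) - math.lgamma(a)
    for lam_t in lam:
        for l in lam_t:
            # Log density of N(0, sigma2) at lambda_tk.
            ll += -0.5 * math.log(2.0 * math.pi * sigma2) - l * l / (2.0 * sigma2)
    return ll

# Toy check: one document, intercept only, two topics, one token assigned to topic 0.
X = [[1.0]]
topic_counts = [[1, 0]]
lam = [[0.0], [0.0]]
ll = dmr_log_likelihood(X, topic_counts, lam, sigma2=1.0)
```

In this toy case both $\alpha$ values equal 1, so the token term contributes a probability of 1/2 and the two prior factors contribute $1/(2\pi)$, giving `ll` equal to $\log(1/(4\pi))$.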
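The derivative of the log of Equation 1 with respect to the parameter $\lambda_{tk}$ for a given topic t and feature k is therefore
$$
\frac{\partial \ell}{\partial \lambda_{tk}} = \sum_d x_{dk} \exp(x_d^T \lambda_t) \left[ \Psi\left(\sum_{t'} \exp(x_d^T \lambda_{t'})\right) - \Psi\left(\sum_{t'} \exp(x_d^T \lambda_{t'}) + n_d\right) + \Psi\left(\exp(x_d^T \lambda_t) + n_{t|d}\right) - \Psi\left(\exp(x_d^T \lambda_t)\right) \right] - \frac{\lambda_{tk}}{\sigma^2}, \tag{2}
$$
where $\Psi$ is the digamma function and $t'$ indexes the sum over all topics.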
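Equation 2 can be checked numerically with a short sketch. Python's standard library has no digamma function, so the code below includes a simple approximation based on the recurrence $\Psi(x) = \Psi(x+1) - 1/x$ and an asymptotic series. The names `digamma` and `dmr_gradient` and the toy inputs are assumptions for illustration, with `topic_counts[d][t]` standing for $n_{t|d}$.

```python
import math

def digamma(x):
    """Approximate digamma for x > 0 via recurrence plus an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    # ln x - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 * inv - series

def dmr_gradient(X, topic_counts, lam, sigma2):
    """Gradient of the log of Equation 1 with respect to every lambda_tk (Equation 2)."""
    T, F = len(lam), len(lam[0])
    # Start from the Gaussian prior term, -lambda_tk / sigma^2.
    grad = [[-lam[t][k] / sigma2 for k in range(F)] for t in range(T)]
    for x_d, counts in zip(X, topic_counts):
        alpha = [math.exp(sum(x * l for x, l in zip(x_d, lam_t))) for lam_t in lam]
        n_d = sum(counts)
        shared = digamma(sum(alpha)) - digamma(sum(alpha) + n_d)
        for t in range(T):
            bracket = shared + digamma(alpha[t] + counts[t]) - digamma(alpha[t])
            for k in range(F):
                grad[t][k] += x_d[k] * alpha[t] * bracket
    return grad

# Toy check: one document, intercept only, two topics, one token assigned to topic 0.
X = [[1.0]]
topic_counts = [[1, 0]]
lam = [[0.0], [0.0]]
grad = dmr_gradient(X, topic_counts, lam, sigma2=1.0)
```

For this toy input the log likelihood reduces to $\log(\alpha_0 / (\alpha_0 + \alpha_1))$ plus the prior, so the gradient should be close to 0.5 for topic 0 and -0.5 for topic 1, which gives a quick sanity check on the signs in Equation 2.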
We train this model using a stochastic EM sampling scheme, in which we alternate between sampling topic assignments from the current prior distribution conditioned on the observed words and features, and numerically searching for the MAP parameters of the GLM given the topic assignments. Our implementation is based on the standard L-BFGS optimizer [8] and Gibbs sampling-based LDA trainer in the Mallet toolkit [9].
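The alternation described above can be sketched in plain Python. This is a rough illustration and not the Mallet implementation: it swaps L-BFGS for a few fixed-step gradient-ascent updates, assumes a symmetric scalar `beta`, and assumes the usual collapsed Gibbs conditional with a document-specific prior, proportional to $(\alpha_{dt} + n_{t|d})(n_{w|t} + \beta)/(n_t + V\beta)$. All function names, the step size, and the toy corpus are hypothetical.

```python
import math
import random

def digamma(x):
    """Approximate digamma for x > 0 via recurrence plus an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return (result + math.log(x) - 0.5 / x
            - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252)))

def alphas(x_d, lam):
    return [math.exp(sum(x * l for x, l in zip(x_d, lam_t))) for lam_t in lam]

def gibbs_sweep(docs, z, X, lam, beta, V, n_td, n_wt, n_t, rng):
    """Resample every topic assignment once, given the current lambda."""
    T = len(lam)
    for d, words in enumerate(docs):
        alpha = alphas(X[d], lam)
        for i, w in enumerate(words):
            old = z[d][i]
            n_td[d][old] -= 1
            n_wt[old][w] -= 1
            n_t[old] -= 1
            weights = [(alpha[t] + n_td[d][t]) * (n_wt[t][w] + beta) / (n_t[t] + V * beta)
                       for t in range(T)]
            new = rng.choices(range(T), weights=weights)[0]
            z[d][i] = new
            n_td[d][new] += 1
            n_wt[new][w] += 1
            n_t[new] += 1

def gradient_step(X, n_td, lam, sigma2, step):
    """One gradient-ascent step on lambda using Equation 2 (stand-in for L-BFGS)."""
    T, F = len(lam), len(lam[0])
    grad = [[-lam[t][k] / sigma2 for k in range(F)] for t in range(T)]
    for x_d, counts in zip(X, n_td):
        alpha = alphas(x_d, lam)
        shared = digamma(sum(alpha)) - digamma(sum(alpha) + sum(counts))
        for t in range(T):
            bracket = shared + digamma(alpha[t] + counts[t]) - digamma(alpha[t])
            for k in range(F):
                grad[t][k] += x_d[k] * alpha[t] * bracket
    return [[lam[t][k] + step * grad[t][k] for k in range(F)] for t in range(T)]

def train(docs, X, T, V, beta=0.1, sigma2=1.0, iters=20, step=0.01, seed=0):
    rng = random.Random(seed)
    lam = [[0.0] * len(X[0]) for _ in range(T)]
    z = [[rng.randrange(T) for _ in words] for words in docs]
    n_td = [[0] * T for _ in docs]
    n_wt = [[0] * V for _ in range(T)]
    n_t = [0] * T
    for d, words in enumerate(docs):
        for w, t in zip(words, z[d]):
            n_td[d][t] += 1
            n_wt[t][w] += 1
            n_t[t] += 1
    for _ in range(iters):
        gibbs_sweep(docs, z, X, lam, beta, V, n_td, n_wt, n_t, rng)  # sampling step
        for _ in range(5):  # optimization step
            lam = gradient_step(X, n_td, lam, sigma2, step)
    return lam, z, n_td

# Toy corpus: word ids 0-3, intercept plus two indicator features per document.
docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 0, 1, 1], [3, 2, 3, 2]]
X = [[1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [1.0, 1.0, 0.0], [1.0, 0.0, 1.0]]
lam, z, n_td = train(docs, X, T=2, V=4)
```

A quasi-Newton optimizer such as L-BFGS would normally replace the fixed-step updates. Plain gradient ascent is used here only to keep the sketch free of external dependencies.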