Collective Latent Dirichlet Allocation
2008 Eighth IEEE International Conference on Data Mining (2008)
- ISBN: 9780769535029
- DOI: 10.1109/ICDM.2008.75
Available from ieeexplore.ieee.org
Zhi-Yong Shen (1,2), Jun Sun (1,2), and Yi-Dong Shen (1)
(1) State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences, Beijing 100190, China
(2) Graduated University, Chinese Academy of Sciences, Beijing 100049, China
{zyshen,junsun,ydshen}@ios.ac.cn
Abstract
In this paper, we propose a new variant of Latent Dirichlet
Allocation (LDA), Collective LDA (C-LDA), for modeling multiple
corpora. C-LDA combines multiple corpora during learning so that
it can transfer knowledge from one corpus to another; meanwhile,
it keeps a discriminative node, which represents the corpus ID,
to constrain the learned topics in each corpus. Compared with LDA
applied locally to the target corpus, C-LDA yields a refined
topic-word distribution; compared with LDA applied globally and
straightforwardly to the combined corpus, C-LDA keeps each topic
specific to one corpus. Experiments on several benchmark document
data sets demonstrate that these advantages give C-LDA improved
performance.
1. Introduction
Modeling the content of documents is a standard task in
information retrieval, text mining and natural language
processing. Topic models based on Latent Dirichlet Allocation
(LDA) [2, 11] have attracted much attention recently because they
can discover the low-dimensional semantic structure of a corpus.
In LDA, each document is assumed to be sampled from a mixture
distribution over latent topics, and each topic is characterized
by a distribution over words. By placing prior probabilities on
these distributions, LDA establishes a complete generative model
for the corpus. Dozens of LDA-based models exist, covering
temporal text mining [14], author-topic analysis [12], supervised
topic models [1], latent Dirichlet co-clustering [10] and
LDA-based bioinformatics [3]. Most of these models are designed
for a single corpus, while in practice we often face numerous
corpora such as newsgroups, web pages and scientific papers. In
this paper, we consider how LDA can be used to model multiple
corpora collectively.
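The generative view of standard LDA described above can be illustrated with a minimal sketch. It is not the authors' code: the function name generate_corpus, the symmetric Dirichlet hyperparameters alpha and beta, and the fixed document length are assumptions made for illustration only.

```python
import numpy as np

def generate_corpus(n_docs, doc_len, n_topics, vocab_size,
                    alpha=0.5, beta=0.1, seed=0):
    """Sample a toy corpus from the standard LDA generative process.

    alpha, beta and the fixed doc_len are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    # One word distribution per topic, drawn from a Dirichlet prior.
    phi = rng.dirichlet(np.full(vocab_size, beta), size=n_topics)
    thetas, docs = [], []
    for _ in range(n_docs):
        # Per-document mixture over latent topics, drawn from a Dirichlet prior.
        theta = rng.dirichlet(np.full(n_topics, alpha))
        # Each word position first picks a topic, then a word from that topic.
        z = rng.choice(n_topics, size=doc_len, p=theta)
        words = np.array([rng.choice(vocab_size, p=phi[k]) for k in z])
        thetas.append(theta)
        docs.append(words)
    return phi, np.array(thetas), docs

phi, thetas, docs = generate_corpus(n_docs=5, doc_len=20,
                                    n_topics=3, vocab_size=30)
```

Here phi holds the topic-word distributions and thetas the per-document topic mixtures; inference in LDA runs in the opposite direction, recovering such quantities from observed words.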
Transfer learning has recently become an active area in machine
learning and data mining; it emphasizes transferring knowledge
across different domains or tasks. The performance of learning
models can be improved by knowledge transferred from extra,
possibly even irrelevant, auxiliary data sets. For example, Wu
and Dietterich [15] propose a way to adjust SVM classifiers with
auxiliary data sources. Raina et al. [9] investigate learning
logistic regression classifiers by incorporating labeled data
from irrelevant categories, constructing an informative prior
from that data. Raina et al. [8] propose a new learning
technique, self-taught learning, which uses irrelevant unlabeled
data to enhance classification performance.
As mentioned in [9], LDA may also model a corpus better with
auxiliary corpora. This mirrors human reading behavior, since
readers routinely draw on knowledge across reading domains. A
straightforward way to apply LDA to multiple corpora is to
combine the target corpus with the auxiliary corpora and treat
the combination as a single corpus. Although knowledge in each
corpus can be transferred to the others in this way, there are
two shortcomings. First, the supervised information, namely which
corpus a document comes from, is discarded. Second, the learned
topics span all corpora, which may not satisfy the learning
objective that the learned topics should be specific to the
target corpus.
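The first shortcoming can be made concrete with a small sketch. The toy corpora and the helper names pool_corpora and pool_with_ids are hypothetical; the point is only that straightforward pooling leaves a topic model with no record of where each document came from.

```python
def pool_corpora(corpora):
    """Merge all corpora into one document list, dropping the corpus IDs."""
    return [doc for docs in corpora.values() for doc in docs]

def pool_with_ids(corpora):
    """Merge all corpora but keep each document's corpus ID alongside it."""
    return [(cid, doc) for cid, docs in corpora.items() for doc in docs]

# Hypothetical toy data: one target corpus and one auxiliary corpus.
corpora = {
    "target": [["topic", "model"], ["word", "prior"]],
    "auxiliary": [["web", "page"]],
}

pooled = pool_corpora(corpora)     # what straightforward global LDA would see
labelled = pool_with_ids(corpora)  # the corpus ID is still available per document
```

Both lists contain the same documents in the same order; only the second retains the corpus ID that a model could use to tie topics to a particular corpus.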
In this paper, we propose a new variant of LDA, Collective LDA
(C-LDA), for modeling multiple corpora. C-LDA combines multiple
corpora during learning so that it can transfer knowledge from
one corpus to another; meanwhile, it keeps a discriminative node,
which represents the corpus ID, to constrain the learned topics
in the target corpus. Compared with LDA applied locally to the
target corpus, C-LDA yields a refined topic-word distribution;
compared with LDA applied globally and straightforwardly to the
combined corpus, C-LDA keeps the learned topics specific to the
target corpus. Through experiments on several benchmark document
data sets, we demonstrate that these advantages give C-LDA
significantly improved performance. Teh et al. [13] propose
Hierarchical Dirichlet Processes (HDP), which can also learn
topics from multiple corpora. How-


