Automatic topic segmentation and labeling in multiparty dialogue
Abstract
This study concerns how to segment a scenario-driven multiparty dialogue and how to label these segments automatically. We apply approaches that have been proposed for identifying topic boundaries at a coarser level to the problem of identifying agenda-based topic boundaries in scenario-based meetings. We also develop conditional models to classify segments into topic classes. Experiments in topic segmentation show that a supervised classification approach that combines lexical and conversational features outperforms the unsupervised lexical chain-based approach, achieving 20% and 12% improvement on segmenting top-level and sub-topic segments, respectively. Experiments in topic classification suggest that it is possible to automatically categorize segments into appropriate topic classes given only the transcripts. Training with features selected using the log-likelihood ratio improves the results by 13.3%.
Pei-Yun Hsueh and Johanna D. Moore
School of Informatics
University of Edinburgh
Edinburgh, EH8 9LW, GB
1. INTRODUCTION
This study concerns the problem of segmenting a conversation record into a number of smaller segments and that of classifying each locally coherent segment into topic classes. Our interest in the problem is two-fold. First, topic segmentation and labeling provides the right level of detail for users to interpret what has transpired and locate relevant information in a multiparty dialogue. For example, upper management can efficiently locate critical decisions made in a product design meeting by browsing the topic hierarchies. Second, it can lend support to the development of computer-supported collaborative work applications, where group meeting records are automatically processed in order to extract information for summarization, question answering, and providing thumbnail views on mobile devices.
2. RELATED WORK
Past research has explored the effect of a variety of features on characterizing topic boundaries. For example, [1] has studied lexical cohesion and proposed the TextTiling algorithm, an unsupervised approach that hypothesizes boundaries as points where the lexical cohesion score changes significantly. [2] and [3] have also used lexical cohesion to hypothesize segment boundaries in broadcast news transcripts and spontaneous speech. Recent advances in statistical text classification have inspired researchers to cast the segmentation task as a binary classification task. Various combinations of features have been proposed to train the classification models, e.g., prosodic cues [4, 5], lexical features (N-grams) and discourse cues [6], and lexical cohesion and conversational features [3].
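The idea of hypothesizing boundaries where lexical cohesion drops can be illustrated with a minimal sketch. It is not the TextTiling algorithm of [1] (which adds smoothing and depth scoring) nor any of the cited systems. The stopword list, the window size, the threshold, and all function names are assumptions made for illustration only.

```python
import math
import re
from collections import Counter

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"the", "a", "is", "for"}


def bag_of_words(utterance):
    """Lowercase, tokenize, and count the non-stopword tokens of one utterance."""
    tokens = re.findall(r"[a-z']+", utterance.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)


def cosine(a, b):
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cohesion_scores(utterances, window=2):
    """Score each gap between utterances by the similarity of the windows on either side."""
    bags = [bag_of_words(u) for u in utterances]
    scores = []
    for gap in range(1, len(bags)):
        left = sum(bags[max(0, gap - window):gap], Counter())
        right = sum(bags[gap:gap + window], Counter())
        scores.append(cosine(left, right))
    return scores


def hypothesize_boundaries(scores, threshold=0.1):
    """Return utterance indices preceded by a low-cohesion local minimum."""
    boundaries = []
    for i, s in enumerate(scores):
        prev_s = scores[i - 1] if i > 0 else 1.0
        next_s = scores[i + 1] if i + 1 < len(scores) else 1.0
        if s < threshold and s <= prev_s and s <= next_s:
            boundaries.append(i + 1)  # the boundary falls before utterance i + 1
    return boundaries


# Invented toy transcript: three utterances on one subject, then three on another.
utterances = [
    "the remote control needs a battery",
    "a battery for the remote control is cheap",
    "the battery cost is low",
    "marketing wants a fruit theme",
    "a fruit theme suits the marketing budget",
    "marketing likes the fruit colours",
]
scores = cohesion_scores(utterances)
boundaries = hypothesize_boundaries(scores)
```

On this toy transcript the only gap with no shared vocabulary is the one between the third and fourth utterances, so that is where the sketch places its single boundary.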
[3] has applied a supervised classification approach that combines knowledge from various sources to identify top-level boundaries in meetings of the ICSI corpus. [7] has studied the problem of predicting topic boundaries at different levels of granularity and showed that the supervised classification approach performs better on predicting a coarser level of topic segmentation. As we would like to understand whether this finding is generalizable to agenda-based topic segmentations, this study applies these approaches to the problem of identifying topic boundaries in the scenario-driven meetings of the AMI corpus.
Topic labeling is a task complementary to topic segmentation. Prior research has proposed modeling topics explicitly using generative models, in which a collection of mutually independent observations is probabilistically generated by a hidden topic variable [8, 9]. Generative topic models can also be used to hypothesize segment boundaries where the value of the topic variable for the next observation changes. Other research has proposed merging similar utterances into topic clusters using unsupervised clustering approaches that minimize inter-cluster similarity and maximize intra-cluster similarity [10, 11].
3. METHODOLOGY
3.1. Topic Segmentation
In this study, we compare two segmentation approaches: (1) an unsupervised lexical cohesion-based algorithm (LCseg) using solely lexical cohesion information, and (2) a supervised classification approach that trains decision trees (C4.5) on a combination of lexical cohesion and conversational features.
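The supervised framing can be sketched as follows: every candidate boundary becomes one labeled instance whose features combine a lexical cohesion score with conversational cues, and a tree learner picks the most informative split. The sketch trains only a one-level decision stump by information gain as a stand-in. It is not C4.5 (no gain ratio, pruning, or multi-level splits). The two conversational features (`speaker_change`, `pause`), the toy data, and all names are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    cohesion: float        # lexical cohesion score across the candidate boundary
    speaker_change: float  # 1.0 if the speaker changes, else 0.0 (hypothetical feature)
    pause: float           # silence before the next utterance, in seconds (hypothetical feature)
    is_boundary: bool      # gold label taken from the annotation


def entropy(labels):
    """Binary entropy (in bits) of a list of boolean labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def train_stump(examples, feature_names):
    """Pick the (feature, threshold) split with the lowest weighted entropy,
    which is equivalent to the highest information gain."""
    best = None
    for name in feature_names:
        values = sorted({getattr(e, name) for e in examples})
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left = [e.is_boundary for e in examples if getattr(e, name) <= threshold]
            right = [e.is_boundary for e in examples if getattr(e, name) > threshold]
            remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(examples)
            if best is None or remainder < best[0]:
                # Each side of the split predicts its majority label.
                best = (remainder, name, threshold,
                        sum(left) * 2 > len(left), sum(right) * 2 > len(right))
    if best is None:
        raise ValueError("no feature takes more than one value")
    return best[1:]


def predict(stump, candidate):
    """Classify one candidate boundary with a trained stump."""
    name, threshold, left_label, right_label = stump
    return left_label if getattr(candidate, name) <= threshold else right_label


# Invented training instances, one per candidate boundary.
training = [
    Candidate(0.70, 0.0, 0.2, False),
    Candidate(0.55, 1.0, 0.3, False),
    Candidate(0.10, 1.0, 1.5, True),
    Candidate(0.60, 1.0, 0.1, False),
    Candidate(0.05, 0.0, 2.0, True),
    Candidate(0.45, 0.0, 0.4, False),
]
stump = train_stump(training, ["cohesion", "speaker_change", "pause"])
```

On this toy data the stump splits on the cohesion feature, since low cohesion separates the two boundary instances from the rest. A full decision tree would go on to split each side again on the remaining features.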
The first approach, LCseg, hypothesizes that a major topic