Using text mining to analyze user forums
2008 International Conference on Service Systems and Service Management (2008)
- ISBN: 9781424416714
- DOI: 10.1109/ICSSSM.2008.4598504
Available from ieeexplore.ieee.org
Abstract
Product discussion boards are a rich and increasingly exploited source of information about consumer sentiment toward products. Most sentiment analysis has looked at single products in isolation, but users often compare different products, stating which they like better and why. We present a set of techniques for analyzing how consumers view product markets. Specifically, we extract relative sentiment and comparisons between products to understand which attributes users compare products on and which products they prefer on each dimension. We illustrate these methods in an extended case study analyzing the sedan car market.
into consumer preferences.
The key steps involve tagging the product mentions
(information extraction), extracting snippets containing pairs
of product mentions, and extracting the terms used in product
descriptions of comparisons.
A. Information Extraction
We first extract the brand names (28 car companies) and
model names (180 car models) from the discussion board
messages. Brand names were relatively easy to extract – one
can just string-match on the names – while model names
showed significant variation and ambiguity. Extraction by
string matching on names was not suitable for extracting
models because of the variations in the writing styles. Instead,
a set of regular expressions was developed for recognizing
models from the model-names appearing in these informal
texts [11].
The last decade has seen a profusion of papers on
information extraction [7],[9], mostly using machine learning
methods such as HMMs, CRFs, and other maximum entropy
models, and primarily looking at formally written texts such as
newswire articles or scientific abstracts [12] or partially
structured web pages (e.g., seminar announcements). Entity
extraction and resolution from user reviews is much trickier
than from the corpora used in classic IE research. Product
names vary more than personal names, and clues such as
capitalization that exist in formal text are not reliable in user
reviews.
Although hand-crafting extraction rules is out of fashion,
we believe that there are cases, such as these informal product
reviews, when it is substantially faster to build rules by hand
than to label data, write feature extractors, and train models.
This is particularly true since we want not only to tag terms as
products, but also to resolve which product is being referred
to.
The key to constructing such entity extraction tools is to
note that lists of product brands and models are easy to come
by on the web; the trick is to generalize the formal product
names to the many variants used. Terms referring to products
generally consist of a subset of the brand name, model name,
and model number. Any combination of the three components
can be kept or dropped. Different delimiters (e.g. space,
hyphen, or nothing) can also be used between them. Brand
names are mostly clean string matches to a list, while model
names are often abbreviated (or misspelled). Numbers can be
Roman or Arabic.
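As an illustration of how one formal product name can be generalized to its informal variants, the following minimal sketch builds a regular expression from a catalogue entry. It is not the rule set used in the paper: the function name variant_pattern, the ROMAN table, and the example product "Acme Roadster 3" are hypothetical, and the sketch simplifies by always requiring the model name while treating the brand and the number as optional.

```python
import re

# Hypothetical Arabic-to-Roman lookup for small model numbers (illustrative only).
ROMAN = {"1": "I", "2": "II", "3": "III", "4": "IV", "5": "V", "6": "VI"}


def variant_pattern(brand: str, model: str, number: str = "") -> re.Pattern:
    """Compile a regex matching informal variants of one product name.

    Simplifying assumption: the model name is required, while the brand
    and the model number may each be kept or dropped.
    """
    sep = r"[\s\-]?"  # space, hyphen, or nothing between components
    parts = [rf"(?:{re.escape(brand)}{sep})?", re.escape(model)]
    if number:
        roman = ROMAN.get(number)
        if roman is None:
            num = re.escape(number)
        else:
            num = rf"(?:{re.escape(number)}|{roman})"  # Arabic or Roman form
        parts.append(rf"(?:{sep}{num})?")
    # Case-insensitive, because capitalization is unreliable in forum posts.
    return re.compile(r"\b" + "".join(parts) + r"\b", re.IGNORECASE)


pattern = variant_pattern("Acme", "Roadster", "3")
for text in ["I drove the acme-roadster III", "Roadster3 rocks", "two roadsters"]:
    match = pattern.search(text)
    print(text, "->", match.group(0) if match else None)
```

A fuller rule set would also need to cover abbreviated and misspelled model names, which this sketch does not attempt.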
Matching is handled in two stages. First there is a tagging
stage that is based on regular expressions and a list of search
strings. This stage is greedy and tries to match as many terms
as possible, giving high recall. After that, a second stage
removes or modifies the falsely extracted entities in order to
improve the precision. Based on a random sample of 500
messages and manual evaluation of the results we achieved
recall of 89.4% and precision of 96.7% leading to F1 of
92.9%.
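The two-stage idea can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the search strings, the canonical names, and the FALSE_CONTEXT rule (drop a match when the preceding word suggests a non-product sense) are all hypothetical. The small f1 helper simply checks that the reported precision and recall are consistent with the reported F1.

```python
import re

# Hypothetical search strings mapped to canonical product names (illustrative only).
SEARCH_STRINGS = {"accord": "Honda Accord", "civic": "Honda Civic", "focus": "Ford Focus"}
# Hypothetical stage-two rule: preceding words that signal a non-product sense.
FALSE_CONTEXT = {"focus": {"main", "in", "to"}}


def tag_greedy(text):
    """Stage 1: tag every occurrence of any search string (favours recall)."""
    alternatives = "|".join(map(re.escape, SEARCH_STRINGS))
    pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)
    return [(m.start(), m.group(1).lower()) for m in pattern.finditer(text)]


def filter_false(text, tags):
    """Stage 2: drop tags whose preceding word suggests a false match (favours precision)."""
    kept = []
    for start, key in tags:
        before = text[:start].split()
        prev = before[-1].lower() if before else ""
        if prev not in FALSE_CONTEXT.get(key, set()):
            kept.append((start, SEARCH_STRINGS[key]))
    return kept


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


text = "My main focus is mileage, so I picked the Civic over the Focus."
print([name for _, name in filter_false(text, tag_greedy(text))])
print(round(f1(0.967, 0.894), 3))
```

Keeping the first stage permissive and moving all disambiguation into the second stage means each correction rule can be added or removed without touching the matching patterns.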
B. Snippet Extraction
We are interested in seeing which products are compared to
one another, which terms show up most often in connection
with a product or pair of products and what attributes
consumers are using when they compare products. A term
recognizer was built by using a CRF [9] model trained on the
CoNLL-2000 shared task corpus [13]. It uses two consecutive
CRF models - one for part of speech tagging and another for
chunking. After chunking, there is a filtering step, which retains
only noun phrases that contain actual nouns (and not
pronouns, for instance).
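The filtering step can be illustrated with a short sketch. It assumes each chunk is a list of (token, tag) pairs carrying Penn Treebank part-of-speech tags, as in the CoNLL-2000 data, and it assumes that proper nouns count as "actual nouns"; the function names are hypothetical.

```python
def has_real_noun(chunk):
    """Return True if the chunk contains a noun tag (NN, NNS, NNP, NNPS).

    chunk: list of (token, pos_tag) pairs with Penn Treebank tags.
    Pronouns carry PRP or PRP$ and therefore do not qualify.
    """
    return any(tag.startswith("NN") for _, tag in chunk)


def filter_noun_phrases(chunks):
    """Keep only the noun-phrase chunks that contain at least one noun."""
    return [chunk for chunk in chunks if has_real_noun(chunk)]


chunks = [
    [("the", "DT"), ("fuel", "NN"), ("economy", "NN")],
    [("it", "PRP")],
]
print(filter_noun_phrases(chunks))
```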
The terms extracted for cars fall into several categories such
as: Driving Experience (safety, aerodynamic, suspension,
control, maneuverable, noise, performance, power, reliability,
speed), Short and Long-Term Costs (aging, trade_in,
holds_value, insurance_cost, fuel_economy,
maintenance_expense, price), Internal Construction (brakes,
construction, engine, platform, materials), Internal Comfort
(seats, leg_room, interior_space, higher, size), and External
Design (interior, design, sportive, weight, look, paint,
features).
The snippet extraction component takes as input a large set
of sentences in which the relevant product models are labeled.
The output is a set of snippets – small sentence fragments,
each containing a description of opinion, either factual (e.g.,
“Model A’s X is Y”, or “Model A has X”), sentiment-relating
(e.g. “Model A is good”, or “I like model A’s X”) or
comparative (e.g., “Model A is better than model B”, or more
generally, “Model A’s X is better than model B”). We focus
here on the comparative snippets.
Snippet extraction is done using the following steps:
Preprocessing:
The texts are tagged with parts of speech (PoS) and
chunked into noun phrases using a CRF-based tagger and
chunker, both trained on the CoNLL-2000 shared task training set.
During preprocessing, we also find lists of models (“Models
X, Y, and Z”) and convert them into a single term, so that the
system can more easily extract opinions expressed about
multiple models simultaneously.
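One way to collapse a list of models into a single term is sketched below. The token-level approach, the MODEL_LIST(...) output format, the connector set, and the is_model predicate are all assumptions made for illustration; the paper does not describe how this step is implemented.

```python
def collapse_model_lists(tokens, is_model):
    """Merge runs such as 'A , B , and C' of model tokens into one token.

    tokens: list of string tokens; is_model: predicate marking model mentions.
    """
    connectors = {",", "and", "or"}
    out, i = [], 0
    while i < len(tokens):
        if not is_model(tokens[i]):
            out.append(tokens[i])
            i += 1
            continue
        models, j = [tokens[i]], i + 1
        while True:
            k = j
            # Skip over connectors that may separate two model mentions.
            while k < len(tokens) and tokens[k].lower() in connectors:
                k += 1
            if k > j and k < len(tokens) and is_model(tokens[k]):
                models.append(tokens[k])
                j = k + 1
            else:
                break  # connectors not followed by a model stay in the output
        out.append("MODEL_LIST(" + "|".join(models) + ")" if len(models) > 1 else models[0])
        i = j
    return out


known = {"Accord", "Camry", "Altima"}  # illustrative model names
tokens = "The Accord , Camry , and Altima are reliable".split()
print(collapse_model_lists(tokens, lambda t: t in known))
```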
Pattern Extraction:
We generate the set of surface patterns by a suitable
modification of the A-priori association mining algorithm.
Each such pattern is a sequence of tokens including the special
slot-mark tokens for product names and (optionally) skips,
which indicate gaps in the pattern. First, we extract all
sequences (without gaps) of tokens, PoS tags, and noun phrase
(NP) chunks that appear in the set of all sentences with
frequency greater than a given minimal support value. Then
we mine the sentences (as ordered sets of such sequences) for
frequent item-sets. The result is the set of all sufficiently
frequent surface patterns.
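The first step, extracting gap-free sequences that meet a minimal support, can be sketched with level-wise, Apriori-style pruning. This is a simplified illustration and not the authors' modified algorithm: it works on tokens only (no PoS tags or NP chunks), counts support once per sentence, treats the threshold as inclusive, and omits the second step that mines patterns with gaps. The function name and the max_len parameter are hypothetical.

```python
from collections import Counter


def frequent_sequences(sentences, min_support, max_len=4):
    """Mine contiguous token sequences occurring in at least min_support sentences."""
    frequent = {}
    prev = None  # frequent sequences of length n - 1
    for n in range(1, max_len + 1):
        counts = Counter()
        for sent in sentences:
            seen = set()
            for i in range(len(sent) - n + 1):
                gram = tuple(sent[i:i + n])
                # Apriori property: both (n-1)-length sub-sequences must be frequent.
                if prev is not None and (gram[:-1] not in prev or gram[1:] not in prev):
                    continue
                seen.add(gram)
            counts.update(seen)  # each sequence counts once per sentence
        level = {g: c for g, c in counts.items() if c >= min_support}
        if not level:
            break
        frequent.update(level)
        prev = set(level)
    return frequent


sentences = [
    ["MODEL", "is", "better", "than", "MODEL"],
    ["MODEL", "is", "quieter", "than", "MODEL"],
    ["I", "like", "MODEL"],
]
for seq, support in frequent_sequences(sentences, min_support=2).items():
    print(" ".join(seq), support)
```

Here MODEL stands in for the slot-mark token that replaces a tagged product name, so that patterns generalize across products.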
Pattern Filtering:
In order to reduce the number of irrelevant patterns, we


