Automatic Extraction of Semantic Relations from Specialized Corpora

by Aristomenis Thanopoulos, Nikos Fakotakis, George Kokkinakis
Proceedings of the 18th conference on Computational linguistics (2000)

Abstract

In this paper we address the problem of discovering word semantic similarities via statistical processing of text corpora. We propose a knowledge-poor method that exploits the sentential context of words to extract similarity relations between them, as well as word clusters that are semantic in nature. The approach aims at full portability across domains and languages and is therefore based on minimal resources.

Available from citeseerx.ist.psu.edu

mutual information and the perplexity
improvement respectively. Such methods are
oriented to language modeling and aim primarily
at rough but fast clustering of large vocabularies.
Brown et al. (1992) also proposed a window
method, introducing the concept of "semantic
stickiness" of two words as their relatively frequent
occurrence close to each other (less than 500
words apart). Although this is an efficient and
entirely knowledge-poor method for extracting
both semantic relations and clusters, the extracted
relations are not restricted to semantic similarity
but extend to thematic roles. Moreover, its
applicability to small and specialized corpora is
uncertain.
3 A knowledge-poor approach
In order to achieve portability we approach the
issue from a knowledge-poor perspective. Syntax-
based methods employ partial parsers, which
require highly language-dependent resources
(morphological/grammatical analysis) and/or a
properly tagged training corpus in order to detect
syntactic relations between sentence constituents.
On the other hand, n-gram methods operate on
large corpora and, in order to reduce the
computational resources required, consider as context
words only the immediately adjacent ones;
medium-distance word context is not exploited.
Since large corpora are available for only a few
domains, we aimed at developing a method for
processing small or medium-sized corpora that
exploits as much contextual information as possible,
that is, the full sentential context of words. Our
approach was driven by the observation that in
domain-constrained corpora, unlike fiction or
general journalese, the vocabulary is limited, the
syntactic structures are not complex, and
medium-distance lexical patterns are frequently
used to express similar facts.
Specifically, we have developed two different
algorithms, which differ in the context
they consider: Word-based and Pattern-based. The
former acquires word-based contextual data
(extending up to sentence boundaries) and extracts
word similarity relations according to the
distributional similarity of these data. The latter detects
common patterns throughout the corpus that
indicate possible word similarities. For example,
consider the sentence fragments:
"...while the S&P index inched up 0.3%."
"The DAX index inched up 0.70 point to close..."
Although their syntactic structures are different,
the common contextual pattern (extending beyond
the immediately adjacent words) indicates a possible
similarity between the tokens 'S&P' and 'DAX'.
Word pairs that persistently exhibit such context
similarity throughout the corpus (frequently
observed in technical texts) are confidently
indicated as semantically similar. Our method
captures such context similarity and derives from it a
proportionate measure of the semantic similarity
between lexical items.
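
As an illustration of the pattern-based idea, the following minimal sketch lists the context words shared by the two fragments above once the candidate tokens and a few function words are set aside. The tokenizer, the tiny function-word list, and the names `tokenize` and `shared_context` are assumptions made for this example; the sketch is not the algorithm defined in the paper.

```python
import re

# Assumed, deliberately tiny function-word list; a real one is language-specific.
FUNCTION_WORDS = {"the", "a", "an", "to", "of", "and", "while"}

def tokenize(fragment):
    """Lowercase a fragment and split it into word-like and number-like tokens."""
    return re.findall(r"[a-z&]+|\d+(?:\.\d+)?%?", fragment.lower())

def shared_context(fragment_a, word_a, fragment_b, word_b):
    """Return context tokens common to both fragments, in the order of fragment_a.

    The focus words themselves and the function words are excluded.
    """
    skip_a = FUNCTION_WORDS | {word_a.lower()}
    skip_b = FUNCTION_WORDS | {word_b.lower()}
    context_b = {t for t in tokenize(fragment_b) if t not in skip_b}
    shared = []
    for token in tokenize(fragment_a):
        if token not in skip_a and token in context_b and token not in shared:
            shared.append(token)
    return shared

fragment_1 = "...while the S&P index inched up 0.3%."
fragment_2 = "The DAX index inched up 0.70 point to close..."
common = shared_context(fragment_1, "S&P", fragment_2, "DAX")
print(common)  # ['index', 'inched', 'up']
```

In this toy setting the shared pattern "index inched up" is what suggests that 'S&P' and 'DAX' may be similar; a single shared pattern is weak evidence, which is why the text requires the similarity to recur throughout the corpus.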
Most approaches (Brown et al., 1992; Li & Abe,
1997) inherently extract semantic knowledge in
the abstracted form of semantic clusters. Our
method produces semantic similarity relations as
an intermediate (and information-richer)
semantic representation formalism, from which
cluster hierarchies can be generated. Importantly,
soft clustering methods can also
be applied to this set of relations to assign
polysemous words to more than one class.
Stock market financial news and Modern Greek
were used as the domain and language test case
respectively; however, demonstrative examples
taken from the WSJ corpus are also used
throughout the paper.
4 Context Similarity Estimation
The main idea supporting context-based word
clustering is that two words that can substitute for one
another in several different contexts, always
yielding meaningful word sequences, are
probably semantically similar. Present n-gram
based methods utilize this assumption, considering
as the context of a focus word only the one or two
immediately adjacent parameter words.
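
The substitution idea can be made concrete with a small sketch. For each word it collects the (left neighbour, right neighbour) pairs the word occurs between, which is the narrow context an n-gram method would see, and then counts the contexts two words share. The toy sentences, the function names, and the raw shared-context count are illustrative assumptions, not a method taken from the paper or from the cited work.

```python
from collections import defaultdict

def adjacent_contexts(sentences):
    """Map each word to the set of (left neighbour, right neighbour) pairs around it."""
    contexts = defaultdict(set)
    for sentence in sentences:
        # Pad with boundary markers so the first and last words also get a context.
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for i in range(1, len(tokens) - 1):
            contexts[tokens[i]].add((tokens[i - 1], tokens[i + 1]))
    return contexts

def shared_context_count(contexts, word_a, word_b):
    """Count the adjacent-word contexts in which both words have been observed."""
    return len(contexts.get(word_a, set()) & contexts.get(word_b, set()))

toy_corpus = [
    "the dax index rose sharply",
    "the nikkei index rose sharply",
    "the dax index fell",
    "the nikkei index fell",
    "the dollar rose sharply",
]
ctx = adjacent_contexts(toy_corpus)
print(shared_context_count(ctx, "dax", "nikkei"))  # 1
print(shared_context_count(ctx, "dax", "dollar"))  # 0
```

Here 'dax' and 'nikkei' are interchangeable between "the" and "index", whereas 'dollar' never occurs in that slot. Everything beyond the immediate neighbours is ignored by this kind of context.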
In the present work, we consider as word context
the whole sentence in which the examined word
appears, excluding only the semantically empty
(i.e. functional) words such as articles,
conjunctions, particles and auxiliaries. Adopting this
notion of word context, we proceed to the following
analysis:
Let us consider a text corpus Tc with vocabulary
Vc, and let Vs ⊆ Vc be the set of words that are of interest
in extracting semantic similarity relations between
them. Vs comprises the non-functional words of