A Local Maxima method and a Fair Dispersion Normalization for extracting multi-word units from corpora

  • Ferreira J
Citations: N/A
Readers: 41 (Mendeley users who have this article in their library)

Abstract

The availability of multi-word units (MWUs) in NLP lexica has important applications: it enhances parsing precision, helps in attachment decisions and enables more natural interaction of non-specialist users with information retrieval engines, among others. Most statistical approaches to MWU extraction from corpora measure the association between two words, define thresholds for deciding which bigrams may be elected as possible units, and use complex linguistic filters and language-specific morpho-syntactic rules to filter those units. In this paper we present: a new algorithm (LocalMaxs) for extracting complex units made up of 2 or more adjacent words (n-grams, with n ≥ 2); a new measure of "glue" or association between the words of any size n-gram; an exhaustive comparison of our association measure with other known measures (Loglike, φ², etc.); and a new normalization, the fair dispersion point normalization, for current statistical measures (Loglike, φ², etc.) that enhances the precision and recall of the MWUs extracted by these measures.
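
The abstract condenses three technical pieces: a glue (association) score defined for n-grams of any size, the fair dispersion point normalization (which averages over every way of splitting an n-gram into a left and a right part), and the LocalMaxs criterion (as commonly described: keep an n-gram whose glue is not lower than that of its (n-1)-gram subparts and strictly higher than that of every (n+1)-gram that contains it). The Python sketch below illustrates one plausible reading of how these pieces fit together; the SCP-style glue, the function names, and the min_freq threshold are assumptions made for illustration, not the paper's reference implementation.

```python
# Minimal sketch of an SCP-style "glue" with a fair-dispersion-style
# normalization, plus a LocalMaxs-style selection criterion.
# Names, thresholds, and probability estimates are illustrative assumptions.

from collections import Counter


def ngram_counts(tokens, max_n):
    """Count all contiguous n-grams up to length max_n in a token sequence."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def scp_fair(ngram, counts, total):
    """SCP-style glue with a fair dispersion normalization:
    p(w1..wn)^2 divided by the average, over the n-1 split points,
    of p(w1..wi) * p(w(i+1)..wn)."""
    n = len(ngram)
    p = counts[ngram] / total
    if n == 1:
        return p
    avg = sum(
        (counts[ngram[:i]] / total) * (counts[ngram[i:]] / total)
        for i in range(1, n)
    ) / (n - 1)
    return p * p / avg if avg > 0 else 0.0


def local_maxs(tokens, max_n=6, min_freq=2):
    """Return n-grams (n >= 2) whose glue is a local maximum: not lower than
    the glue of their (n-1)-gram subparts and strictly higher than the glue
    of every observed (n+1)-gram that contains them."""
    counts = ngram_counts(tokens, max_n + 1)
    total = len(tokens)
    glue = {g: scp_fair(g, counts, total)
            for g in counts if counts[g] >= min_freq}

    mwus = []
    for g, score in glue.items():
        n = len(g)
        if n < 2 or n > max_n:
            continue
        subparts = [g[:-1], g[1:]]                 # the two (n-1)-grams inside g
        superparts = [s for s in glue              # (n+1)-grams containing g
                      if len(s) == n + 1 and (s[:-1] == g or s[1:] == g)]
        ok_down = n == 2 or all(score >= glue.get(x, 0.0) for x in subparts)
        ok_up = all(score > glue.get(y, 0.0) for y in superparts)
        if ok_down and ok_up:
            mwus.append((g, score))
    return sorted(mwus, key=lambda kv: kv[1], reverse=True)
```

Calling local_maxs(tokens) on a tokenized corpus yields candidate MWUs ranked by glue. Per the abstract, the same fair dispersion point normalization can also be applied to other association measures (Loglike, φ², etc.) in place of the SCP-style score assumed here.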

Citation (APA)

Ferreira, J. (1999). A Local Maxima method and a Fair Dispersion Normalization for extracting multi-word units from corpora. Sixth Meeting on Mathematics of Language, 369–381. Retrieved from http://hlt.di.fct.unl.pt/jfs/MOL99.pdf
