The availability of multi-word units (MWUs) in NLP lexica has important applications: it enhances parsing precision, aids attachment decisions, and enables more natural interaction between non-specialist users and information retrieval engines, among others. Most statistical approaches to MWU extraction from corpora measure the association between two words, define thresholds for deciding which bigrams may be elected as possible units, and apply complex linguistic filters and language-specific morpho-syntactic rules to filter those units. In this paper we present: a new algorithm (LocalMaxs) for extracting complex units made up of 2 or more adjacent words (n-grams, with n ≥ 2); a new measure of "glue", or association, between the words of an n-gram of any size; an exhaustive comparison of our association measure with other known measures (log-likelihood, χ², etc.); and a new normalization, the fair dispersion point normalization, for current statistical measures (log-likelihood, χ², etc.) that enhances the precision and recall of the MWUs extracted by these measures.
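The LocalMaxs idea named above selects an n-gram as a candidate MWU when its glue score is a local maximum relative to its immediate sub-grams ((n-1)-grams) and super-grams ((n+1)-grams). A minimal Python sketch of that selection rule, assuming a precomputed glue function and a candidate set (the names `localmaxs` and `glue`, and the toy scores below, are illustrative, not the authors' reference implementation):

```python
def localmaxs(glue, candidates):
    """Select n-grams whose glue is a local maximum among neighbours.

    glue:       maps an n-gram (tuple of words) to its association score.
    candidates: iterable of n-grams observed in the corpus.

    Assumed criterion: for n > 2 the glue must be >= that of both
    (n-1)-gram sub-grams; for all n it must be > that of every
    (n+1)-gram super-gram present in the candidate set.
    """
    cand = set(candidates)
    selected = []
    for ng in cand:
        n = len(ng)
        # Immediate sub-grams: drop the last or the first word (n > 2 only).
        subs = [ng[:-1], ng[1:]] if n > 2 else []
        # Immediate super-grams: candidates extending ng by one word.
        supers = [s for s in cand
                  if len(s) == n + 1 and (s[:-1] == ng or s[1:] == ng)]
        g = glue(ng)
        if all(g >= glue(s) for s in subs) and all(g > glue(s) for s in supers):
            selected.append(ng)
    return selected


# Toy glue scores (hypothetical values for illustration only).
scores = {("new", "york"): 0.9,
          ("york", "city"): 0.4,
          ("new", "york", "city"): 0.5}
print(localmaxs(scores.get, scores.keys()))  # only ("new", "york") survives
```

Here ("new", "york") wins because its glue exceeds that of its only super-gram, while ("york", "city") loses to the same super-gram and the trigram loses to its sub-gram; any real glue measure (e.g. the one proposed in the paper) could be substituted for the toy dictionary.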
Ferreira, J. (1999). A Local Maxima method and a Fair Dispersion Normalization for extracting multi-word units from corpora. Sixth Meeting on Mathematics of Language, 369–381. Retrieved from http://hlt.di.fct.unl.pt/jfs/MOL99.pdf