Boundary-Based MWE Segmentation With Text Partitioning

Abstract

This work presents a fine-grained text-chunking algorithm designed for the task of multiword expression (MWE) segmentation. As a lexical class, MWEs include a wide variety of idioms, whose automatic identification is a necessity for the handling of colloquial language. This algorithm's core novelty is its use of non-word tokens, i.e., boundaries, in a bottom-up strategy. Leveraging boundaries refines token-level information, forging high-level performance from relatively basic data. The generality of this model's feature space allows for its application across languages and domains. Experiments spanning 19 different languages exhibit a broadly applicable, state-of-the-art model. Evaluation against recent shared-task data places text partitioning as the best-performing MWE segmentation algorithm overall, covering all MWE classes and multiple English domains (including user-generated text). This performance, coupled with a non-combinatorial, fast-running design, produces an ideal combination for implementations at scale, which are facilitated through the release of open-source software.
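
To make the idea of boundary-based segmentation concrete, here is a minimal Python sketch. It is not the paper's implementation or its released software. It only illustrates the general idea the abstract describes: text is split into word tokens and non-word tokens (boundaries), and each boundary is treated as either binding its neighboring words into one segment or breaking them apart. The `BIND_SCORE` table, the `threshold` parameter, and the function names `tokenize` and `segment` are hypothetical and invented for illustration.

```python
import re

# Hypothetical scores (invented for illustration): how strongly the boundary
# between a left word and a right word should be treated as internal to an MWE.
BIND_SCORE = {
    ("kick", " ", "the"): 0.9,
    ("the", " ", "bucket"): 0.9,
}


def tokenize(text):
    """Split text into alternating word and non-word (boundary) tokens."""
    return re.findall(r"\w+|\W+", text)


def segment(text, threshold=0.5):
    """Join words across boundaries whose score exceeds the threshold.

    Boundaries scoring at or below the threshold break the text, so each
    returned segment is either a single word or a multiword chunk.
    """
    tokens = tokenize(text)
    segments = []
    current = ""
    for i, tok in enumerate(tokens):
        if re.match(r"\w", tok):
            # Word token: extend the segment currently being built.
            current += tok
            continue
        # Boundary token: look at the words on either side of it.
        left = tokens[i - 1].lower() if i > 0 else ""
        right = tokens[i + 1].lower() if i + 1 < len(tokens) else ""
        score = BIND_SCORE.get((left, tok, right), 0.0)
        if left and right and score > threshold:
            current += tok  # bound: keep the boundary inside the segment
        elif current:
            segments.append(current)  # broken: close the current segment
            current = ""
    if current:
        segments.append(current)
    return segments


print(segment("He will kick the bucket soon."))
```

In this toy setup each boundary is examined once in a single left-to-right pass, so no candidate combinations of segments are enumerated. How the actual model scores boundaries is not specified in the abstract, so the lookup table here is only a stand-in.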

Cite

Williams, J. R. (2017). Boundary-Based MWE Segmentation With Text Partitioning. In 3rd Workshop on Noisy User-Generated Text, W-NUT 2017 - Proceedings of the Workshop (pp. 1–10). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/w17-4401
