This work presents a fine-grained text-chunking algorithm designed for the task of multiword expression (MWE) segmentation. As a lexical class, MWEs include a wide variety of idioms, whose automatic identification is a necessity for the handling of colloquial language. The algorithm's core novelty is its use of non-word tokens, i.e., boundaries, in a bottom-up strategy. Leveraging boundaries refines token-level information, yielding high-level performance from relatively basic data. The generality of the model's feature space allows for its application across languages and domains. Experiments spanning 19 different languages exhibit a broadly applicable, state-of-the-art model. Evaluation against recent shared-task data places text partitioning as the overall best-performing MWE segmentation algorithm, covering all MWE classes and multiple English domains (including user-generated text). This performance, coupled with a non-combinatorial, fast-running design, makes the algorithm well suited to implementations at scale, which are facilitated through the release of open-source software.
Williams, J. R. (2017). Boundary-Based MWE Segmentation With Text Partitioning. In 3rd Workshop on Noisy User-Generated Text, W-NUT 2017 - Proceedings of the Workshop (pp. 1–10). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/w17-4401