Bootstrapping a Multilingual Part-of-speech Tagger in One Person-day

1Citations
Citations of this article
87Readers
Mendeley users who have this article in their library.

Abstract

This paper presents a method for bootstrapping a fine-grained, broad-coverage part-of-speech (POS) tagger in a new language using only one person-day of data acquisition effort. It requires only three resources, which are currently readily available in 60-100 world languages: (1) an online or hard-copy pocket-sized bilingual dictionary, (2) a basic library reference grammar, and (3) access to an existing monolingual text corpus in the language. The algorithm begins by inducing initial lexical POS distributions from English translations in a bilingual dictionary without POS tags. It handles irregular, regular and semi-regular morphology through a robust generative model using weighted Levenshtein alignments. Unsupervised induction of grammatical gender is performed via global modeling of context-window feature agreement. Using a combination of these and other evidence sources, interactive training of context and lexical prior models are accomplished for fine-grained POS tag spaces. Experiments show high accuracy, fine-grained tag resolution with minimal new human effort.

Cite

CITATION STYLE

APA

Cucerzan, S., & Yarowsky, D. (2002). Bootstrapping a Multilingual Part-of-speech Tagger in One Person-day. In Proceedings of the Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics (ACL). https://doi.org/10.3115/1118853.1118859

Register to see more suggestions

Mendeley helps you to discover research relevant for your work.

Already have an account?

Save time finding and organizing research with Mendeley

Sign up for free