Achieving an almost correct PoS-tagged corpus

Abstract

After a theoretical discussion of corpus representativity, this paper presents a simple yet very efficient technique for the (semi-)automatic detection of positions in a part-of-speech tagged corpus where an error is to be suspected. The approach is based on learning and applying “invalid bigrams”, i.e. on searching for pairs of adjacent tags that constitute an incorrect configuration in a text of a particular language (in English, for example, the bigram ARTICLE - VERB). The paper then generalizes “invalid bigrams” to “extended invalid bigrams of length n”, for any natural number n, which provides a powerful tool for error detection in a corpus. The approach is illustrated with English, German and Czech examples.
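The plain invalid-bigram idea can be illustrated with a minimal Python sketch. The abstract does not say how invalid bigrams are learned, so this sketch makes a simplifying assumption: any tag pair that never occurs adjacently in a small hand-checked sample is treated as invalid. The function names, the toy tagset and the sample sentences are all hypothetical, and the extension to length n is not covered.

```python
from itertools import product


def learn_invalid_bigrams(trusted_sentences, tagset):
    """Return every tag pair over `tagset` never seen adjacent in trusted data.

    Assumption (not from the paper): "invalid" is approximated as
    "unobserved in a hand-checked sample".
    """
    seen = set()
    for tags in trusted_sentences:
        seen.update(zip(tags, tags[1:]))
    return set(product(tagset, repeat=2)) - seen


def suspect_positions(tags, invalid_bigrams):
    """Return each index i where (tags[i], tags[i + 1]) is an invalid bigram."""
    return [
        i
        for i, pair in enumerate(zip(tags, tags[1:]))
        if pair in invalid_bigrams
    ]


# Hypothetical toy tagset and hand-checked tag sequences.
TAGSET = {"ARTICLE", "NOUN", "VERB", "ADJ"}
trusted = [
    ["ARTICLE", "NOUN", "VERB"],
    ["ARTICLE", "ADJ", "NOUN", "VERB", "ARTICLE", "NOUN"],
]

invalid = learn_invalid_bigrams(trusted, TAGSET)

# A tag sequence containing the ARTICLE - VERB configuration from the abstract.
sentence = ["ARTICLE", "VERB", "ARTICLE", "NOUN"]
suspects = suspect_positions(sentence, invalid)
print(suspects)
```

With so small a sample, many legitimate pairs would also be flagged, so a flagged position is only a candidate for manual review, in keeping with the (semi-)automatic character of the detection.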

Cite

Květoň, P., & Oliva, K. (2002). Achieving an almost correct PoS-tagged corpus. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 2448, pp. 19–26). Springer Verlag. https://doi.org/10.1007/3-540-46154-x_3
