An Unsupervised Approach for the Detection of Outliers in Corpora

  • Guthrie D
  • Guthrie L
  • Wilks Y

Abstract

Many applications of computational linguistics are greatly influenced by the quality of the corpora available, and as automatically generated corpora play an increasingly common role, it is essential not to overlook the importance of well-constructed, homogeneous corpora. This paper describes an automatic approach to improving the homogeneity of corpora, using an unsupervised method of statistical outlier detection to find documents and segments that do not belong in a corpus. We consider collections that are homogeneous with respect to topic (i.e., about the same subject) or genre (written for the same audience or from the same source), and use a combination of stylistic and lexical features of the texts to automatically identify pieces of text that break this homogeneity. Pieces of text that differ significantly from the rest of the corpus are likely to be errors and should be removed before the corpus is used for other tasks. We evaluate our techniques by running extensive experiments over large, artificially constructed corpora that each contain single pieces of text from a different topic, author, or genre than the rest of the collection, and we measure the accuracy of identifying these pieces of text without the use of training data. We show that when these pieces of text are reasonably large (1,000 words), we can reliably identify them in a corpus.
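
The general idea in the abstract (represent each piece of text by stylistic and lexical features, then flag the pieces that sit far from the rest of the collection, with no training data) can be illustrated with a minimal sketch. This is not the authors' method: the feature set, the `FUNCTION_WORDS` list, the function names `features` and `rank_outliers`, and the use of a z-scored Euclidean distance are all assumptions chosen for illustration.

```python
import math
import re
from statistics import mean, pstdev

# Hypothetical, deliberately tiny list of function words (an assumption,
# not the feature set used in the paper).
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in", "that", "is")


def features(text):
    """Map a piece of text to a small stylistic/lexical feature vector."""
    words = re.findall(r"[a-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n = max(len(words), 1)  # avoid division by zero on empty input
    vec = [
        sum(len(w) for w in words) / n,   # average word length
        len(words) / max(len(sentences), 1),  # average sentence length
        len(set(words)) / n,              # type-token ratio
    ]
    # Relative frequency of each function word.
    vec += [words.count(fw) / n for fw in FUNCTION_WORDS]
    return vec


def rank_outliers(segments):
    """Return segment indices ordered from most to least unusual.

    Each feature is standardized across the collection, and a segment's
    score is its Euclidean distance from the collection mean in that
    standardized space. No labels or training data are used.
    """
    vectors = [features(s) for s in segments]
    columns = list(zip(*vectors))
    mus = [mean(c) for c in columns]
    # A constant feature has zero spread; use 1.0 so it contributes nothing.
    sigmas = [pstdev(c) or 1.0 for c in columns]
    scores = []
    for v in vectors:
        z = [(x - m) / s for x, m, s in zip(v, mus, sigmas)]
        scores.append(math.sqrt(sum(t * t for t in z)))
    return sorted(range(len(segments)), key=lambda i: scores[i], reverse=True)
```

Calling `rank_outliers(segments)` on a list of text segments returns their indices with the most atypical segment first. Feature estimates of this kind are noisy on short texts, which is consistent with the abstract's observation that identification is reliable when the pieces are reasonably large (1,000 words).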


Find this document

  • SGR: 84929594810
  • ISBN: 2951740840
  • PUI: 619617039
  • SCOPUS: 2-s2.0-84929594810

Authors

  • David Guthrie

  • Louise Guthrie

  • Yorick Wilks
