Measuring the distance between comparable corpora between languages

Abstract

The notion of comparable corpora rests on our ability to assess the difference between corpora which are claimed to be comparable, but this activity is still an art rather than a proper science. Here I discuss attempts at approximating the content of corpora collected from the Web using various methods, also in comparison to traditional corpora such as the BNC. The procedure for estimating corpus composition is based on selecting keywords, followed by hard clustering or by building topic models. It can apply to corpora within the same language, e.g., the BNC against ukWaC, as well as to corpora in different languages, e.g., webpages collected using the same procedure for English and Russian.
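
The keyword-selection step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say which keyness statistic is used, so the log-likelihood score below is an assumption, and the names `log_likelihood` and `select_keywords` are hypothetical helpers introduced only for this example. The resulting keyword lists would then serve as document features for hard clustering or topic modelling.

```python
import math
from collections import Counter


def log_likelihood(a, b, total_a, total_b):
    """Log-likelihood keyness of one word (assumed statistic, not from the abstract).

    a, b: frequency of the word in the document and in the reference corpus.
    total_a, total_b: total token counts of the document and the reference corpus.
    """
    expected_a = total_a * (a + b) / (total_a + total_b)
    expected_b = total_b * (a + b) / (total_a + total_b)
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / expected_a)
    if b > 0:
        ll += b * math.log(b / expected_b)
    return 2 * ll


def select_keywords(doc_tokens, reference_counts, top_n=5):
    """Return the top_n words over-represented in the document relative to the reference."""
    counts = Counter(doc_tokens)
    total_doc = sum(counts.values())
    total_ref = sum(reference_counts.values())
    scored = []
    for word, a in counts.items():
        b = reference_counts.get(word, 0)
        # Keep only words that are relatively more frequent in the document.
        if a / total_doc > b / total_ref:
            scored.append((log_likelihood(a, b, total_doc, total_ref), word))
    scored.sort(reverse=True)
    return [word for _, word in scored[:top_n]]


# Toy data, invented for illustration only.
reference = Counter({"the": 100, "of": 50, "corpus": 1, "web": 1})
document = "corpus corpus corpus the the web".split()
print(select_keywords(document, reference, top_n=2))
```

Scoring against a shared reference frequency list keeps the keywords of different documents comparable within one language. For corpora in different languages, some further mapping between the keyword sets would be needed, which this sketch does not attempt.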

Sharoff, S. (2013). Measuring the distance between comparable corpora between languages. In Building and Using Comparable Corpora (pp. 113–130). Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-642-20128-8_6
