The notion of comparable corpora rests on our ability to assess the difference between corpora which are claimed to be comparable, but this activity is still more an art than a proper science. Here I will discuss attempts at approximating the content of corpora collected from the Web using various methods, also in comparison to traditional corpora, such as the BNC. The procedure for estimating the corpus composition is based on selecting keywords, followed by hard clustering or by building topic models. This can apply to corpora within the same language, e.g., the BNC against ukWaC, as well as to corpora in different languages, e.g., webpages collected using the same procedure for English and Russian.
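The keyword-then-clustering procedure described above can be illustrated with a minimal sketch. The abstract does not say which keyword statistic or clustering algorithm is used, so the choices here are assumptions made purely for illustration: keywords are ranked with the log-likelihood (G2) score against a reference corpus, and documents are hard-clustered with a small k-means over their keyword frequencies. The toy documents and the function names (`select_keywords`, `vectorize`, `kmeans`) are hypothetical.

```python
import math
from collections import Counter


def select_keywords(target_docs, reference_docs, top_n=4):
    """Rank words overrepresented in the target corpus by log-likelihood (G2).

    Assumption: G2 is one common keyword statistic, not necessarily the
    one used in the chapter. Both corpora are assumed to be non-empty.
    """
    tgt = Counter(w for d in target_docs for w in d.lower().split())
    ref = Counter(w for d in reference_docs for w in d.lower().split())
    n_tgt, n_ref = sum(tgt.values()), sum(ref.values())
    scores = {}
    for word, a in tgt.items():
        b = ref.get(word, 0)
        if a / n_tgt <= b / n_ref:
            continue  # keep only words overrepresented in the target
        # Expected counts if the word were equally frequent in both corpora.
        e1 = n_tgt * (a + b) / (n_tgt + n_ref)
        e2 = n_ref * (a + b) / (n_tgt + n_ref)
        g2 = 2 * (a * math.log(a / e1) + (b * math.log(b / e2) if b else 0.0))
        scores[word] = g2
    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]


def vectorize(doc, keywords):
    """Represent a document by the relative frequency of each keyword."""
    counts = Counter(doc.lower().split())
    total = sum(counts.values()) or 1
    return [counts[k] / total for k in keywords]


def kmeans(vectors, k, iters=20):
    """Hard clustering: each vector gets exactly one cluster label.

    Assumption: centroids start at the first k vectors, which keeps this
    toy example deterministic; real use would need a better initialisation.
    """
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iters):
        labels = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(v, centroids[c])))
            for v in vectors
        ]
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical toy data: a "web" corpus and a "reference" corpus.
web_docs = [
    "cheap hotel booking hotel deals",
    "football match football score",
    "hotel rooms booking online",
    "football league match results",
]
reference_docs = [
    "the committee discussed the report",
    "the minister read the report",
]

keywords = select_keywords(web_docs, reference_docs, top_n=4)
vectors = [vectorize(d, keywords) for d in web_docs]
labels = kmeans(vectors, k=2)
print(keywords)
print(labels)
```

The share of documents falling into each cluster then gives a rough profile of the corpus composition, which could be compared across corpora. For corpora in different languages, the keywords would first need to be mapped into a shared space, a step this sketch does not attempt.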
Sharoff, S. (2013). Measuring the distance between comparable corpora between languages. In Building and Using Comparable Corpora (pp. 113–130). Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-642-20128-8_6