Many text mining algorithms and applications require large text corpora and certain statistics-based annotations. To ensure comparability of results, a standardized corpus building process is required. The pre-processing procedures are particularly noteworthy, as they are crucial for the quality of the resulting data stock. This quality can be estimated both by evaluating the corpus building process and by statistical quality measurements on the corpus. Some of these approaches are described using the example of the Leipzig Corpora Collection.
Quasthoff, U., Goldhahn, D., & Eckart, T. (2014). Building Large Resources for Text Mining: The Leipzig Corpora Collection (pp. 3–24). https://doi.org/10.1007/978-3-319-12655-5_1