Large linguistically-processed Web corpora for multiple languages

Marco Baroni; Adam Kilgarriff

Conference Proceedings, Open Access

EACL 2006 - 11th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference (2006) 87-90

DOI: 10.3115/1608974.1608976

91 Citations

138 Readers

Abstract

The Web contains vast amounts of linguistic data. One key issue for linguists and language technologists is how to access it. Commercial search engines give highly compromised access. An alternative is to crawl the Web ourselves, which also allows us to remove duplicates and near-duplicates, navigational material, and a range of other kinds of non-linguistic matter. We can also tokenize, lemmatise and part-of-speech tag the corpus, and load the data into a corpus query tool which supports sophisticated linguistic queries. We have now done this for German and Italian, with corpus sizes of over 1 billion words in each case. We provide Web access to the corpora in our query tool, the Sketch Engine.
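The abstract mentions removing duplicates and near-duplicates from crawled pages but does not say how. As a rough illustration only, one common approach compares documents by their overlapping word n-grams ("shingles") and drops any page that is too similar to one already kept. The sketch below assumes this technique; the function names, the shingle size `n=5`, and the `threshold=0.5` cutoff are hypothetical choices for the example and are not taken from the paper.

```python
def shingles(text, n=5):
    """Return the set of word n-grams (shingles) in a text."""
    words = text.lower().split()
    if len(words) < n:
        # Short texts yield a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def drop_near_duplicates(docs, n=5, threshold=0.5):
    """Keep a document only if it is not too similar to one already kept.

    Illustrative assumption: similarity is Jaccard overlap of word
    n-grams, and `threshold` is an arbitrary cutoff for this sketch.
    """
    kept = []
    kept_shingles = []
    for doc in docs:
        s = shingles(doc, n)
        if any(jaccard(s, t) >= threshold for t in kept_shingles):
            continue  # near-duplicate of an earlier document
        kept.append(doc)
        kept_shingles.append(s)
    return kept
```

This pairwise comparison is quadratic in the number of documents, so it would not scale to a billion-word crawl as written; at that size some form of hashing or sampling of shingles would probably be needed.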

Cite

Baroni, M., & Kilgarriff, A. (2006). Large linguistically-processed Web corpora for multiple languages. In EACL 2006 - 11th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference (pp. 87–90). Association for Computational Linguistics (ACL). https://doi.org/10.3115/1608974.1608976
