In this paper we introduce ukWaC, a large corpus of English constructed by crawling the .uk Internet domain. The corpus contains more than 2 billion tokens and is one of the largest freely available linguistic resources for English. The paper describes the tools and methodology used in the construction of the corpus and provides a qualitative evaluation of its contents, carried out through a vocabulary-based comparison with the BNC. We conclude by giving practical information about availability and format of the corpus.
CITATION STYLE
Ferraresi, A., Zanchetta, E., Baroni, M., & Bernardini, S. (2008). Introducing and evaluating ukWaC, a very large web-derived corpus of English. In Proceedings of the 4th Web as Corpus Workshop (WAC-4). Can we beat Google? (pp. 47–54).