Web as a Corpus: Going beyond the n-gram

Abstract

The 60-year-old dream of computational linguistics is to make computers capable of communicating with humans in natural language. This has proven hard, so research has focused on subproblems. Even so, the field was stuck with manual rules until the early 1990s, when computers became powerful enough to enable the rise of statistical approaches. Eventually, this shifted the main research attention to machine learning from text corpora, triggering a revolution in the field. Today, the Web is the biggest available corpus, providing access to quadrillions of words; and in corpus-based natural language processing, size does matter. Unfortunately, while there has been substantial research on the Web as a corpus, it has typically been restricted to using page hit counts as an estimate for n-gram word frequencies; this has led some researchers to conclude that the Web should only be used as a baseline. We show that much better results are possible for structural ambiguity problems when going beyond the n-gram.
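To make the criticized baseline concrete, here is a minimal sketch of the hit-count approach the abstract describes: resolving the bracketing of a three-word noun compound by comparing bigram frequencies under the adjacency model, a standard baseline for this structural ambiguity task. The counts below are hypothetical stand-ins for web page-hit counts (a real system would query a search engine or a large n-gram corpus), and the paper's point is that richer evidence than these raw counts does substantially better.

    # Sketch of the page-hit-count baseline (hypothetical counts, not real data).
    def bracket(w1, w2, w3, count):
        """Adjacency model: prefer left bracketing ((w1 w2) w3) if the inner
        bigram (w1, w2) is more frequent than (w2, w3), else right bracketing."""
        if count((w1, w2)) >= count((w2, w3)):
            return f"(({w1} {w2}) {w3})"
        return f"({w1} ({w2} {w3}))"

    # Hypothetical hit counts standing in for search-engine page hits.
    HITS = {("liver", "cell"): 1_290_000, ("cell", "antibody"): 94_000}

    print(bracket("liver", "cell", "antibody", lambda bg: HITS.get(bg, 0)))
    # -> ((liver cell) antibody)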

Cite

Nakov, P. (2015). Web as a Corpus: Going beyond the n-gram. In Communications in Computer and Information Science (Vol. 505, pp. 185–228). Springer. https://doi.org/10.1007/978-3-319-25485-2_5
