Web as a Corpus: Going beyond the n-gram

Abstract

The 60-year-old dream of computational linguistics is to make computers capable of communicating with humans in natural language. This has proven hard, so research has focused on subproblems. Even so, the field was stuck with manual rules until the early 1990s, when computers became powerful enough to enable the rise of statistical approaches. Eventually, this shifted the main research attention to machine learning from text corpora, triggering a revolution in the field. Today, the Web is the biggest available corpus, providing access to quadrillions of words; and in corpus-based natural language processing, size does matter. Unfortunately, while there has been substantial research on the Web as a corpus, it has typically been restricted to using page hit counts as an estimate for n-gram word frequencies; this has led some researchers to conclude that the Web should only be used as a baseline. We show that much better results are possible for structural ambiguity problems when going beyond the n-gram.
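To make the criticized baseline concrete, here is a minimal sketch of the hit-count approach the abstract describes: resolving the bracketing of a three-word noun compound by comparing bigram frequencies under the adjacency model, a standard baseline for this structural ambiguity task. The counts below are hypothetical stand-ins for web page-hit counts (a real system would query a search engine or a large n-gram corpus), and the paper's point is that richer evidence than these raw counts does substantially better.

    # Sketch of the page-hit-count baseline (hypothetical counts, not real data).
    def bracket(w1, w2, w3, count):
        """Adjacency model: prefer left bracketing ((w1 w2) w3) if the inner
        bigram (w1, w2) is more frequent than (w2, w3), else right bracketing."""
        if count((w1, w2)) >= count((w2, w3)):
            return f"(({w1} {w2}) {w3})"
        return f"({w1} ({w2} {w3}))"

    # Hypothetical hit counts standing in for search-engine page hits.
    HITS = {("liver", "cell"): 1_290_000, ("cell", "antibody"): 94_000}

    print(bracket("liver", "cell", "antibody", lambda bg: HITS.get(bg, 0)))
    # -> ((liver cell) antibody)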

Cite

Nakov, P. (2015). Web as a Corpus: Going beyond the n-gram. In Communications in Computer and Information Science (Vol. 505, pp. 185–228). Springer. https://doi.org/10.1007/978-3-319-25485-2_5
