The 60-year-old dream of computational linguistics is to make computers capable of communicating with humans in natural language. This has proven hard, and thus research has focused on subproblems. Even so, the field was stuck with manual rules until the early 90s, when computers became powerful enough to enable the rise of statistical approaches. Eventually, this shifted the main research attention to machine learning from text corpora, thus triggering a revolution in the field. Today, the Web is the biggest available corpus, providing access to quadrillions of words; and, in corpus-based natural language processing, size does matter. Unfortunately, while there has been substantial research on the Web as a corpus, it has typically been restricted to using page hit counts as an estimate for n-gram word frequencies; this has led some researchers to conclude that the Web should be only used as a baseline. We show that much better results are possible for structural ambiguity problems, when going beyond the n-gram.
CITATION STYLE
Nakov, P. (2015). Web as a Corpus: Going beyond the n-gram. In Communications in Computer and Information Science (Vol. 505, pp. 185–228). Springer Verlag. https://doi.org/10.1007/978-3-319-25485-2_5
Mendeley helps you to discover research relevant for your work.