Crawling by Readability level

Abstract

The availability of annotated corpora for research in Readability Assessment is still very limited. At the same time, the Web is increasingly used by researchers as a source of written content for building very large and rich corpora, in the Web as Corpus (WaC) initiative. This paper proposes a framework for the automatic generation of large corpora classified by readability. It uses a supervised learning method to add a readability filter, based on features with low computational cost, to a crawler, so that the crawler collects texts targeted at a specific reading level. We evaluate this framework by comparing a readability-assessed web-crawled corpus to a reference corpus (both corpora are available at http://www.inf.ufrgs.br/pln/resource/CrawlingByReadabilityLevel.zip). The results indicate that these features are good at separating level 1 texts (initial grades) from texts of other levels. As a result of this work, two Portuguese corpora were constructed: the Wikilivros Readability Corpus, classified by grade level, and a crawled WaC classified by readability level.
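The idea of a low-cost readability filter inside a crawler can be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not list the features or the learning algorithm, so the three surface features (words per sentence, characters per word, type-token ratio) and the nearest-centroid classifier below are stand-ins, and all function names are hypothetical.

```python
import math
import re
from collections import defaultdict


def shallow_features(text):
    """Cheap surface features (assumed, not taken from the paper):
    words per sentence, characters per word, type-token ratio."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\w+", text.lower())
    if not sentences or not words:
        return (0.0, 0.0, 0.0)
    return (
        len(words) / len(sentences),
        sum(len(w) for w in words) / len(words),
        len(set(words)) / len(words),
    )


def train_centroids(labeled_texts):
    """Nearest-centroid stand-in for the supervised model:
    one mean feature vector per readability level."""
    grouped = defaultdict(list)
    for text, level in labeled_texts:
        grouped[level].append(shallow_features(text))
    return {
        level: tuple(sum(column) / len(column) for column in zip(*vectors))
        for level, vectors in grouped.items()
    }


def predict_level(text, centroids):
    """Assign the level whose centroid is closest in (unscaled) feature space."""
    feats = shallow_features(text)
    return min(centroids, key=lambda level: math.dist(feats, centroids[level]))


def readability_filter(pages, centroids, target_level):
    """Keep only the fetched pages predicted to be at the target level."""
    return [p for p in pages if predict_level(p, centroids) == target_level]
```

In a crawler, `readability_filter` would be called on the text of each fetched page, with centroids trained beforehand on a graded reference corpus. The features are left unscaled here for brevity, so sentence length dominates the distance; a real setup would likely normalize them.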

Cite

Wagner Filho, J. A., Wilkens, R., Zilio, L., Idiart, M., & Villavicencio, A. (2016). Crawling by Readability level. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 9727, pp. 306–318). Springer Verlag. https://doi.org/10.1007/978-3-319-41552-9_31
