Crawling by Readability level

Jorge A. Wagner Filho; Rodrigo Wilkens; Leonardo Zilio; Marco Idiart; Aline Villavicencio

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2016) 9727 306-318

DOI: 10.1007/978-3-319-41552-9_31

Abstract

The availability of annotated corpora for research in Readability Assessment is still very limited. On the other hand, the Web is increasingly being used by researchers as a source of written content to build very large and rich corpora, in the Web as Corpus (WaC) initiative. This paper proposes a framework for the automatic generation of large corpora classified by readability. It adopts a supervised learning method to incorporate into a crawler a readability filter based on features with low computational cost, in order to collect texts targeted at a specific reading level. We evaluate this framework by comparing a readability-assessed web-crawled corpus to a reference corpus (both corpora are available at http://www.inf.ufrgs.br/pln/resource/CrawlingByReadabilityLevel.zip). The results indicate that these features are good at separating texts of level 1 (initial grades) from those of other levels. As a result of this work, two Portuguese corpora were constructed: the Wikilivros Readability Corpus, classified by grade level, and a crawled WaC classified by readability level.
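
The general idea of a crawler-side readability filter can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the features or the learning algorithm, so the two features below (average sentence length and average word length), the nearest-centroid classifier, and all function names are assumptions chosen only because they are cheap to compute.

```python
import re
from collections import defaultdict
from math import dist


def cheap_features(text):
    """Return (average words per sentence, average characters per word).

    These two features are illustrative assumptions, not the paper's set.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\w+", text)
    if not sentences or not words:
        return (0.0, 0.0)
    avg_sentence_len = len(words) / len(sentences)
    avg_word_len = sum(len(w) for w in words) / len(words)
    return (avg_sentence_len, avg_word_len)


def train_centroids(labeled_texts):
    """Supervised step: average the feature vectors of each labeled level."""
    grouped = defaultdict(list)
    for text, level in labeled_texts:
        grouped[level].append(cheap_features(text))
    return {
        level: tuple(sum(column) / len(column) for column in zip(*vectors))
        for level, vectors in grouped.items()
    }


def predict_level(text, centroids):
    """Assign the level whose centroid is closest in feature space."""
    features = cheap_features(text)
    return min(centroids, key=lambda level: dist(features, centroids[level]))


def readability_filter(pages, centroids, target_level):
    """Keep only the crawled pages predicted to be at the target level."""
    return [p for p in pages if predict_level(p, centroids) == target_level]
```

In a crawler, a function like the hypothetical `readability_filter` would run on each fetched page before it is stored, so that only texts predicted to match the target reading level enter the corpus. The reference corpus with known grade levels would play the role of `labeled_texts`.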

Citation (APA)

Wagner Filho, J. A., Wilkens, R., Zilio, L., Idiart, M., & Villavicencio, A. (2016). Crawling by Readability level. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 9727, pp. 306–318). Springer Verlag. https://doi.org/10.1007/978-3-319-41552-9_31