Annotated corpora for research on Readability Assessment are still very scarce. At the same time, researchers increasingly use the Web as a source of written content for building very large and rich corpora, as in the Web as Corpus (WaC) initiative. This paper proposes a framework for automatically generating large corpora classified by readability. It uses supervised learning to build a readability filter based on features with low computational cost and incorporates it into a crawler, so that the crawler collects texts targeted at a specific reading level. We evaluate the framework by comparing a readability-assessed, web-crawled corpus to a reference corpus (both corpora are available at http://www.inf.ufrgs.br/pln/resource/CrawlingByReadabilityLevel.zip). The results indicate that these features are good at separating level 1 texts (initial grades) from texts at other levels. This work produced two Portuguese corpora: the Wikilivros Readability Corpus, classified by grade level, and a crawled WaC classified by readability level.
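The idea of a low-cost readability filter attached to a crawler can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and not the authors' implementation: the three surface features, the nearest-centroid classifier standing in for the unspecified supervised learning method, and the hypothetical helper names `shallow_features`, `train_centroids`, `predict_level` and `filter_by_level`.

```python
import re
from statistics import mean


def shallow_features(text):
    """Cheap surface features (illustrative choices, not the paper's feature set)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\w+", text.lower())
    if not sentences or not words:
        return (0.0, 0.0, 0.0)
    return (
        len(words) / len(sentences),   # mean sentence length, in words
        mean(len(w) for w in words),   # mean word length, in characters
        len(set(words)) / len(words),  # type-token ratio
    )


def train_centroids(labelled_texts):
    """Average the feature vectors of each level; input is (text, level) pairs."""
    by_level = {}
    for text, level in labelled_texts:
        by_level.setdefault(level, []).append(shallow_features(text))
    return {
        level: tuple(mean(column) for column in zip(*rows))
        for level, rows in by_level.items()
    }


def predict_level(text, centroids):
    """Assign the level whose centroid is closest in squared Euclidean distance."""
    feats = shallow_features(text)
    return min(
        centroids,
        key=lambda level: sum(
            (a - b) ** 2 for a, b in zip(feats, centroids[level])
        ),
    )


def filter_by_level(pages, centroids, target_level):
    """Keep only the crawled page texts predicted to be at the target level."""
    return [page for page in pages if predict_level(page, centroids) == target_level]


# Toy training data (invented for this sketch; a real filter would be trained
# on a graded reference corpus).
training = [
    ("O gato dorme. O cão corre.", 1),
    ("A menina lê. O sol brilha.", 1),
    ("A disponibilidade de corpora anotados para pesquisa em avaliação "
     "de leiturabilidade permanece consideravelmente limitada atualmente.", 3),
]
centroids = train_centroids(training)
crawled = [
    "O pato nada. A ave voa.",
    "Pesquisadores utilizam frequentemente conteúdos extraídos automaticamente "
    "da internet para construir corpora extensos e linguisticamente ricos.",
]
level_one_pages = filter_by_level(crawled, centroids, target_level=1)
```

In a crawler, `filter_by_level` would be applied to the text of each fetched page before it is stored. The features are left unscaled here, so sentence length dominates the distance; a real classifier would normally normalise them first.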
Wagner Filho, J. A., Wilkens, R., Zilio, L., Idiart, M., & Villavicencio, A. (2016). Crawling by Readability level. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 9727, pp. 306–318). Springer Verlag. https://doi.org/10.1007/978-3-319-41552-9_31