
Generating and Rendering Readability Scores for Project Gutenberg Texts

by P Reck, R A Reck
Reading (2007)

Available from ucrel.lancs.ac.uk

Ronald P. Reck
RRecktek LLC.
rreck@rrecktek.com
and
Ruth A. Reck
University of California
Davis, CA 95616 U.S.A.
rareck@ucdavis.edu


Abstract

Frequency distribution functions have been calculated for seven different readability
measurements across more than fourteen thousand texts from Project Gutenberg¹ (PG). Other
supporting measurements were also obtained: the average characters per word, words per
sentence, and syllables per word.

Three types of distributions emerge from the analysis of the metadata. While there are
similarities among some of the scores, considerable interpretation remains to be done. The
most complex and distinctive distribution function is found for the Flesch Reading Ease
scores. Because of the computing intensity necessary to obtain these distributions, it is only
in the present age of information science that such a broad-brush characterization of a
billion-word data source can be made. It is essential that the texts be sorted by language to
better interpret the meaning of the distributions.

1. Introduction
Various readability measurements can serve as indicators that quantify the relative
accessibility of written information. However, domain-specific attributes such as complex
terminology or language can push readability scores towards higher values than the actual
complexity of the text warrants. For instance, scientific writing is likely to contain long words
that do not significantly increase the complexity of the writing for readers familiar with the
terms but nevertheless make the readability value appear greater. Despite this and other
limitations, readability measurements remain useful attributes for describing text, especially
when the values are regarded as relative measurements within a specific type of writing or
language. This paper reports the distribution of seven different readability measurements for
over fourteen thousand texts from Project Gutenberg, a collection of free electronic books.

This effort computes the following seven readability measurements: (1) the Automated
Readability Index; (2) the Coleman-Liau formula; (3) the Flesch Reading Ease score; (4) the
Gunning Fog Index; (5) the Flesch-Kincaid score; (6) the Laesbarhedsindex (Lix) score; and
(7) the SMOG score.
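As a rough illustration of how such scores follow from simple counts, the sketch below computes the three supporting measurements (characters per word, words per sentence, syllables per word) and four of the seven scores using their commonly published formulas. It is not the authors' implementation: the function names, the regular-expression tokenizer, and the vowel-group syllable heuristic are all assumptions made for this example, and the Coleman-Liau, Gunning Fog, and SMOG scores are omitted for brevity.

```python
import re


def count_syllables(word):
    # Assumption: approximate syllables as runs of consecutive vowels and
    # drop one for a trailing silent "e". Real syllabification is harder.
    w = word.lower()
    count = len(re.findall(r"[aeiouy]+", w))
    if w.endswith("e") and not w.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def text_counts(text):
    # Assumption: sentences end at runs of ".", "!" or "?", and words are
    # runs of ASCII letters (digits and punctuation are ignored).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    return {
        "sentences": len(sentences),
        "words": len(words),
        "characters": sum(len(w) for w in words),
        "syllables": sum(count_syllables(w) for w in words),
        "long_words": sum(1 for w in words if len(w) > 6),  # used by Lix
    }


def readability_scores(text):
    c = text_counts(text)
    if c["words"] == 0 or c["sentences"] == 0:
        raise ValueError("text must contain at least one word and one sentence")
    cpw = c["characters"] / c["words"]   # characters per word
    wps = c["words"] / c["sentences"]    # words per sentence
    spw = c["syllables"] / c["words"]    # syllables per word
    return {
        "chars_per_word": cpw,
        "words_per_sentence": wps,
        "syllables_per_word": spw,
        "automated_readability_index": 4.71 * cpw + 0.5 * wps - 21.43,
        "flesch_reading_ease": 206.835 - 1.015 * wps - 84.6 * spw,
        "flesch_kincaid_grade": 0.39 * wps + 11.8 * spw - 15.59,
        "lix": wps + 100.0 * c["long_words"] / c["words"],
    }
```

Calling `readability_scores(text)` on the body of one ebook returns a dictionary of values. Note that a higher Flesch Reading Ease score indicates easier text, whereas higher values of the other three scores indicate harder text, so the scores should not be compared on a single scale.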

¹ Project Gutenberg is a library of thousands of free ebooks whose copyright has expired in the
USA. It can be found at http://www.gutenberg.org

Work to quantify readability began at least as far back as 1921 with The Teacher’s Word
Book (Thorndike, 1921). Mathematical equations and word frequencies were used to derive a
measure of book difficulty. This work was largely a response to teachers’ requests for science
books that taught facts without being encumbered by difficult vocabulary. As part of the
‘plain language movement’, it supported the idea that clear, unpretentious language can
increase understanding.

More recently, readability measures have been used by the United States Navy
(Kincaid et al., 1975). Enlisted personnel in training schools were tested to determine their
comprehension level, and training manuals were then designed to fall within that level.

2. Brief history of rendering Project Gutenberg metadata
Reck’s initial efforts to create and articulate metadata describing the Project Gutenberg
repository were first documented in Metadata Cards for Describing Project Gutenberg Texts
(Reck, 2006). That effort involved a process for creating as many as eighteen attributes for
each of 15,511 PG texts, thereby producing 912 thousand assertions. Substantial effort went
into accommodating the wide variability and poor consistency of PG formats. A sample
metacard from that effort, describing the attributes of “A Horse’s Tale”, is shown in Figure 1.

Figure 1: Sample Metacard for ebook ‘A Horse’s Tale’
