Papers in this group
-
The Digital Special Collections at the Leiden University Library offers an extensive selection of digitised manuscripts, letters, early printed and rare edition books. Currently most of this material can only be viewed in picture format. Thus the…
-
From the outset in January 2007, the ANDP team decided to make a considerable investment in software development in order to quality-assure digital outputs so that they met minimum standards, and to future-proof the files in case they…
-
Most methods for document image retrieval rely solely on text information to find similar documents. This paper describes a way to use layout information for document image retrieval instead. A new class of distance measures is introduced for…
-
A longest-common-subsequence algorithm is described which operates in terms of bit or bit-string operations. It offers a speedup of the order of the word-length on a conventional computer.
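To make the bit-string idea concrete, here is a minimal sketch of a bit-parallel LCS length computation. It uses a bit-vector recurrence known from the later literature and is not necessarily the formulation given in the paper; the function name `lcs_length` is hypothetical. Python integers are arbitrary-precision, so the sketch shows the bit operations only and does not demonstrate the word-length speedup itself.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (illustrative sketch)."""
    m = len(a)
    # One bit mask per character: bit i is set when a[i] equals that character.
    masks = {}
    for i, ch in enumerate(a):
        masks[ch] = masks.get(ch, 0) | (1 << i)

    full = (1 << m) - 1   # keeps the state vector to m bits
    v = full              # all ones: no matches recorded yet
    for ch in b:
        u = v & masks.get(ch, 0)
        # One addition, one subtraction and one OR update a whole row at once.
        v = ((v + u) | (v - u)) & full

    # Each zero bit left in v accounts for one unit of LCS length.
    return m - bin(v).count("1")


if __name__ == "__main__":
    print(lcs_length("ABCBDAB", "BDCABA"))
```

Each character of `b` costs a constant number of operations on an `m`-bit vector, which is where a machine with `w`-bit words would gain roughly a factor of `w` over a cell-by-cell dynamic programme.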
-
Organizing unstructured information from books into a well-defined structure is a significant challenge in digital libraries. Most digital libraries can provide only search services at the granularity of books and few libraries allow books to be…
-
moz-hocr-edit provides a line-by-line interface for people to proofread the results of the Optical Character Recognition (OCR) process. OCR programs are not perfect at recognizing text, so human editing is often necessary.
-
With the advent of more powerful personal computers, inexpensive memory and digital cameras, curators around the world are working towards preserving historical documents on computers. Since many of the organizations for which they work have limited…
-
Linguists and speech researchers who use statistical methods often need to estimate the frequency of some type of item in a population containing items of various types. A common approach is to divide the number of cases observed in a sample by the…
-
The Tesseract OCR engine, which took part as the HP Research Prototype in the UNLV Fourth Annual Test of OCR Accuracy, is described in a comprehensive overview. Emphasis is placed on aspects that are novel or at least unusual in an OCR engine, including in…
-
This paper describes a new program, CORRECT, which takes words rejected by the Unix® SPELL program, proposes a list of candidate corrections, and sorts them by probability score. The probability scores are the novel contribution of this work. They…
-
In the past week, two friends (Dean and Bill) independently told me they were amazed at how Google does spelling correction so well and quickly. Type in a search like [speling] and Google comes back in 0.1 seconds or so with Did you mean: spelling.…
-
In this paper the Damerau-Levenshtein string difference metric is generalized in two ways to more accurately compensate for the types of errors that are present in the script recognition domain. First, the basic dynamic programming method for…
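For orientation, the baseline that such work builds on can be sketched as follows. This is the restricted (optimal string alignment) variant of the Damerau-Levenshtein distance with unit costs, not the generalized metric the paper proposes; the function name `osa_distance` is hypothetical.

```python
def osa_distance(s: str, t: str) -> int:
    """Restricted Damerau-Levenshtein distance with unit costs (illustrative sketch)."""
    rows, cols = len(s) + 1, len(t) + 1
    # d[i][j] is the distance between the first i characters of s
    # and the first j characters of t.
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters
    for j in range(cols):
        d[0][j] = j  # insert all j characters

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


if __name__ == "__main__":
    print(osa_distance("speling", "spelling"))
```

A generalization for recognition errors would typically replace the unit costs with error-specific weights; the details of the paper's two extensions are not recoverable from the truncated abstract and are not reproduced here.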
-
Research aimed at correcting words in text has focused on three progressively more difficult problems: (1) nonword error detection; (2) isolated-word error correction; and (3) context-dependent word correction. In response to the first problem,…
-
The paper will describe how web-based collaboration tools can engage users in the building of historical printed text resources created by mass digitisation projects. The drivers for developing such tools will be presented, identifying the benefits…
-
We propose a low cost method for the correction of the output of OCR engines through the use of human labor. The method employs an error estimator neural network that learns to assess the error probability of every word from ground-truth data. The…
-
This paper describes an automatic, context-sensitive, word-error correction system based on statistical language modeling (SLM) as applied to optical character recognition (OCR) post-processing. The system exploits information from multiple…


