Text recognition using anonymous CAPTCHA answers

Abstract

Internet companies use crowdsourcing to collect the large amounts of data needed to build products based on machine learning techniques. A significant source of such labels for OCR data sets is (re)CAPTCHA, which distinguishes humans from automated bots by asking them to recognize text and, at the same time, yields new labeled data. An important component of this approach to data collection is reducing the noisy labels produced by bots and unqualified users. In this paper, we address the problem of labeling text images via CAPTCHA, where user identification is generally impossible. We propose a new algorithm to aggregate multiple guesses collected through CAPTCHA. We employ incremental relabeling to minimize the number of guesses needed to obtain recognized text with good accuracy. The aggregation model and the stopping rule for our incremental relabeling are based on novel machine learning techniques and use meta-features of CAPTCHA tasks and accumulated guesses. Our experiments show that our approach can provide a large amount of accurately recognized text using a minimal number of user guesses. Finally, we report substantial improvements in an optical character recognition model after implementing our approach in Yandex [2].
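
To make the idea of incremental relabeling concrete, here is a minimal Python sketch. It is not the authors' algorithm: the abstract says their aggregation model and stopping rule are learned and use meta-features, whereas this sketch substitutes a plain majority vote and a fixed agreement threshold. The names `normalize`, `aggregate`, `incremental_relabel`, and `request_guess`, along with all parameter defaults, are hypothetical and chosen only for illustration.

```python
from collections import Counter
from typing import Callable, List, Optional, Tuple


def normalize(guess: str) -> str:
    """Canonicalize a raw guess so trivially different answers match."""
    return guess.strip().lower()


def aggregate(guesses: List[str]) -> Tuple[str, float]:
    """Return the most frequent normalized guess and its share of all guesses.

    Assumption: simple majority vote stands in for a learned aggregation model.
    """
    counts = Counter(normalize(g) for g in guesses)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(guesses)


def incremental_relabel(
    request_guess: Callable[[], str],
    min_guesses: int = 2,
    max_guesses: int = 7,
    threshold: float = 0.75,
) -> Tuple[Optional[str], int]:
    """Request guesses one at a time until the stopping rule fires.

    Assumption: the stopping rule is a fixed agreement threshold, not a
    learned one. Returns (accepted text or None, number of guesses used).
    """
    guesses: List[str] = []
    while len(guesses) < max_guesses:
        guesses.append(request_guess())
        if len(guesses) >= min_guesses:
            answer, share = aggregate(guesses)
            if share >= threshold:
                return answer, len(guesses)
    # Budget exhausted without sufficient agreement: leave the image unlabeled.
    return None, len(guesses)
```

For example, `incremental_relabel(iter(["cat", "cot", "cat", "cat"]).__next__)` stops after the fourth guess, once three of the four answers agree. A learned stopping rule could presumably stop earlier or later depending on task difficulty, which is the kind of saving the abstract describes.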

Cite

Shishkin, A., Bezzubtseva, A., Fedorova, V., Drutsa, A., & Gusev, G. (2020). Text recognition using anonymous CAPTCHA answers. In WSDM 2020 - Proceedings of the 13th International Conference on Web Search and Data Mining (pp. 537–545). Association for Computing Machinery, Inc. https://doi.org/10.1145/3336191.3371795
