Text recognition using anonymous CAPTCHA answers

Alexander Shishkin; Anastasia Bezzubtseva; Valentina Fedorova; Alexey Drutsa; Gleb Gusev

Conference Proceedings, Open Access

WSDM 2020 - Proceedings of the 13th International Conference on Web Search and Data Mining (2020) 537-545

DOI: 10.1145/3336191.3371795

2 Citations

17 Readers

Abstract

Internet companies use crowdsourcing to collect the large amounts of data needed to build products based on machine learning techniques. A significant source of such labels for OCR data sets is (re)CAPTCHA, which distinguishes humans from automated bots by asking them to recognize text and, at the same time, collects new labeled data. An important component of such an approach to data collection is the reduction of noisy labels produced by bots and non-qualified users. In this paper, we address the problem of labeling text images via CAPTCHA, where user identification is generally impossible. We propose a new algorithm to aggregate multiple guesses collected through CAPTCHA. We employ incremental relabeling to minimize the number of guesses needed to obtain recognized text with good accuracy. The aggregation model and the stopping rule for our incremental relabeling are based on novel machine learning techniques and use meta features of CAPTCHA tasks and accumulated guesses. Our experiments show that our approach can provide a large amount of accurately recognized text using a minimal number of user guesses. Finally, we report substantial improvements in an optical character recognition model after deploying our approach at Yandex [2].
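The general idea of incremental relabeling with a stopping rule can be illustrated with a minimal sketch. This is not the aggregation model from the paper, which relies on learned models and meta features. As a stand-in, the sketch uses plain majority voting over normalized guesses and a fixed agreement threshold. The function names (`aggregate`, `incremental_relabel`) and the parameters (`min_guesses`, `max_guesses`, `threshold`) are hypothetical and chosen only for illustration.

```python
from collections import Counter
from typing import Callable, List, Optional, Tuple


def aggregate(guesses: List[str]) -> Tuple[str, float]:
    """Return the most frequent normalized guess and its share of all guesses.

    Assumption: simple majority voting replaces the learned aggregation model.
    """
    counts = Counter(g.strip().lower() for g in guesses)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(guesses)


def incremental_relabel(
    next_guess: Callable[[], Optional[str]],
    min_guesses: int = 2,
    max_guesses: int = 5,
    threshold: float = 0.75,
) -> Tuple[Optional[str], int]:
    """Request guesses one at a time and stop once agreement is high enough.

    Assumption: the stopping rule is a fixed agreement threshold rather than
    a trained model. Returns (accepted answer or None, guesses used).
    """
    guesses: List[str] = []
    while len(guesses) < max_guesses:
        guess = next_guess()
        if guess is None:  # no more anonymous answers available
            break
        guesses.append(guess)
        if len(guesses) >= min_guesses:
            answer, share = aggregate(guesses)
            if share >= threshold:
                return answer, len(guesses)
    # The guess budget ran out without sufficient agreement.
    return None, len(guesses)


# Usage: simulate a stream of anonymous answers for one text image.
stream = iter(["cat", "Cat ", "cot"])
result = incremental_relabel(lambda: next(stream, None))
print(result)
```

In this sketch the third guess is never requested, because the first two already agree after normalization. Saving guesses in this way is the purpose of a stopping rule.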

Cite (APA)

Shishkin, A., Bezzubtseva, A., Fedorova, V., Drutsa, A., & Gusev, G. (2020). Text recognition using anonymous CAPTCHA answers. In WSDM 2020 - Proceedings of the 13th International Conference on Web Search and Data Mining (pp. 537–545). Association for Computing Machinery, Inc. https://doi.org/10.1145/3336191.3371795
