Low-Cost Breaking of a Unique Chinese Language CAPTCHA Using Curriculum Learning and Clustering

Title: Low-Cost Breaking of a Unique Chinese Language CAPTCHA Using Curriculum Learning and Clustering
Publication Type: Conference Paper
Year of Publication: 2018
Authors: Stein, G., Peng, Q.
Conference Name: 2018 IEEE International Conference on Electro/Information Technology (EIT)
Date Published: May
Keywords: automated access, CAPTCHA, captchas, Chinese language question-and-answer Website, CNN, composability, convolution, convolutional neural network, convolutional neural networks, correct response, curriculum clustering, curriculum learning, feature map, feedforward neural nets, Human Behavior, human user, image distortion, inverted character, inverted characters, inverted symbols, Kernel, learning (artificial intelligence), low-cost breaking, machine learning methods, Microsoft Windows, natural language processing, OCR software, optical character recognition, pattern clustering, potential training methods, pubcrawl, security, Task Analysis, text analysis, text distortion, text-based CAPTCHAs focus, Training, transcription tasks, unique Chinese language CAPTCHA, web services, Web sites

Text-based CAPTCHAs are still commonly used to prevent automated access to web services. By displaying an image of distorted text, they aim to present a challenge that OCR software cannot interpret correctly but that a human user can answer easily. This work examines a CAPTCHA used by a popular Chinese language question-and-answer website and how resilient it is to modern machine learning methods. While most text-based CAPTCHAs focus on transcription tasks, the CAPTCHA solved in this work is based on localizing inverted symbols in a distorted image. A convolutional neural network (CNN) was created to evaluate the likelihood that a region of the image belongs to an inverted character. It is used with a feature map and clustering to identify potential locations of inverted characters. The CNN was trained using curriculum learning, and this approach was compared to other potential training methods. The proposed method determined the correct response in 95.2% of cases on a simulated CAPTCHA and in 67.6% of cases on a set of real CAPTCHAs. Potential methods to increase the difficulty of the CAPTCHA and the success rate of the automated solver are considered.
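
The localization step described in the abstract (per-region likelihoods collected into a map, then clustered into candidate locations) might be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the score map is taken as given rather than produced by a CNN, the function name `locate_clusters` and the `threshold` parameter are hypothetical, and simple 4-connected grouping stands in for whatever clustering method the paper actually uses.

```python
def locate_clusters(score_map, threshold=0.5):
    """Group above-threshold cells into 4-connected clusters.

    score_map: list of rows of floats in [0, 1], standing in for the
    per-region likelihoods a classifier would produce (assumption).
    threshold: hypothetical cutoff, assumed to be greater than zero.
    Returns one score-weighted (row, col) centroid per cluster.
    """
    rows, cols = len(score_map), len(score_map[0])
    seen = [[False] * cols for _ in range(rows)]
    centers = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or score_map[r][c] < threshold:
                continue
            # Flood fill from this cell to collect one cluster.
            seen[r][c] = True
            stack, cells = [(r, c)], []
            while stack:
                y, x = stack.pop()
                cells.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx]
                            and score_map[ny][nx] >= threshold):
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            # Weight each cell by its score so strong responses dominate.
            total = sum(score_map[y][x] for y, x in cells)
            cy = sum(y * score_map[y][x] for y, x in cells) / total
            cx = sum(x * score_map[y][x] for y, x in cells) / total
            centers.append((cy, cx))
    return centers


# Toy score map with two separate high-likelihood regions.
grid = [[0.0] * 10 for _ in range(6)]
grid[1][1] = grid[1][2] = 0.9
grid[4][7] = 0.8
centers = locate_clusters(grid)
print(centers)
```

Each returned centroid is a candidate location in map coordinates; mapping it back to image pixels would depend on the window size and stride used to build the map, which this record does not specify.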

Citation Key: stein_low-cost_2018