A Generative Adversarial Learning Framework for Breaking Text-Based CAPTCHA in the Dark Web

Title: A Generative Adversarial Learning Framework for Breaking Text-Based CAPTCHA in the Dark Web
Publication Type: Conference Paper
Year of Publication: 2020
Authors: Zhang, N., Ebrahimi, M., Li, W., Chen, H.
Conference Name: 2020 IEEE International Conference on Intelligence and Security Informatics (ISI)
Date Published: Nov. 2020
ISBN Number: 978-1-7281-8800-3
Keywords: automated CAPTCHA breaking, cyber threat intelligence, dark web, generative adversarial networks, Human Behavior, human factors, pubcrawl

Cyber threat intelligence (CTI) necessitates automated monitoring of dark web platforms (e.g., Dark Net Markets and carding shops) on a large scale. While there are existing methods for collecting data from the surface web, large-scale dark web data collection is commonly hindered by anti-crawling measures. Text-based CAPTCHA, which requires the user to recognize a combination of hard-to-read characters, is the most prohibitive of these measures. Dark web CAPTCHA patterns are intentionally designed with additional background noise and variable character length to prevent automated CAPTCHA breaking. Existing CAPTCHA breaking methods cannot address these challenges and are therefore not applicable to the dark web. In this study, we propose a novel framework for breaking text-based CAPTCHA in the dark web. The proposed framework uses a Generative Adversarial Network (GAN) to counteract dark web-specific background noise and leverages an enhanced character segmentation algorithm. The proposed method was evaluated on both benchmark and dark web CAPTCHA testbeds. It significantly outperformed the state-of-the-art baseline methods on all datasets, achieving a success rate of over 92.08% on dark web testbeds. Our research enables the CTI community to develop advanced capabilities for large-scale dark web monitoring.
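
The abstract does not specify how the enhanced character segmentation algorithm works. As a rough illustration of why segmentation matters for variable-length CAPTCHA, the sketch below uses a generic vertical projection technique, which is not the authors' method. It assumes the image has already been denoised and binarized (for example, by a background-removal step such as the GAN stage mentioned above) into rows of 0/1 pixels, where 1 marks character ink. The function name `segment_columns` and the parameter `min_width` are hypothetical.

```python
from typing import List, Tuple


def segment_columns(image: List[List[int]], min_width: int = 1) -> List[Tuple[int, int]]:
    """Split a binarized image into character candidates.

    Returns (start, end) column ranges, end exclusive, for each run of
    consecutive columns that contain at least one foreground pixel.
    Generic vertical projection; NOT the algorithm from the paper.
    """
    if not image:
        return []
    width = len(image[0])
    # Vertical projection: number of foreground pixels in each column.
    profile = [sum(row[x] for row in image) for x in range(width)]

    segments: List[Tuple[int, int]] = []
    start = None
    for x, count in enumerate(profile):
        if count > 0 and start is None:
            start = x  # a run of ink columns begins
        elif count == 0 and start is not None:
            if x - start >= min_width:
                segments.append((start, x))
            start = None  # the run ended at an empty column
    # Close a run that touches the right edge of the image.
    if start is not None and width - start >= min_width:
        segments.append((start, width))
    return segments


# Toy 3x9 "image" with three ink runs of different widths.
toy = [
    [0, 1, 1, 0, 0, 1, 0, 1, 1],
    [0, 1, 0, 0, 0, 1, 0, 0, 1],
    [0, 1, 1, 0, 0, 1, 0, 1, 1],
]
print(segment_columns(toy))
```

Because the number of segments is discovered from the image rather than fixed in advance, an approach of this kind can handle a variable number of characters. It breaks down when characters touch or when residual noise bridges the gaps, which is presumably part of what an enhanced algorithm would need to handle.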

Citation Key: zhang_generative_2020