SaTC: CORE: Small: Cybersecurity Big Data Research for Hacker Communities: A Topic and Language Modeling Approach

Project Details

Lead PI


Performance Period

Oct 01, 2019 - Sep 30, 2022


University of Arizona


National Science Foundation

Award Number

Cybercrime is estimated to cost the global economy around $445 billion annually, largely through intellectual property theft and financial fraud using stolen consumer data. Large-scale hacking and data theft incidents occur regularly, and many cyberattacks result in the theft of sensitive personal information or intellectual property. Cybersecurity will remain a critical problem for the foreseeable future, necessitating more research on a large, diverse, covert, and evolving international hacker community. Computer science and social science researchers face non-trivial challenges, including the technical difficulty of data collection and analytics, the massive volume of data, the heterogeneous and covert nature of data elements, and the difficulty of comprehending common hacker terms and concepts across regions. To address these challenges, this project has two research goals: 1) advance current capabilities for scalable identification, collection, and analysis of international hacker community content, and 2) contribute to the cybersecurity community by developing new big data techniques that enable researchers to analyze hacker content and content in other related domains. The project achieves its impact by sharing and disseminating our comprehensive hacker community data collection, advanced collection strategies, and innovative analytical approaches within the NSF Secure and Trustworthy Cyberspace data science and other communities.

This project aims to develop a large, comprehensive, and longitudinal testbed of all significant international online hacker community content, including forums, IRC channels, underground economies, and other emerging hacker assets, for the cybersecurity and big data communities. The analytical approaches primarily address large-scale analysis of international hacker community content for proactive cyber threat intelligence (CTI). To analyze hacker content, the project develops an innovative, holistic, and proactive CTI framework encompassing Cross-Lingual Knowledge Transfer to alleviate the language barrier, Nonparametric Supervised Topic Modeling to profile key hacker assets, and Scalable Dynamic Topic Modeling to inform emerging threat detection. The University of Arizona's (UA) National Security Agency-designated Center of Academic Excellence in Cyber Defense, Research, and Operations, NSF Scholarship-for-Service (SFS) CyberCorps, and top-ranked Master's in Cybersecurity programs position the project for synergy with teaching and research. Techniques developed in this project advance not only CTI knowledge but also deep transfer learning, deep generative modeling, supervised topic modeling, dynamic topic modeling, neural variational inference, and numerous other important domains. Results from this research will be disseminated through various academic and cybersecurity industry channels, such as undergraduate and graduate curricula, the IEEE Intelligence and Security Informatics conference, the National Cyber-Forensics Training Alliance (NCFTA), the Society for the Policing of Cyberspace (POLCYB), and NSF CyberCorps SFS.