Automated Threat Report Classification over Multi-Source Data

Title: Automated Threat Report Classification over Multi-Source Data
Publication Type: Conference Paper
Year of Publication: 2018
Authors: Ayoade, G., Chandra, S., Khan, L., Hamlen, K., Thuraisingham, B.
Conference Name: 2018 IEEE 4th International Conference on Collaboration and Internet Computing (CIC)
Date Published: October
Keywords: advanced persistent threats, automated threat report classification, bias correction, business data processing, Collaboration, command and control systems, data mining, defense systems, document handling, enterprise system defenders, feature extraction, Human Behavior, learning (artificial intelligence), machine learning model, Metrics, multisource data, natural language processing, natural language processing techniques, NLP, Organizations, pattern classification, pubcrawl, Resiliency, Scalability, security, security of data, Standards organizations, Threat report, threat report documents, Training

With the rise of targeted attacks such as advanced persistent threats (APTs), enterprise system defenders require comprehensive frameworks that allow them to collaborate and evaluate their defense systems against such attacks. MITRE has developed a framework that includes a database of different kill-chains, tactics, techniques, and procedures that attackers employ to perform these attacks. In this work, we leverage natural language processing techniques to extract attacker actions from threat report documents generated by different organizations and automatically classify them into standardized tactics and techniques, while providing relevant mitigation advisories for each attack. A naive way to achieve this is to train a machine learning model to predict labels that associate the reports with relevant categories. In practice, however, sufficient labeled data for model training is not always readily available, so training and test data may come from different sources, which introduces bias. A naive model would typically underperform in such a situation. We address this challenge by incorporating an importance weighting scheme called bias correction, which makes efficient use of the available labeled data given the threat reports whose categories are to be predicted. We empirically evaluated our approach on 18,257 real-world threat reports generated between 2000 and 2018 by various computer security organizations, and demonstrate its superiority by comparing its performance with that of an existing approach.
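The abstract does not say how the importance weights are computed, so the following is only a minimal sketch of the general idea behind importance weighting under a source/target mismatch, not the authors' method. It assumes one common stand-in: a domain classifier is trained to tell labeled (source) samples from unlabeled (target) samples, and its odds are used as an estimate of the density ratio p_target(x) / p_source(x). All function names (`fit_logistic`, `importance_weights`, `predict_proba`) and the toy one-dimensional features are hypothetical; a real pipeline would use text features extracted from the reports.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, sample_weight=None, lr=0.5, epochs=500):
    """Weighted logistic regression via batch gradient descent.

    Returns a (weights, bias) tuple. Hypothetical helper for illustration.
    """
    n, d = len(X), len(X[0])
    sw = [1.0] * n if sample_weight is None else sample_weight
    total = sum(sw)
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi, si in zip(X, y, sw):
            pred = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = si * (pred - yi)  # each sample's error scaled by its weight
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / total for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / total
    return w, b


def predict_proba(model, x):
    w, b = model
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def importance_weights(X_source, X_target):
    """Estimate p_target(x) / p_source(x) for every source sample.

    Assumption: the ratio is approximated by the odds of a domain
    classifier (source = 0, target = 1), corrected for the sample sizes.
    """
    X = X_source + X_target
    y = [0] * len(X_source) + [1] * len(X_target)
    domain_model = fit_logistic(X, y)
    prior_ratio = len(X_source) / len(X_target)
    weights = []
    for x in X_source:
        p = min(max(predict_proba(domain_model, x), 1e-6), 1.0 - 1e-6)
        weights.append(prior_ratio * p / (1.0 - p))
    return weights


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy data: labeled source samples and unlabeled, shifted target samples.
    X_source = [[rng.gauss(0.0, 1.0)] for _ in range(200)]
    X_target = [[rng.gauss(1.5, 1.0)] for _ in range(200)]
    y_source = [1 if x[0] > 0.5 else 0 for x in X_source]

    weights = importance_weights(X_source, X_target)
    # Train the task classifier on source data, emphasizing target-like samples.
    model = fit_logistic(X_source, y_source, sample_weight=weights)
    print("mean importance weight:", sum(weights) / len(weights))
    print("P(label=1 | x=1.0):", predict_proba(model, [1.0]))
```

Source samples that resemble the target data receive larger weights, so the weighted classifier concentrates on the region where predictions are actually needed. The domain-classifier estimate was chosen here only because it is short to write down; other density-ratio estimators would fit the same interface.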

Citation Key: ayoade_automated_2018