A Multi-Classifier Framework for Open Source Malware Forensics

Title: A Multi-Classifier Framework for Open Source Malware Forensics
Publication Type: Conference Paper
Year of Publication: 2018
Authors: Amjad, N., Afzal, H., Amjad, M. F., Khan, F. A.
Conference Name: 2018 IEEE 27th International Conference on Enabling Technologies: Infrastructure for Collaborative Enterprises (WETICE)
ISBN Number: 978-1-5386-6916-7
Keywords: automated manner, Bayes methods, computer network security, computer viruses, cyber attack pattern, Cyber Attacks, data cleaning, data mining, Data models, ever-evolving threats, extensive pre-modeling techniques, feature extraction, future cyber threats, Gaussian Naive Bayes classifier, heuristics updates, invasive software, learning (artificial intelligence), machine learning, machine learning algorithms, machine learning models, Malware, malware analysis reports, malware attacks, malware forensics, manual analysis, multiclassifier framework, open source malware forensics, open source malware sandbox, pattern classification, pubcrawl, resilience, Resiliency, Scalability, Security Heuristics, security of data, traditional anti-virus technologies, Training

Traditional anti-virus technologies have failed to keep pace with the proliferation of malware because their signature and heuristic updates are slow. Similarly, limited time and resources make it impractical to analyze every malware sample manually. There is a need to learn from this vast quantity of data, which contains cyber attack patterns, in an automated manner in order to adapt proactively to ever-evolving threats. Machine learning offers unique advantages for learning from past cyber attacks in order to handle future cyber threats. The purpose of this research is to propose a framework for multi-class classification of malware into well-known categories by applying different machine learning models to a corpus of malware analysis reports. These reports are generated in an automated manner through an open source malware sandbox. We applied extensive pre-modeling techniques for data cleaning, feature exploration, and feature engineering to prepare training and test datasets. The best possible hyper-parameters are selected to build the machine learning models. The prepared datasets are then used to train the machine learning classifiers and to compare their prediction accuracy. Finally, the results are validated through a comprehensive 10-fold cross-validation methodology. The best results are achieved with the Gaussian Naive Bayes classifier, with a random accuracy of 96% and a 10-fold cross-validation accuracy of 91.2%. The framework can be deployed in an operational environment to learn from malware attacks and proactively adapt matching countermeasures.
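
The evaluation step described in the abstract (train a Gaussian Naive Bayes classifier, then validate it with 10-fold cross-validation) can be illustrated with a minimal sketch. This is not the authors' implementation: the data is synthetic, the three class labels and the three numeric features are hypothetical placeholders for values that would be engineered from sandbox reports, and the variance-smoothing constant is an assumption. It uses only the Python standard library.

```python
import math
import random
from collections import defaultdict


def fit_gaussian_nb(X, y, var_smoothing=1e-9):
    """Estimate a log prior plus per-feature mean and variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # var_smoothing (an assumed small constant) avoids division by zero.
        variances = [
            sum((v - m) ** 2 for v in col) / n + var_smoothing
            for col, m in zip(columns, means)
        ]
        model[label] = (math.log(n / len(X)), means, variances)
    return model


def predict(model, row):
    """Return the class with the highest log posterior for one feature row."""
    def log_posterior(params):
        log_prior, means, variances = params
        return log_prior + sum(
            -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
            for x, m, var in zip(row, means, variances)
        )
    return max(model, key=lambda label: log_posterior(model[label]))


def accuracy(model, X, y):
    """Fraction of rows whose predicted label matches the true label."""
    return sum(predict(model, r) == t for r, t in zip(X, y)) / len(y)


def k_fold_accuracy(X, y, k=10, seed=0):
    """Mean held-out accuracy over k folds of a shuffled index list."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    scores = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in indices if i not in held_out]
        model = fit_gaussian_nb([X[i] for i in train], [y[i] for i in train])
        scores.append(accuracy(model, [X[i] for i in fold], [y[i] for i in fold]))
    return sum(scores) / k


def make_synthetic_reports(n_per_class=60, seed=1):
    """Hypothetical stand-in for features engineered from sandbox reports."""
    rng = random.Random(seed)
    # Class names and cluster centers are illustrative assumptions only.
    centers = {
        "trojan": (0.0, 0.0, 0.0),
        "worm": (4.0, 4.0, 0.0),
        "ransomware": (0.0, 4.0, 4.0),
    }
    X, y = [], []
    for label, center in centers.items():
        for _ in range(n_per_class):
            X.append([rng.gauss(c, 1.0) for c in center])
            y.append(label)
    return X, y


X, y = make_synthetic_reports()
cv_accuracy = k_fold_accuracy(X, y, k=10)
print(f"10-fold CV accuracy on synthetic data: {cv_accuracy:.3f}")
```

Because the synthetic clusters are well separated, the score printed here says nothing about real malware data; the sketch only shows how a single-split accuracy and a cross-validated accuracy can differ in how they are computed.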

Citation Key: amjad_multi-classifier_2018