A Deep Learning-based Malware Hunting Technique to Handle Imbalanced Data

Title: A Deep Learning-based Malware Hunting Technique to Handle Imbalanced Data
Publication Type: Conference Paper
Year of Publication: 2020
Authors: Moti, Z., Hashemi, S., Jahromi, A. N.
Conference Name: 2020 17th International ISC Conference on Information Security and Cryptology (ISCISC)
Keywords: antivirus companies, CNN, common threats, Computational modeling, convolutional neural nets, convolutional neural network, convolutional neural network (CNN), cyber-security dangers, Data models, Deep Learning, deep learning-based malware hunting technique, feature extraction, Generative Adversarial Learning, generative adversarial network, Generative Adversarial Network (GAN), generative adversarial networks, Imbalanced, imbalanced training data sets, Internet, invasive software, learning (artificial intelligence), long short term memory, Long Short Term Memory (LSTM), LSTM, machine learning algorithms, machine learning approaches, Malware, malware samples, multiple class classification problems, Opcode, opcode sequences, oversampling minority classes, pattern classification, Predictive Metrics, pubcrawl, recurrent neural nets, Resiliency, Scalability, Training
Abstract: With the increasing use of computers and the Internet, more people are exposed to cyber-security dangers. According to antivirus companies, malware is one of the most common threats on the Internet, so a practical solution is critical. Current methods use machine learning approaches to classify malware samples automatically. Despite their success, the accuracy and efficiency of these techniques are still inadequate, especially for multi-class classification problems and imbalanced training data sets. To mitigate this problem, we use deep learning-based algorithms for classification and for generation of new malware samples. Our model is based on opcode sequences, which are given to the model without any pre-processing. We use a novel generative adversarial network to generate new opcode sequences for oversampling minority classes. We also propose a model that combines a Convolutional Neural Network (CNN) and Long Short Term Memory (LSTM) to classify malware samples. The CNN captures short-term dependencies between features, while the LSTM captures longer-term dependencies. The experimental results show that our method can classify malware samples into their corresponding families effectively. Our model achieves 98.99% validation accuracy.
Citation Key: moti_deep_2020
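
The abstract mentions oversampling minority classes with generated opcode sequences but gives no class counts or balancing target. As a rough illustration only, the sketch below assumes every family is topped up to the size of the largest one and computes how many synthetic samples each family would need. The function name `oversampling_budget` and the family labels are hypothetical and do not come from the paper; the generator itself is not shown.

```python
from collections import Counter


def oversampling_budget(labels):
    """Return, per class, how many extra samples are needed to match the largest class.

    Assumption (not from the paper): classes are balanced up to the majority-class size.
    """
    counts = Counter(labels)
    if not counts:
        return {}
    target = max(counts.values())
    return {cls: target - n for cls, n in counts.items()}


# Hypothetical malware family labels for a small, imbalanced training set.
labels = ["family_a"] * 6 + ["family_b"] * 2 + ["family_c"] * 1
budget = oversampling_budget(labels)
print(budget)  # {'family_a': 0, 'family_b': 4, 'family_c': 5}
```

In a pipeline like the one the abstract describes, these counts would tell a generative model how many new sequences to produce per family before the classifier is trained.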