Machine Learning based Malware Detection in Cloud Environment using Clustering Approach

Title: Machine Learning based Malware Detection in Cloud Environment using Clustering Approach
Publication Type: Conference Paper
Year of Publication: 2020
Authors: Kumar, Rahul; Sethi, Kamalakanta; Prajapati, Nishant; Rout, Rashmi Ranjan; Bera, Padmalochan
Conference Name: 2020 11th International Conference on Computing, Communication and Networking Technologies (ICCCNT)
Date Published: July 2020
ISBN Number: 978-1-7281-6851-7
Keywords: cloud, cloud computing, clustering, Collaboration, collaboration agreements, composability, Computational modeling, cuckoo sandbox, False Positive Rate (FPR), feature extraction, machine learning, Malware, malware detection, policy-based governance, Principal Component Analysis (PCA), pubcrawl, Sandboxing, Scalability, Solid modeling, Training, Trend Micro Locality Sensitive Hashing (TLSH)

Enforcing security and resilience in a cloud platform is an essential but challenging problem because a large number of heterogeneous applications run on shared resources. A security analysis system that can detect threats or malware must exist inside the cloud infrastructure. Much research has been done on machine learning-driven malware analysis, but existing approaches are limited by their computational complexity and detection accuracy. To overcome these drawbacks, we propose a new malware detection system based on clustering and Trend Micro Locality Sensitive Hashing (TLSH). We use the Cuckoo sandbox, which produces dynamic analysis reports by executing files in an isolated environment. A novel feature extraction algorithm extracts essential features from the malware reports obtained from the Cuckoo sandbox. The most important features are then selected using principal component analysis (PCA), random forest, and Chi-square feature selection methods. Experimental results are obtained for the clustering and non-clustering approaches on three classifiers: Decision Tree, Random Forest, and Logistic Regression. The proposed model achieves higher classification accuracy and a lower false positive rate (FPR) than state-of-the-art works and the non-clustering approach, at significantly lower computational cost.
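
The abstract reports its results as classification accuracy and false positive rate (FPR). As a reminder of how these two metrics are usually computed for a binary malware/benign classifier, here is a minimal sketch in plain Python. It is not the authors' evaluation code. The function names, the label convention (1 = malware, 0 = benign), and the sample labels are assumptions made purely for illustration, and the numbers are not results from the paper.

```python
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn), assuming label 1 = malware and 0 = benign."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth not in (0, 1) or pred not in (0, 1):
            raise ValueError("labels must be 0 (benign) or 1 (malware)")
        if truth == 1 and pred == 1:
            tp += 1  # malware correctly flagged
        elif truth == 0 and pred == 1:
            fp += 1  # benign file wrongly flagged as malware
        elif truth == 0 and pred == 0:
            tn += 1  # benign file correctly passed
        else:
            fn += 1  # malware missed
    return tp, fp, tn, fn


def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Fraction of all samples that were classified correctly."""
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("no samples to evaluate")
    return (tp + tn) / total


def false_positive_rate(fp: int, tn: int) -> float:
    """FPR = FP / (FP + TN): the share of benign samples flagged as malware."""
    benign = fp + tn
    if benign == 0:
        raise ValueError("FPR is undefined without benign samples")
    return fp / benign


if __name__ == "__main__":
    # Made-up labels for illustration only (not data from the paper).
    y_true = [1, 1, 1, 0, 0, 0, 0, 1]
    y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    print("accuracy:", accuracy(tp, fp, tn, fn))
    print("FPR:", false_positive_rate(fp, tn))
```

A lower FPR means fewer benign files are wrongly flagged as malware, so a detector can improve on accuracy and FPR at the same time only if it reduces false alarms without missing more malware.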

Citation Key: kumar_machine_2020