Parallelization of Machine Learning Applied to Call Graphs of Binaries for Malware Detection

Title: Parallelization of Machine Learning Applied to Call Graphs of Binaries for Malware Detection
Publication Type: Conference Paper
Year of Publication: 2017
Authors: Searles, R., Xu, L., Killian, W., Vanderbruggen, T., Forren, T., Howe, J., Pearson, Z., Shannon, C., Simmons, J., Cavazos, J.
Conference Name: 2017 25th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP)
Keywords: adaptive learning-based techniques, Binary Analysis, byte-level n-grams, call graphs, compiler intermediate representations, compiler techniques, dynamic trace analysis, Engines, feature extraction, graph theory, graph-based program representations, graphics processing units, Heterogeneous computing, Human Behavior, invasive software, Kernel, learning (artificial intelligence), machine learning, machine learning algorithms, Malware, malware analysis, malware detection, Metrics, OpenMP version, parallelization, privacy, program compilers, pubcrawl, resilience, Resiliency, shortest path graph kernel, SPGK, support vector machine, Support vector machines, SVM

Malicious applications have become increasingly numerous. This demands adaptive, learning-based techniques for constructing malware detection engines instead of traditional manual strategies. Prior work on learning-based malware detection engines primarily focuses on dynamic trace analysis and byte-level n-grams. Our approach in this paper differs in that we use compiler intermediate representations, i.e., the call graph representation of binaries. Using graph-based program representations for learning exposes the structure of the program, which can be used to learn more advanced patterns. We use the Shortest Path Graph Kernel (SPGK) to identify similarities between call graphs extracted from binaries. The output similarity matrix is fed into a Support Vector Machine (SVM) algorithm to construct highly accurate models that predict whether a binary is malicious. However, SPGK is computationally expensive due to the size of the input graphs. Therefore, we evaluate different parallelization methods for CPUs and GPUs to speed up this kernel, allowing us to continuously construct up-to-date models in a timely manner. Our hybrid implementation, which leverages both CPU and GPU, yields the best performance, achieving up to a 14.2x improvement over our already optimized OpenMP version. We compared our graph-based models to previous state-of-the-art feature-vector 2-gram and 3-gram models on a dataset consisting of over 22,000 binaries. We show that our classification accuracy using graphs is over 19% higher than either n-gram model and gives a false positive rate (FPR) of less than 0.1%. We are also able to consider large call graphs and dataset sizes because of the reduced execution time of our parallelized SPGK implementation.
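To make the pipeline in the abstract concrete, the following is a minimal, sequential sketch of one common form of a shortest path graph kernel. It is not the authors' implementation: it assumes unweighted, directed call graphs given as edge lists, ignores node labels, and compares shortest-path lengths with a simple equality (delta) kernel. All function names (`shortest_path_lengths`, `path_length_histogram`, `spgk`, `similarity_matrix`) are hypothetical and chosen only for illustration.

```python
from collections import Counter

INF = float("inf")


def shortest_path_lengths(n, edges):
    """All-pairs shortest paths (Floyd-Warshall) for nodes 0..n-1.

    Assumption: the call graph is directed and every edge has weight 1.
    """
    dist = [[INF] * n for _ in range(n)]
    for u, v in edges:
        dist[u][v] = 1
    for i in range(n):
        dist[i][i] = 0
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist


def path_length_histogram(n, edges):
    """Count how many ordered node pairs are connected at each finite distance."""
    dist = shortest_path_lengths(n, edges)
    return Counter(
        dist[i][j]
        for i in range(n)
        for j in range(n)
        if i != j and dist[i][j] != INF
    )


def spgk(g1, g2):
    """Delta-kernel variant: number of path pairs (one per graph) of equal length."""
    h1 = path_length_histogram(*g1)
    h2 = path_length_histogram(*g2)
    return sum(count * h2[length] for length, count in h1.items())


def similarity_matrix(graphs):
    """Pairwise kernel values; this is the matrix an SVM would consume."""
    return [[spgk(a, b) for b in graphs] for a in graphs]


# Two toy "call graphs", each given as (node_count, edge_list).
chain = (3, [(0, 1), (1, 2)])  # f0 calls f1, f1 calls f2
star = (3, [(0, 1), (0, 2)])   # f0 calls f1 and f2
matrix = similarity_matrix([chain, star])
```

A matrix like this could be passed to any SVM implementation that accepts a precomputed kernel, for example scikit-learn's `SVC(kernel="precomputed")`. The cubic all-pairs step and the quadratic number of graph pairs suggest why the kernel becomes expensive on large graphs and datasets, which is the cost the paper's CPU and GPU parallelization targets.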

Citation Key: searles_parallelization_2017