Visible to the public HAC-T and Fast Search for Similarity in Security

TitleHAC-T and Fast Search for Similarity in Security
Publication TypeConference Paper
Year of Publication2020
AuthorsOliver, J., Ali, M., Hagen, J.
Conference Name2020 International Conference on Omni-layer Intelligent Systems (COINS)
KeywordsApproximate Nearest Neighbour, cluster malware, cluster quality, clustering, Clustering algorithms, clustering technique, computer viruses, cryptography, fast nearest neighbor search techniques, Fuzzy Hashing, HAC-T, Hierarchical Agglomerative Clustering (HAC), Indexes, Malware, Measurement, nearest neighbor search, nearest neighbour methods, Predictive Metrics, pubcrawl, search problems, security, security analysts, security of data, security operations center operators, similarity digests, Standards, threshold based hierarchical agglomerative clustering algorithm, Tools, tree-based index, Trend Locality Sensitive Hashing (TLSH)
AbstractSimilarity digests have gained popularity for many security applications like blacklisting/whitelisting, and finding similar variants of malware. TLSH has been shown to be particularly good at hunting similar malware, and is resistant to evasion as compared to other similarity digests like ssdeep and sdhash. Searching and clustering are fundamental tools which help the security analysts and security operations center (SOC) operators in hunting and analyzing malware. Current approaches which aim to cluster malware are not scalable enough to keep up with the vast amount of malware and goodware available in the wild. In this paper, we present techniques which allow for fast search and clustering of TLSH hash digests which can aid analysts to inspect large amounts of malware/goodware. Our approach builds on fast nearest neighbor search techniques to build a tree-based index which performs fast search based on TLSH hash digests. The tree-based index is used in our threshold based Hierarchical Agglomerative Clustering (HAC-T) algorithm which is able to cluster digests in a scalable manner. Our clustering technique can cluster digests in O (n logn) time on average. We performed an empirical evaluation by comparing our approach with many standard and recent clustering techniques. We demonstrate that our approach is much more scalable and still is able to produce good cluster quality. We measured cluster quality using purity on 10 million samples obtained from VirusTotal. We obtained a high purity score in the range from 0.97 to 0.98 using labels from five major anti-virus vendors (Kaspersky, Microsoft, Symantec, Sophos, and McAfee) which demonstrates the effectiveness of the proposed method.
Citation Keyoliver_hac-t_2020