Visible to the public Biblio

Filters: Author is Yin, Heng  [Clear All Filters]
Feng, Qian, Zhou, Rundong, Xu, Chengcheng, Cheng, Yao, Testa, Brian, Yin, Heng.  2016.  Scalable Graph-based Bug Search for Firmware Images. Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. :480–491.

Because of rampant security breaches in IoT devices, searching vulnerabilities in massive IoT ecosystems is more crucial than ever. Recent studies have demonstrated that control-flow graph (CFG) based bug search techniques can be effective and accurate in IoT devices across different architectures. However, these CFG-based bug search approaches are far from being scalable to handle an enormous amount of IoT devices in the wild, due to their expensive graph matching overhead. Inspired by rich experience in image and video search, we propose a new bug search scheme which addresses the scalability challenge in existing cross-platform bug search techniques and further improves search accuracy. Unlike existing techniques that directly conduct searches based upon raw features (CFGs) from the binary code, we convert the CFGs into high-level numeric feature vectors. Compared with the CFG feature, high-level numeric feature vectors are more robust to code variation across different architectures, and can easily achieve realtime search by using state-of-the-art hashing techniques. We have implemented a bug search engine, Genius, and compared it with state-of-art bug search approaches. Experimental results show that Genius outperforms baseline approaches for various query loads in terms of speed and accuracy. We also evaluated Genius on a real-world dataset of 33,045 devices which was collected from public sources and our system. The experiment showed that Genius can finish a search within 1 second on average when performed over 8,126 firmware images of 420,558,702 functions. By only looking at the top 50 candidates in the search result, we found 38 potentially vulnerable firmware images across 5 vendors, and confirmed 23 of them by our manual analysis. We also found that it took only 0.1 seconds on average to finish searching for all 154 vulnerabilities in two latest commercial firmware images from D-LINK. 103 of them are potentially vulnerable in these images, and 16 of them were confirmed.

Korczynski, David, Yin, Heng.  2017.  Capturing Malware Propagations with Code Injections and Code-Reuse Attacks. Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security. :1691–1708.
Defending against malware involves analysing large amounts of suspicious samples. To deal with such quantities we rely heavily on automatic approaches to determine whether a sample is malicious or not. Unfortunately, complete and precise automatic analysis of malware is far from an easy task. This is because malware is often designed to contain several techniques and countermeasures specifically to hinder analysis. One of these techniques is for the malware to propagate through the operating system so as to execute in the context of benign processes. The malware does this by writing memory to a given process and then proceeds to have this memory execute. In some cases these propagations are trivial to capture because they rely on well-known techniques. However, in the cases where malware deploys novel code injection techniques, rely on code-reuse attacks and potentially deploy dynamically generated code, the problem of capturing a complete and precise view of the malware execution is non-trivial. In this paper we present a unified approach to tracing malware propagations inside the host in the context of code injections and code-reuse attacks. We also present, to the knowledge of the authors, the first approach to identifying dynamically generated code based on information-flow analysis. We implement our techniques in a system called Tartarus and match Tartarus with both synthetic applications and real-world malware. We compare Tartarus to previous works and show that our techniques substantially improve the precision for collecting malware execution traces, and that our approach can capture intrinsic characteristics of novel code injection techniques.
Xu, Xiaojun, Liu, Chang, Feng, Qian, Yin, Heng, Song, Le, Song, Dawn.  2017.  Neural Network-based Graph Embedding for Cross-Platform Binary Code Similarity Detection. Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security. :363–376.

The problem of cross-platform binary code similarity detection aims at detecting whether two binary functions coming from different platforms are similar or not. It has many security applications, including plagiarism detection, malware detection, vulnerability search, etc. Existing approaches rely on approximate graph-matching algorithms, which are inevitably slow and sometimes inaccurate, and hard to adapt to a new task. To address these issues, in this work, we propose a novel neural network-based approach to compute the embedding, i.e., a numeric vector, based on the control flow graph of each binary function, then the similarity detection can be done efficiently by measuring the distance between the embeddings for two functions. We implement a prototype called Gemini. Our extensive evaluation shows that Gemini outperforms the state-of-the-art approaches by large margins with respect to similarity detection accuracy. Further, Gemini can speed up prior art's embedding generation time by 3 to 4 orders of magnitude and reduce the required training time from more than 1 week down to 30 minutes to 10 hours. Our real world case studies demonstrate that Gemini can identify significantly more vulnerable firmware images than the state-of-the-art, i.e., Genius. Our research showcases a successful application of deep learning on computer security problems.

Barr, Joseph R., Shaw, Peter, Abu-Khzam, Faisal N., Yu, Sheng, Yin, Heng, Thatcher, Tyler.  2020.  Combinatorial Code Classification Amp; Vulnerability Rating. 2020 Second International Conference on Transdisciplinary AI (TransAI). :80–83.
Empirical analysis of source code of Android Fluoride Bluetooth stack demonstrates a novel approach of classification of source code and rating for vulnerability. A workflow that combines deep learning and combinatorial techniques with a straightforward random forest regression is presented. Two kinds of embedding are used: code2vec and LSTM, resulting in a distance matrix that is interpreted as a (combinatorial) graph whose vertices represent code components, functions and methods. Cluster Editing is then applied to partition the vertex set of the graph into subsets representing nearly complete subgraphs. Finally, the vectors representing the components are used as features to model the components for vulnerability risk.