Integration of Static and Dynamic Code Stylometry Analysis for Programmer De-Anonymization

Title: Integration of Static and Dynamic Code Stylometry Analysis for Programmer De-Anonymization
Publication Type: Conference Paper
Year of Publication: 2018
Authors: Wang, Ningfei; Ji, Shouling; Wang, Ting
Conference Name: Proceedings of the 11th ACM Workshop on Artificial Intelligence and Security
Conference Location: New York, NY, USA
ISBN Number: 978-1-4503-6004-3
Keywords: anonymity, code stylometry, composability, de-anonymization, dynamic analysis, Human Behavior, human factors, Metrics, pubcrawl, resilience, Resiliency, stylometry

De-anonymizing the authors of anonymous code (i.e., code stylometry) entails significant privacy and security implications. Most existing code stylometry methods rely solely on static (e.g., lexical, layout, and syntactic) features extracted from source code, while neglecting its key difference from regular text: it is executable! In this paper, we present Sundae, a novel code de-anonymization framework that integrates both static and dynamic stylometry analysis. Sundae departs from existing solutions in significant ways: (i) it requires far fewer static, hand-crafted features; (ii) it requires much less labeled data for training; and (iii) it can be readily extended to new programmers once their stylometry information becomes available. Through extensive evaluation on benchmark datasets, we demonstrate that Sundae delivers strong empirical performance. For example, under the setting of 229 programmers and 9 problems, it outperforms the state-of-the-art method by a margin of 45.65% on Python code de-anonymization. The empirical results highlight the integration of static and dynamic analysis as a promising direction for code stylometry research.

Citation Key: wangIntegrationStaticDynamic2018