Spotlight on Lablet Research #16 - Securing Safety-Critical Machine Learning Algorithms

Spotlight on Lablet Research #16

Project: Securing Safety-Critical Machine Learning Algorithms

Lablet: Carnegie Mellon University

This project seeks to understand how classifiers can be spoofed, including in ways that are not apparent to human observers, and how the robustness of classifiers can be enhanced, including through explanations of model behavior.

Machine-learning algorithms, especially classifiers, are becoming prevalent in safety- and security-critical applications. The susceptibility of some types of classifiers to evasion by adversarial input data has been explored in domains such as spam filtering, but the rapid adoption of machine learning across application domains amplifies the extent and severity of this vulnerability landscape. The CMU research team is led by Principal Investigator (PI) Lujo Bauer and Co-PI Matt Fredrikson, and includes researchers from the Sub-Lablet University of North Carolina. The team proposes to 1) develop predictive metrics that characterize the degree to which a neural-network-based image classifier used in domains such as face recognition can be evaded through attacks that are both practically realizable and inconspicuous; and 2) develop methods that make these classifiers, and the applications that incorporate them, robust to such interference. The researchers first examine how images can be manipulated to fool classifiers in various ways while escaping the suspicion of even human onlookers. They then develop explanations of model behavior to help identify the presence of a likely attack, and will generalize these explanations to harden models against future attacks.
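To make the notion of an evasion attack concrete, the following is a minimal sketch of a one-step gradient-based perturbation (in the style of the well-known fast gradient sign method) against a toy linear softmax classifier. This is only an illustration of the general idea; the project's attacks are physically realizable and far more constrained than this, and the toy model, weights, and epsilon below are assumptions for the example.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over logits z."""
    e = np.exp(z - z.max())
    return e / e.sum()

def fgsm(x, y, W, b, eps):
    """One fast-gradient-sign step on input x against label y."""
    p = softmax(W @ x + b)
    grad_logits = p.copy()
    grad_logits[y] -= 1.0          # d(cross-entropy)/d(logits)
    grad_x = W.T @ grad_logits     # backprop through the linear layer
    return x + eps * np.sign(grad_x)

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 8))    # toy 3-class classifier on 8 features
b = np.zeros(3)
x = rng.standard_normal(8)
y = int(np.argmax(W @ x + b))      # treat the current prediction as the label
x_adv = fgsm(x, y, W, b, eps=0.5)  # perturbed input with raised loss on y
```

Because cross-entropy over a linear model is convex in the input, the signed-gradient step is guaranteed not to decrease the loss on the true label; with a large enough eps, the perturbed input is often misclassified.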

The researchers have continued their study of network pruning techniques to enhance robustness. Their approach is based on attribution measurements of internal neurons and aims to identify features that are pivotal for adversarial examples but not necessary for the correct classification of normal inputs. Experiments to date suggest that it is possible to identify and remove such non-robust features for norm-bounded attacks, but further suggest that physical attacks may rely on different sets of features that cannot be pruned without significant impact on model performance.
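The pruning idea above can be sketched in a few lines. The team's actual attribution measure is not specified here; as a stand-in assumption, this example scores each neuron by the gap between its mean activation on adversarial versus clean inputs and masks the highest-scoring ones.

```python
import numpy as np

def prune_mask(acts_clean, acts_adv, k):
    """Mask the k neurons most over-active on adversarial inputs.

    acts_clean, acts_adv: (num_inputs, num_neurons) activation matrices.
    Returns a 0/1 mask to multiply into the layer's activations.
    """
    gap = acts_adv.mean(axis=0) - acts_clean.mean(axis=0)
    mask = np.ones(gap.shape[0])
    mask[np.argsort(gap)[-k:]] = 0.0   # zero the top-k "non-robust" neurons
    return mask

# Synthetic activations: neurons 3 and 7 fire mostly under attack.
rng = np.random.default_rng(1)
acts_clean = rng.random((32, 16))
acts_adv = acts_clean.copy()
acts_adv[:, [3, 7]] += 2.0
mask = prune_mask(acts_clean, acts_adv, k=2)   # zeros exactly neurons 3 and 7
```

In this synthetic setup the mask removes exactly the neurons that adversarial inputs rely on; the team's finding is that this works for norm-bounded attacks but breaks down for physical attacks, whose features overlap more with those needed for normal classification.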

The team's research focused on revising and extending previous results on n-ML (which provides robustness to evasion attacks via ensembles of topologically diversified classifiers) and attacks on malware detection. They discovered that the utility of, and best approaches for tuning, n-ML differ depending on the dataset and its complexity; e.g., tunings of n-ML that lead to particularly good performance on MNIST lead to suboptimal performance on GTSRB, and vice versa. In the process, they are discovering tunings of n-ML that further improve its performance compared to other approaches for making classifiers more robust. The team has improved their algorithm for attacking malware classifiers to better gauge the impact of small changes to a binary (e.g., swapping a pair of instructions) on whether the binary is classified as benign or malware. They have also identified engineering errors in libraries that their code builds on; fixing these should result in significantly lower resource usage, enabling more comprehensive experiments.
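At a high level, an ensemble defense like n-ML accepts an input only when enough of its diversified classifiers agree, and otherwise flags it as a likely evasion attempt. The voting rule and agreement threshold below are illustrative assumptions, not the project's actual tuning.

```python
from collections import Counter

def ensemble_decide(votes, min_agreement):
    """Return the majority label if agreement is high enough, else reject."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= min_agreement else "rejected"

# A benign input tends to produce agreement across the ensemble...
clean_votes = [1, 1, 1, 2, 1]
# ...while an adversarial input tends to split the diversified models,
# because each classifier was trained to respond differently to it.
adv_votes = [1, 2, 3, 2, 1]

accepted = ensemble_decide(clean_votes, min_agreement=4)   # -> 1
flagged = ensemble_decide(adv_votes, min_agreement=4)      # -> "rejected"
```

Tuning amounts to choosing the ensemble size and agreement threshold, which is where the dataset-dependent trade-offs the team observed (e.g., MNIST versus GTSRB) come in.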

In addition to continuing research on n-ML and on evasion attacks against malware classifiers, they are also working on improving experimental infrastructure and methodology, which will enable more automated, much quicker experiments with malware evasion and a more comprehensive examination of the effects of hyperparameter tuning on both attacks and defenses.

The team is also continuing to investigate the leakage of training data from models, and is currently examining the increased risk that explainability techniques pose to this leakage, as well as the role that robustness plays in this risk. They seek to determine the feasibility of leakage attacks in black-box settings, where explainability methods are most likely to be used.

Background on this project can be found here.