
Title: Iterative Classification for Sanitizing Large-Scale Datasets
Publication Type: Conference Paper
Year of Publication: 2015
Authors: Li, Bo; Vorobeychik, Yevgeniy; Li, Muqun; Malin, Bradley
Conference Name: SIAM International Conference on Data Mining
Date Published: 05/2015
Conference Location: Vancouver, British Columbia, Canada
Keywords: Foundations, Hierarchical Coordination and Control, Resilient Systems, Science of decentralized security, science of security, SURE Project

Cheap ubiquitous computing enables the collection of massive amounts of personal data in a wide variety of domains. Many organizations aim to share such data while obscuring features that could disclose identities or other sensitive information. Much of the data now collected exhibits weak structure (e.g., natural language text), and machine learning approaches have been developed to identify and remove sensitive entities in such data. Learning-based approaches are never perfect, and relying upon them to sanitize data can leak sensitive information as a consequence. However, a small amount of risk is permissible in practice, and, thus, our goal is to balance the value of data published and the risk of an adversary discovering leaked sensitive information. We model data sanitization as a game between 1) a publisher who chooses a set of classifiers to apply to data and publishes only instances predicted to be non-sensitive, and 2) an attacker who combines machine learning and manual inspection to uncover leaked sensitive entities (e.g., personal names). We introduce an iterative greedy algorithm for the publisher that provably executes no more than a linear number of iterations and ensures a low utility for a resource-limited adversary. Moreover, using several real-world natural language corpora, we illustrate that our greedy algorithm leaves virtually no automatically identifiable sensitive instances for a state-of-the-art learning algorithm, while sharing over 93% of the original data, and completes after at most 5 iterations.
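The publisher's iterative loop described in the abstract (train a classifier, withhold instances predicted sensitive, repeat until nothing new is flagged) can be sketched as follows. This is a minimal toy illustration, not the paper's exact algorithm; the threshold "classifier" and the numeric toy data are invented for demonstration:

```python
def sanitize(instances, train_classifier, max_iters=10):
    """Iterative greedy sanitization sketch: repeatedly fit a classifier
    on the remaining data, drop every instance the current classifier
    flags as sensitive, and stop once a round removes nothing."""
    published = list(instances)
    for _ in range(max_iters):
        clf = train_classifier(published)
        kept = [item for item in published if not clf(item)]
        if len(kept) == len(published):  # no new sensitive instances found
            break
        published = kept
    return published


def train_threshold(data):
    """Hypothetical stand-in classifier for illustration only: instances
    are (value, is_sensitive) pairs, and 'training' picks the midpoint
    between the largest safe value and the smallest sensitive one."""
    safe = [v for v, s in data if not s]
    sens = [v for v, s in data if s]
    if not sens:
        return lambda item: False  # nothing sensitive left to flag
    cut = (max(safe) + min(sens)) / 2 if safe else min(sens)
    return lambda item: item[0] >= cut


# Invented toy corpus: (value, is_sensitive) pairs.
labeled = [(0.1, False), (0.2, False), (0.8, True),
           (0.9, True), (0.55, True), (0.4, False)]
```

Here the first round removes all three sensitive instances, and the second round flags nothing, so the loop terminates; the paper proves the analogous loop runs for at most a linear number of iterations.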

Citation Key: liiterative