SoS Musings #45 - Privacy in Data Sharing and Analytics


Data privacy continues to raise concerns in every realm, including those involving consumers, scientific discovery, and analytics. The consulting firm McKinsey & Co. surveyed 1,000 North American consumers about their views on data collection, privacy, hacks, breaches, regulations, and communications, as well as their trust in the companies they support. The survey revealed low levels of consumer trust in data management and privacy: every sector received a trust rating of less than 50 percent for data protection. Data privacy is essential because the exposure of highly sensitive personal information could affect an individual's livelihood, reputation, relationships, and more. However, data is also a crucial source of information for researchers. For example, in the case of COVID-19, data must be shared among government authorities, companies, and researchers to support public health efforts, contact tracing, and other studies of the pandemic. Efforts are being made to enforce practices that support data privacy, notably through data privacy laws such as the European General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA), which are among the most prominent of the privacy laws enacted in more than a hundred countries. More technological advancements and methods are needed to help organizations and researchers share and effectively analyze datasets while maintaining the privacy of the data.

Studies have demonstrated that subjects in anonymized datasets can still be identified using certain methods and technology, posing a significant threat to individuals' privacy. Aleksandra Slavkovic, a professor of statistics and associate dean for graduate education at the Penn State Eberly College of Science, pointed out the growing risks to data privacy. These risks stem from continued technological advancements in data collection and record linkage, as well as the increased availability of various data sources that could be linked with a retained dataset. Methods for linking two datasets, such as those that contain voter records and health insurance data, have improved significantly. In one study, researchers at University College London and the Alan Turing Institute were able to identify any user in a group of 10,000 Twitter users with an accuracy of 96.7 percent using tweets, publicly available metadata from Twitter, and three different machine learning algorithms. In another study, published in Nature Communications, researchers from Imperial College London and Belgium's Université catholique de Louvain developed a machine learning model that they claim can accurately re-identify 99.98 percent of Americans in any anonymized dataset using 15 basic demographic attributes such as date of birth, gender, ethnicity, and marital status. Such discoveries call for the continued development of advanced methods that combat the de-anonymization of datasets and the re-identification of individuals represented in data.

There are various studies and developments aimed at bolstering data privacy. Slavkovic proposed the use of synthetic networks to satisfy the need to share confidential data for statistical analysis while maintaining the statistical accuracy and integrity of the data being shared. A new tool dubbed DoppelGANger, developed by researchers at Carnegie Mellon University's CyLab and IBM, puts the idea of synthetic network data into practice. DoppelGANger synthesizes new data that mimics the original dataset while ensuring that the sensitive information is omitted. The tool uses powerful machine learning models called Generative Adversarial Networks (GANs) to synthesize datasets that retain the statistics of the original training data, simplifying data sharing and preserving the privacy of sensitive information shared between companies, organizations, and governments. Google rolled out an open-source version of its differential privacy library to help organizations draw useful insights from datasets containing private and sensitive information while preventing individuals in the dataset from being re-identified or singled out. Differential privacy is an approach that combines random noise with data so that analysis results cannot be used to identify specific individuals. Google's Data Loss Prevention (DLP) tool applies machine learning capabilities, including image recognition, machine vision, natural language processing, and context analysis, to look for sensitive data in the cloud and automatically redact it. Google introduced an Application Programming Interface (API) for the tool in 2019, allowing administrators to use it outside of Google's ecosystem. The DLP API lets administrators customize the tool based on the specific types of data they want to identify, such as patient information or credit card numbers.
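
To make the idea of differential privacy described above more concrete, the following is a minimal Python sketch of the Laplace mechanism applied to a counting query. It is an illustration under stated assumptions, not the API of Google's differential privacy library: the function names, the patient records, and the choice of epsilon are all hypothetical.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def private_count(records, predicate, epsilon: float, rng: random.Random) -> float:
    """Return a noisy count of the records matching `predicate`.

    Adding or removing one person changes a count by at most 1 (sensitivity 1),
    so Laplace noise with scale 1/epsilon gives epsilon-differential privacy.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical records for illustration: (age, has_condition)
patients = [(34, True), (51, False), (29, True), (62, True), (45, False)]

rng = random.Random(0)  # seeded only so the demo is repeatable
noisy = private_count(patients, lambda p: p[1], epsilon=0.5, rng=rng)
print(f"noisy count of patients with the condition: {noisy:.2f}")
```

A smaller epsilon adds more noise and gives stronger privacy at the cost of accuracy. A real deployment would also need a cryptographically sound noise source and a budget that tracks how many queries have been answered.
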
According to Scott Ellis, a Product Manager on Google's Security & Privacy team with a focus on data privacy technology for the Google Cloud Platform, the main goals behind the development of the DLP tool are to classify, mask, and de-identify sensitive data so it can still be used for research or analysis without putting the privacy of individuals at risk. Cryptographers and data scientists at Google released Private Join and Compute, a secure Multi-Party Computation (MPC) tool that helps organizations work together on valuable research without revealing information about individuals in the datasets. This tool allows one party to gain aggregated insights about another party's data without either of them being able to learn about individuals represented by the datasets being used. First, both parties encrypt their data using private keys so that no one else can access or decipher it. Then the parties send their encrypted data to each other. To protect individual data, the Private Join and Compute tool employs a combination of two cryptographic techniques: private set intersection and homomorphic encryption. Private set intersection allows two parties to compute the intersection of their data (common data points, e.g., a location or an ID) while preventing the exposure of raw data to the other party. Homomorphic encryption enables computations to be performed on encrypted data without having to decrypt it, so the results of the computations remain encrypted and can be revealed only by the owner of the secret decryption key. IBM is also making efforts to change the game of data privacy within the commercial sector through the launch of its Security Homomorphic Encryption Services, which lets enterprises test the encryption scheme.
According to IBM, industry computing power has increased, and the algorithms used for Fully Homomorphic Encryption (FHE) have become more refined, allowing calculations to be performed fast enough for various types of real-world use cases and early experiments with businesses. IBM is also working on making FHE resistant to future quantum attacks. These developments and efforts call for further exploration.
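
The homomorphic property described above, computing on encrypted values without decrypting them, can be illustrated with a toy version of the Paillier cryptosystem. This is only a sketch under assumptions: Paillier is additively homomorphic rather than fully homomorphic, the primes are far too small to be secure, the function names are hypothetical, and nothing here is taken from Private Join and Compute or from IBM's services.

```python
import math
import random


def keygen(p: int, q: int):
    """Build a toy Paillier key pair from two primes (demo-sized, insecure)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # valid shortcut because the generator is g = n + 1
    return n, (lam, mu)


def encrypt(n: int, message: int, rng: random.Random) -> int:
    """Encrypt 0 <= message < n as (n+1)^m * r^n mod n^2 for a random r."""
    n_sq = n * n
    while True:
        r = rng.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, message, n_sq) * pow(r, n, n_sq)) % n_sq


def add_encrypted(n: int, c1: int, c2: int) -> int:
    """Multiplying ciphertexts adds the underlying plaintexts (mod n)."""
    return (c1 * c2) % (n * n)


def decrypt(n: int, private_key, ciphertext: int) -> int:
    """Recover the plaintext using the secret values (lam, mu)."""
    lam, mu = private_key
    x = pow(ciphertext, lam, n * n)
    return ((x - 1) // n * mu) % n


# Demo: the key owner encrypts two hypothetical values; anyone holding only
# the public modulus n can combine them without seeing 120 or 45.
n, private_key = keygen(1009, 1013)
rng = random.Random(42)  # not cryptographically secure; for the demo only
c_total = add_encrypted(n, encrypt(n, 120, rng), encrypt(n, 45, rng))
print(decrypt(n, private_key, c_total))  # prints 165
```

The party doing the addition sees only ciphertexts, and only the holder of the private key learns the total. Production systems use keys thousands of bits long and a secure random source.
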

We must continue to develop and improve methods that allow us to share sensitive data for research purposes while ensuring the accuracy and integrity of data, as well as the privacy of individuals in the data.