Biblio

Filters: Keyword is big data privacy
Suwansrikham, P., She, K..  2018.  Asymmetric Secure Storage Scheme for Big Data on Multiple Cloud Providers. 2018 IEEE 4th International Conference on Big Data Security on Cloud (BigDataSecurity), IEEE International Conference on High Performance and Smart Computing, (HPSC) and IEEE International Conference on Intelligent Data and Security (IDS). :121-125.
Cloud computing has recently emerged alongside big data, and the two technologies go together. Because of the enormous size of big data, it is impractical to store it in local storage; even if we wanted to store it locally, we would have to spend a great deal of money to build a big data center. One way to save money is to store big data in a cloud storage service, which provides users with space and security for their files. However, relying on a single cloud storage service may cause trouble for the customer. The CSP may stop its service at any time, so it is too risky for a data owner to host a file with only a single CSP. Also, the CSP is a third party that the user has to trust without verification. After deploying a file to a CSP, the user does not know who accesses it. Even if the CSP provides a security mechanism to prevent outsider attacks, how can the user ensure that no insider attack steals or corrupts the file? This research proposes a way to minimize the risk and to ensure data privacy and access control. The big data file is split into chunks that are distributed to multiple cloud storage providers. Even if there is an insider attack, the attacker gets only part of the file and cannot reconstruct the whole file. After the file is split, metadata is generated. The metadata keeps the chunk information, including chunk locations, access paths, and the data owner's username and password for connecting to each CSP. An asymmetric security concept is applied in this research: the metadata is encrypted and transferred to the user who requests access to the file. File access, monitoring, and metadata transfer are functions of dew computing, an intermediate server between the users and the cloud service.
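The chunk-and-distribute idea in this abstract can be illustrated with a minimal sketch. It is not the authors' scheme: the round-robin placement, the `ChunkRecord` fields, and the in-memory dictionaries standing in for cloud providers are all assumptions made for illustration, and the asymmetric encryption of the metadata is omitted entirely.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class ChunkRecord:
    """Hypothetical metadata entry describing where one chunk lives."""
    index: int
    provider: str
    sha256: str
    size: int


def split_and_assign(data: bytes, chunk_size: int, providers: list[str]):
    """Split data into fixed-size chunks and place them round-robin.

    Returns (stores, metadata). `stores` is an in-memory stand-in for the
    providers; a real system would upload each chunk instead.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not providers:
        raise ValueError("at least one provider is required")
    stores: dict[str, dict[int, bytes]] = {p: {} for p in providers}
    metadata: list[ChunkRecord] = []
    for index, start in enumerate(range(0, len(data), chunk_size)):
        chunk = data[start:start + chunk_size]
        provider = providers[index % len(providers)]  # assumed placement rule
        stores[provider][index] = chunk
        digest = hashlib.sha256(chunk).hexdigest()
        metadata.append(ChunkRecord(index, provider, digest, len(chunk)))
    return stores, metadata


def reconstruct(stores, metadata) -> bytes:
    """Reassemble the file from the metadata, verifying each chunk's hash."""
    parts = []
    for rec in sorted(metadata, key=lambda r: r.index):
        chunk = stores[rec.provider][rec.index]
        if hashlib.sha256(chunk).hexdigest() != rec.sha256:
            raise ValueError(f"chunk {rec.index} failed its integrity check")
        parts.append(chunk)
    return b"".join(parts)
```

In this sketch, a party that sees only one provider's store holds a strict subset of the chunks, and only the holder of the metadata knows how to fetch and order all of them, which is why the abstract treats the metadata as the item to encrypt.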
Mito, M., Murata, K., Eguchi, D., Mori, Y., Toyonaga, M..  2018.  A Data Reconstruction Method for The Big-Data Analysis. 2018 9th International Conference on Awareness Science and Technology (iCAST). :319-323.
In recent years, the big-data approach has become important within various business operations and sales judgment tactics. Contrarily, numerous privacy problems limit the progress of their analysis technologies. To mitigate such problems, this paper proposes several privacy-preserving methods, i.e., anonymization, extreme value record elimination, fully encrypted analysis, and so on. However, privacy-cracking fears still remain that prevent the open use of big-data by other, external organizations. We propose a big-data reconstruction method that does not intrinsically use privacy data. The method uses only the statistical features of big-data, i.e., its attribute histograms and their correlation coefficients. To verify whether valuable information can be extracted using this method, we evaluate the data by using Self Organizing Map (SOM) as one of the big-data analysis tools. The results show that the same pieces of information are extracted from our data and the big-data.
Leung, C. K., Hoi, C. S. H., Pazdor, A. G. M., Wodi, B. H., Cuzzocrea, A..  2018.  Privacy-Preserving Frequent Pattern Mining from Big Uncertain Data. 2018 IEEE International Conference on Big Data (Big Data). :5101-5110.
As we are living in the era of big data, high volumes of wide varieties of data which may be of different veracity (e.g., precise data, imprecise and uncertain data) are easily generated or collected at a high velocity in many real-life applications. Embedded in these big data is valuable knowledge and useful information, which can be discovered by big data science solutions. As a popular data science task, frequent pattern mining aims to discover implicit, previously unknown and potentially useful information and valuable knowledge in terms of sets of frequently co-occurring merchandise items and/or events. Many of the existing frequent pattern mining algorithms use a transaction-centric mining approach to find frequent patterns from precise data. However, there are situations in which an item-centric mining approach is more appropriate, and there are also situations in which data are imprecise and uncertain. Hence, in this paper, we present an item-centric algorithm for mining frequent patterns from big uncertain data. In recent years, big data have been gaining attention from the research community as driven by relevant technological innovations (e.g., clouds) and novel paradigms (e.g., social networks). As big data are typically published online to support knowledge management and fruition processes, these big data are usually handled by multiple owners with possible secure multi-part computation issues. Thus, privacy and security of big data have become a fundamental problem in this research context. In this paper, we present not only an item-centric algorithm for mining frequent patterns from big uncertain data, but also a privacy-preserving algorithm. In other words, we present in this paper a privacy-preserving item-centric algorithm for mining frequent patterns from big uncertain data. Results of our analytical and empirical evaluation show the effectiveness of our algorithm in mining frequent patterns from big uncertain data in a privacy-preserving manner.
Cuzzocrea, A., Damiani, E..  2018.  Pedigree-Ing Your Big Data: Data-Driven Big Data Privacy in Distributed Environments. 2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID). :675-681.
This paper introduces a general framework for supporting data-driven privacy-preserving big data management in distributed environments, such as emerging Cloud settings. The proposed framework can be viewed as an alternative to classical approaches where the privacy of big data is ensured via security-inspired protocols that check several (protocol) layers in order to achieve the desired privacy. Unfortunately, this injects considerable computational overheads in the overall process, thus introducing relevant challenges to be considered. Our approach instead tries to recognize the "pedigree" of suitable summary data representatives computed on top of the target big data repositories, hence avoiding computational overheads due to protocol checking. We also provide a relevant realization of the framework above, the so-called Data-dRIven aggregate-PROvenance privacy-preserving big Multidimensional data (DRIPROM) framework, which specifically considers multidimensional data as the case of interest.
Khan, Latifur.  2018.  Big IoT Data Stream Analytics with Issues in Privacy and Security. Proceedings of the Fourth ACM International Workshop on Security and Privacy Analytics. :22-22.
Internet of Things (IoT) Devices are monitoring and controlling systems that interact with the physical world by collecting, processing and transmitting data using the internet. IoT devices include home automation systems, smart grid, transportation systems, medical devices, building controls, manufacturing and industrial control systems. With the increase in deployment of IoT devices, there will be a corresponding increase in the amount of data generated by these devices, therefore, resulting in the need of large scale data processing systems to process and extract information for efficient and impactful decision making that will improve quality of living.
El Haourani, Lamia, Elkalam, Anas Abou, Ouahman, Abdelah Ait.  2018.  Knowledge Based Access Control a Model for Security and Privacy in the Big Data. Proceedings of the 3rd International Conference on Smart City Applications. :16:1-16:8.
The most popular features of Big Data revolve around the so-called "3V" criterion: Volume, Variety and Velocity. Big Data is based on the massive collection and in-depth analysis of personal data, with a view to profiling, or even marketing and commercialization, thus violating citizens' privacy and the security of their data. In this article we discuss security and privacy solutions in the context of Big Data. We then focus on access control and present our new model called Knowledge-based Access Control (KBAC); this strengthens the access control already deployed in the target company (e.g., based on "RBAC" role or "ABAC" attributes for example) by adding a semantic access control layer. KBAC offers thinner access control, tailored to Big Data, with effective protection against intrusion attempts and unauthorized data inferences.
Yan, Li, Hao, Xiaowei, Cheng, Zelei, Zhou, Rui.  2018.  Cloud Computing Security and Privacy. Proceedings of the 2018 International Conference on Big Data and Computing. :119-123.
Cloud computing is an emerging technology that can provide organizations, enterprises and governments with cheaper, more convenient and larger scale computing resources. However, cloud computing will bring potential risks and threats, especially on security and privacy. We make a survey on potential threats and risks and existing solutions on cloud security and privacy. We also put forward some problems to be addressed to provide a secure cloud computing environment.
Guerriero, Michele, Tamburri, Damian Andrew, Di Nitto, Elisabetta.  2018.  Defining, Enforcing and Checking Privacy Policies in Data-Intensive Applications. Proceedings of the 13th International Conference on Software Engineering for Adaptive and Self-Managing Systems. :172-182.
The rise of Big Data is leading to an increasing demand for large-scale data-intensive applications (DIAs), which have to analyse massive amounts of personal data (e.g. customers' location, cars' speed, people heartbeat, etc.), some of which can be sensitive, meaning that its confidentiality has to be protected. In this context, DIA providers are responsible for enforcing privacy policies that account for the privacy preferences of data subjects as well as for general privacy regulations. This is the case, for instance, of data brokers, i.e. companies that continuously collect and analyse data in order to provide useful analytics to their clients. Unfortunately, the enforcement of privacy policies in modern DIAs tends to become cumbersome because (i) the number of policies can easily explode, depending on the number of data subjects, (ii) policy enforcement has to autonomously adapt to the application context, thus, requiring some non-trivial runtime reasoning, and (iii) designing and developing modern DIAs is complex per se. For the above reasons, we need specific design and runtime methods enabling so called privacy-by-design in a Big Data context. In this article we propose an approach for specifying, enforcing and checking privacy policies on DIAs designed according to the Google Dataflow model and we show that the enforcement approach behaves correctly in the considered cases and introduces a performance overhead that is acceptable given the requirements of a typical DIA.
Gursoy, Mehmet Emre, Liu, Ling, Truex, Stacey, Yu, Lei, Wei, Wenqi.  2018.  Utility-Aware Synthesis of Differentially Private and Attack-Resilient Location Traces. Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security. :196-211.
As mobile devices and location-based services become increasingly ubiquitous, the privacy of mobile users' location traces continues to be a major concern. Traditional privacy solutions rely on perturbing each position in a user's trace and replacing it with a fake location. However, recent studies have shown that such point-based perturbation of locations is susceptible to inference attacks and suffers from serious utility losses, because it disregards the moving trajectory and continuity in full location traces. In this paper, we argue that privacy-preserving synthesis of complete location traces can be an effective solution to this problem. We present AdaTrace, a scalable location trace synthesizer with three novel features: provable statistical privacy, deterministic attack resilience, and strong utility preservation. AdaTrace builds a generative model from a given set of real traces through a four-phase synthesis process consisting of feature extraction, synopsis learning, privacy and utility preserving noise injection, and generation of differentially private synthetic location traces. The output traces crafted by AdaTrace preserve utility-critical information existing in real traces, and are robust against known location trace attacks. We validate the effectiveness of AdaTrace by comparing it with three state of the art approaches (ngram, DPT, and SGLT) using real location trace datasets (Geolife and Taxi) as well as a simulated dataset of 50,000 vehicles in Oldenburg, Germany. AdaTrace offers up to 3-fold improvement in trajectory utility, and is orders of magnitude faster than previous work, while preserving differential privacy and attack resilience.
Colombo, Pietro, Ferrari, Elena.  2018.  Access Control in the Era of Big Data: State of the Art and Research Directions. Proceedings of the 23rd ACM on Symposium on Access Control Models and Technologies. :185-192.
Data security and privacy issues are magnified by the volume, the variety, and the velocity of Big Data and by the lack, up to now, of a standard data model and related data manipulation language. In this paper, we focus on one of the key data security services, that is, access control, by highlighting the differences with traditional data management systems and describing a set of requirements that any access control solution for Big Data platforms may fulfill. We then describe the state of the art and discuss open research issues.
Mehrpouyan, H., Azpiazu, I. M., Pera, M. S..  2017.  Measuring Personality for Automatic Elicitation of Privacy Preferences. 2017 IEEE Symposium on Privacy-Aware Computing (PAC). :84–95.

The increasing complexity and ubiquity in user connectivity, computing environments, information content, and software, mobile, and web applications transfers the responsibility of privacy management to the individuals. Hence, making it extremely difficult for users to maintain the intelligent and targeted level of privacy protection that they need and desire, while simultaneously maintaining their ability to optimally function. Thus, there is a critical need to develop intelligent, automated, and adaptable privacy management systems that can assist users in managing and protecting their sensitive data in the increasingly complex situations and environments that they find themselves in. This work is a first step in exploring the development of such a system, specifically how user personality traits and other characteristics can be used to help automate determination of user sharing preferences for a variety of user data and situations. The Big-Five personality traits of openness, conscientiousness, extroversion, agreeableness, and neuroticism are examined and used as inputs into several popular machine learning algorithms in order to assess their ability to elicit and predict user privacy preferences. Our results show that the Big-Five personality traits can be used to significantly improve the prediction of user privacy preferences in a number of contexts and situations, and so using machine learning approaches to automate the setting of user privacy preferences has the potential to greatly reduce the burden on users while simultaneously improving the accuracy of their privacy preferences and security.

Wang, Y., Rawal, B., Duan, Q..  2017.  Securing Big Data in the Cloud with Integrated Auditing. 2017 IEEE International Conference on Smart Cloud (SmartCloud). :126–131.

In this paper, we review big data characteristics and security challenges in the cloud and visit different cloud domains and security regulations. We propose using integrated auditing for secure data storage and transaction logs, real-time compliance and security monitoring, regulatory compliance, data environment, identity and access management, infrastructure auditing, availability, privacy, legality, cyber threats, and granular auditing to achieve big data security. We apply a stochastic process model to conduct security analyses in availability and mean time to security failure. Potential future works are also discussed.

Heifetz, A., Mugunthan, V., Kagal, L..  2017.  Shade: A Differentially-Private Wrapper for Enterprise Big Data. 2017 IEEE International Conference on Big Data (Big Data). :1033–1042.

Enterprises usually provide strong controls to prevent cyberattacks and inadvertent leakage of data to external entities. However, in the case where employees and data scientists have legitimate access to analyze and derive insights from the data, there are insufficient controls and employees are usually permitted access to all information about the customers of the enterprise including sensitive and private information. Though it is important to be able to identify useful patterns of one's customers for better customization and service, customers' privacy must not be sacrificed to do so. We propose an alternative — a framework that will allow privacy preserving data analytics over big data. In this paper, we present an efficient and scalable framework for Apache Spark, a cluster computing framework, that provides strong privacy guarantees for users even in the presence of an informed adversary, while still providing high utility for analysts. The framework, titled Shade, includes two mechanisms — SparkLAP, which provides Laplacian perturbation based on a user's query and SparkSAM, which uses the contents of the database itself in order to calculate the perturbation. We show that the performance of Shade is substantially better than earlier differential privacy systems without loss of accuracy, particularly when run on datasets small enough to fit in memory, and find that SparkSAM can even exceed performance of an identical nonprivate Spark query.
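The Laplacian perturbation mentioned in this abstract can be illustrated with the textbook Laplace mechanism for a counting query. The sketch below is not Shade's SparkLAP code and does not use Spark; the function names and the sensitivity of 1 (which holds for a count that changes by at most one when a single record is added or removed) are assumptions for illustration.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one sample from Laplace(0, scale).

    The difference of two independent exponential variables with mean
    `scale` follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon: float, rng: random.Random) -> float:
    """Return a noisy count of records matching `predicate`.

    A counting query has sensitivity 1, so the noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only to make the demo repeatable
    ages = [23, 35, 41, 29, 52, 67, 38]
    print(private_count(ages, lambda age: age >= 40, epsilon=0.5, rng=rng))
```

A smaller `epsilon` gives a larger noise scale and therefore stronger privacy at the cost of accuracy. A production system would also need a cryptographically sound noise source and a privacy budget across queries, neither of which is shown here.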

Palanisamy, B., Li, C., Krishnamurthy, P..  2017.  Group Privacy-Aware Disclosure of Association Graph Data. 2017 IEEE International Conference on Big Data (Big Data). :1043–1052.

In the age of Big Data, we are witnessing a huge proliferation of digital data capturing our lives and our surroundings. Data privacy is a critical barrier to data analytics and privacy-preserving data disclosure becomes a key aspect to leveraging large-scale data analytics due to serious privacy risks. Traditional privacy-preserving data publishing solutions have focused on protecting individual's private information while considering all aggregate information about individuals as safe for disclosure. This paper presents a new privacy-aware data disclosure scheme that considers group privacy requirements of individuals in bipartite association graph datasets (e.g., graphs that represent associations between entities such as customers and products bought from a pharmacy store) where even aggregate information about groups of individuals may be sensitive and need protection. We propose the notion of $ε_g$-Group Differential Privacy that protects sensitive information of groups of individuals at various defined group protection levels, enabling data users to obtain the level of information entitled to them. Based on the notion of group privacy, we develop a suite of differentially private mechanisms that protect group privacy in bipartite association graphs at different group privacy levels based on specialization hierarchies. We evaluate our proposed techniques through extensive experiments on three real-world association graph datasets and our results demonstrate that the proposed techniques are effective, efficient and provide the required guarantees on group privacy.

Shi, Y., Piao, C., Zheng, L..  2017.  Differential-Privacy-Based Correlation Analysis in Railway Freight Service Applications. 2017 International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery (CyberC). :35–39.

With the development of the modern logistics industry, railway freight enterprises, as the main traditional logistics enterprises, face many problems with their service mode. In the era of big data, coordinated development and sharing of information resources have become requirements of the times for railway freight enterprises, while how to protect the privacy of citizens has become one of the public's main concerns. To prevent the disclosure or abuse of citizens' private information, their privacy needs to be preserved in the process of information opening and sharing. However, most existing privacy-preserving models cannot resist attacks backed by continuously growing background knowledge. This paper presents a method of applying differential privacy to protect associated data, so that it can be shared as railway freight service association information. First, the original service data is sliced by the optimal shard length; then a differential privacy method and the Apriori algorithm are used to add Laplace noise to the candidate sets. Thus citizens' private information can be protected even if the attacker has strong background knowledge. Finally, the associated data is shared with railway information resource partners. The steps and usefulness of the discussed privacy preservation method are illustrated by an example.

Guan, Z., Si, G., Du, X., Liu, P., Zhang, Z., Zhou, Z..  2017.  Protecting User Privacy Based on Secret Sharing with Fault Tolerance for Big Data in Smart Grid. 2017 IEEE International Conference on Communications (ICC). :1–6.

In smart grid, large quantities of data are collected from various applications, such as smart metering, substation state monitoring, electric energy data acquisition, and smart home. Big data acquired in smart grid applications is usually sensitive. For instance, in order to dispatch accurately and support dynamic pricing, many smart meters are installed at users' houses to collect real-time data, but all of this collected data is related to user privacy. In this paper, we propose a data aggregation scheme based on secret sharing with fault tolerance in smart grid, which ensures that the control center gets the integrated data without revealing users' privacy. Meanwhile, we also consider fault tolerance during the data aggregation. Finally, we analyze the security of our scheme and carry out experiments to validate the results.
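How secret sharing lets a control center learn only an aggregate can be illustrated with plain additive secret sharing. This is a generic sketch, not the scheme proposed in the paper: the modulus, the function names, and the use of several non-colluding aggregators are assumptions, and the fault tolerance that the paper adds is not modeled.

```python
import secrets

# Assumed modulus: a Mersenne prime comfortably larger than any realistic total.
MODULUS = 2**61 - 1


def make_shares(value: int, n_shares: int) -> list[int]:
    """Split `value` into n additive shares that sum to it modulo MODULUS."""
    if n_shares < 2:
        raise ValueError("need at least two shares to hide the value")
    shares = [secrets.randbelow(MODULUS) for _ in range(n_shares - 1)]
    last = (value - sum(shares)) % MODULUS
    return shares + [last]


def aggregate(readings: list[int], n_aggregators: int) -> int:
    """Sum meter readings via shares so no aggregator sees a raw reading.

    Each meter sends one share to each aggregator. Every aggregator adds
    up the shares it received, and the control center adds the partial sums.
    """
    partial_sums = [0] * n_aggregators
    for reading in readings:
        for i, share in enumerate(make_shares(reading, n_aggregators)):
            partial_sums[i] = (partial_sums[i] + share) % MODULUS
    return sum(partial_sums) % MODULUS


if __name__ == "__main__":
    meter_readings = [12, 30, 7, 51]  # hypothetical consumption values
    print(aggregate(meter_readings, n_aggregators=3))
```

Each individual share is uniformly random, so a single aggregator learns nothing about any one meter; only the combined partial sums reveal the total. If one share is lost the total cannot be recovered, which is the kind of failure a fault-tolerant scheme has to address.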

Zebboudj, S., Brahami, R., Mouzaia, C., Abbas, C., Boussaid, N., Omar, M..  2017.  Big Data Source Location Privacy and Access Control in the Framework of IoT. 2017 5th International Conference on Electrical Engineering - Boumerdes (ICEE-B). :1–5.

In recent years, we have observed the development of several connected and mobile devices intended for daily use. This development has come with many risks that might not be perceived by the users. These threats are compromising when an unauthorized entity has access to private big data generated through the user objects in the Internet of Things. In the literature, many solutions have been proposed in order to protect the big data, but security remains a challenging issue. This work is carried out with the aim of providing a solution for access control to the big data and for securing the localization of their generator objects. The proposed models are based on Attribute Based Encryption, the CHORD protocol and μTESLA. Through simulations, we compare our solutions to competing protocols and show their efficiency in terms of relevant criteria.

Nosouhi, M. R., Pham, V. V. H., Yu, S., Xiang, Y., Warren, M..  2017.  A Hybrid Location Privacy Protection Scheme in Big Data Environment. GLOBECOM 2017 - 2017 IEEE Global Communications Conference. :1–6.

Location privacy has become a significant challenge of big data. Particularly, by the advantage of big data handling tools availability, huge location data can be managed and processed easily by an adversary to obtain user private information from Location-Based Services (LBS). So far, many methods have been proposed to preserve user location privacy for these services. Among them, dummy-based methods have various advantages in terms of implementation and low computation costs. However, they suffer from the spatiotemporal correlation issue when users submit consecutive requests. To solve this problem, a practical hybrid location privacy protection scheme is presented in this paper. The proposed method filters out the correlated fake location data (dummies) before submissions. Therefore, the adversary can not identify the user's real location. Evaluations and experiments show that our proposed filtering technique significantly improves the performance of existing dummy-based methods and enables them to effectively protect the user's location privacy in the environment of big data.

Chen, D., Irwin, D..  2017.  Weatherman: Exposing Weather-Based Privacy Threats in Big Energy Data. 2017 IEEE International Conference on Big Data (Big Data). :1079–1086.

Smart energy meters record electricity consumption and generation at fine-grained intervals, and are among the most widely deployed sensors in the world. Energy data embeds detailed information about a building's energy-efficiency, as well as the behavior of its occupants, which academia and industry are actively working to extract. In many cases, either inadvertently or by design, these third-parties only have access to anonymous energy data without an associated location. The location of energy data is highly useful and highly sensitive information: it can provide important contextual information to improve big data analytics or interpret their results, but it can also enable third-parties to link private behavior derived from energy data with a particular location. In this paper, we present Weatherman, which leverages a suite of analytics techniques to localize the source of anonymous energy data. Our key insight is that energy consumption data, as well as wind and solar generation data, largely correlates with weather, e.g., temperature, wind speed, and cloud cover, and that every location on Earth has a distinct weather signature that uniquely identifies it. Weatherman represents a serious privacy threat, but also a potentially useful tool for researchers working with anonymous smart meter data. We evaluate Weatherman's potential in both areas by localizing data from over one hundred smart meters using a weather database that includes data from over 35,000 locations. Our results show that Weatherman localizes coarse (one-hour resolution) energy consumption, wind, and solar data to within 16.68km, 9.84km, and 5.12km, respectively, on average, which is more accurate using much coarser resolution data than prior work on localizing only anonymous solar data using solar signatures.

Hassoon, I. A., Tapus, N., Jasim, A. C..  2017.  Enhance Privacy in Big Data and Cloud via Diff-Anonym Algorithm. 2017 16th RoEduNet Conference: Networking in Education and Research (RoEduNet). :1–5.

The main issue with big data in the cloud is that the data always needs to be processed or used by a third party. It is very important for data owners and clients to trust, and to have a guarantee of privacy for, the information stored in the cloud or analyzed as big data. The privacy models studied in previous research showed that privacy infringement for big data happened because of limitation, privacy guarantee rate or dissemination of accurate data which is obtainable in the data set. In addition, there are various privacy models. In order to determine the best and most appropriate model to apply in the future, one that also guarantees big data privacy, it is necessary to invest in research and study. In the next part, we survey some of the privacy models in order to determine the advantages and disadvantages of each model for privacy assurance for big data in the cloud. The present study also proposes a combined Diff-Anonym algorithm (K-anonymity and differential models) to provide data anonymity while guaranteeing a balance between ambiguity of private data and clarity of general data.

Al-Shomrani, A., Fathy, F., Jambi, K..  2017.  Policy enforcement for big data security. 2017 2nd International Conference on Anti-Cyber Crimes (ICACC). :70–74.

Security and privacy of big data become challenging as data grows and becomes accessible to more and more clients. Large-scale data storage is becoming a necessity for healthcare, business segments, government departments, scientific endeavors and individuals. Our research focuses on privacy and security and on how we can make sure that big data is secured. Managing security policy is a challenge that our framework will handle for big data. Privacy policy needs to be integrated, flexible, context-aware and customizable. We will build a framework to receive data from customers, analyze the data received, extract the privacy policy, and then identify the sensitive data. In this paper we present the techniques for privacy policy that will be used in our framework.

Forgó, Nikolaus.  2016.  Privacy and Internet Governance. Proceedings of the 8th ACM Conference on Web Science. :6–6.

Many of the game-changing innovations the Internet brought and continues to bring to all of our daily professional and private lives come with privacy-related costs. The more day-to-day activities are based on the Internet, the more personal data are generated, collected, stored and used. Big Data, Internet of Things, cyber-physical-systems and similar trends will be based on even more personal information all of us use and produce constantly. Three major points are to be noted here: First, there is no common European or even worldwide agreement whether and in how far these collections need to be limited. There is, though, no common privacy law, neither in Europe nor worldwide. Second, laws that do exist constantly fail in steering the developments. Technology innovations come so fast, are so disruptive and so market-demand driven, that an ex-post control by law and courts constantly comes late and/or is circumvented and/or ignored. Third, lack of consensus and lack of steering lead to huge data accumulations and market monopolies built up very quickly and held by very few companies working on a global level with data driven business models. These early movers are in many cases in very dominant market positions making it not only more difficult to regulate their behavior but also to keep the markets open for future competitors. This workshop will evaluate current European and international attempts to deal with this situation. Although all four panelists have a legal background, the meeting will be less interested in an in-depth review of existing laws and their impact, but more in the underlying technological and ethical principles (and their inconsistencies) leading to the situation described. Specific attention will be attributed to technology driven attempts to deal with the situation, such as privacy by design, privacy by default, usable privacy etc.

Eom, Chris Soo-Hyun, Lee, Wookey, Lee, James Jung-Hun.  2016.  Spammer Detection for Real-time Big Data Graphs. Proceedings of the Sixth International Conference on Emerging Databases: Technologies, Applications, and Theory. :51–60.

In recent years, the prodigious explosion of social network services may trigger new business models. However, it also has negative aspects such as personal information leaks and spamming. Among conventional spam detection approaches, studies based on vertex degrees or the Local Clustering Coefficient have produced false positive results, so that normal vertices can be classified as spammers. In this paper, we propose a novel approach that employs the circuit structure in social networks, and we demonstrate the advantages of our work through an experiment.

Wilder, Nathan, Smith, Jared M., Mockus, Audris.  2016.  Exploring a Framework for Identity and Attribute Linking Across Heterogeneous Data Systems. Proceedings of the 2nd International Workshop on BIG Data Software Engineering. :19–25.

Online-activity-generated digital traces provide opportunities for novel services and unique insights as demonstrated in, for example, research on mining software repositories. The inability to link these traces within and among systems, such as Twitter, GitHub, or Reddit, inhibits advances in this area. Furthermore, no single approach to integrate data from these disparate sources is likely to work. We aim to design Foreseer, an extensible framework, to design and evaluate identity matching techniques for public, large, and low-accuracy operational data. Foreseer consists of three functionally independent components designed to address the issues of discovery and preparation, storage and representation, and analysis and linking of traces from disparate online sources. The framework includes a domain specific language for manipulating traces, generating insights, and building novel services. We have applied it in a pilot study of roughly 10TB of data from Twitter, Reddit, and StackExchange including roughly 6M distinct entities and, using basic matching techniques, found roughly 83,000 matches among these sources. We plan to add further entity extraction and identification algorithms and data from other sources, and to design tools for facilitating dynamic ingestion and tagging of incoming data on a more robust infrastructure using Apache Spark or another distributed processing framework. We will then evaluate the utility and effectiveness of the framework in applications ranging from identifying malicious contributors in software repositories to the evaluation of the utility of privacy preservation schemes.

Jordan, Michael I..  2016.  On Computational Thinking, Inferential Thinking and Data Science. Proceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures. :47–47.

The rapid growth in the size and scope of datasets in science and technology has created a need for novel foundational perspectives on data analysis that blend the inferential and computational sciences. That classical perspectives from these fields are not adequate to address emerging problems in "Big Data" is apparent from their sharply divergent nature at an elementary level: in computer science, the growth of the number of data points is a source of "complexity" that must be tamed via algorithms or hardware, whereas in statistics, the growth of the number of data points is a source of "simplicity" in that inferences are generally stronger and asymptotic results can be invoked. On a formal level, the gap is made evident by the lack of a role for computational concepts such as "runtime" in core statistical theory and the lack of a role for statistical concepts such as "risk" in core computational theory. I present several research vignettes aimed at bridging computation and statistics, including the problem of inference under privacy and communication constraints, and ways to exploit parallelism so as to trade off the speed and accuracy of inference.