
SoS Newsletter - Advanced Book Block



Data Sanitization 2015

For security researchers, privacy protection during data mining is a major concern.  Sharing information over the Internet or holding it in a database requires methods of sanitizing data so that personal information cannot be obtained.  The approaches described in the articles listed here include defenses against SQL injection and cross-site scripting, itemset hiding, differential privacy, taint analysis, and obfuscation of network data.  The work cited here was presented in 2015.

Abdullah, Hadi; Siddiqi, Ahsan; Bajaber, Fuad, "A Novel Approach of Data Sanitization by Noise Addition and Knowledge Discovery by Clustering," in Computer Networks and Information Security (WSCNIS), 2015 World Symposium on, pp. 1-9, 19-21 Sept. 2015. doi: 10.1109/WSCNIS.2015.7368283

Abstract: Security of published data is no less important than that of unpublished data or data that is not made public. Therefore, PII (Personally Identifiable Information) is removed and data sanitized when organizations recording large volumes of data publish that data. However, this approach to ensuring data privacy and security can result in a loss of utility of the published data for knowledge discovery. Therefore, a balance is required between privacy and the utility needs of published data. In this paper we study this delicate balance by evaluating four data mining clustering techniques for knowledge discovery and propose two privacy/utility quantification parameters. We subsequently perform a number of experiments to statistically identify which clustering technique is best suited to a desirable level of privacy/utility while noise is incrementally increased by simultaneously degrading data accuracy, completeness and consistency.

Keywords: Data privacy; Data security; Databases; Knowledge discovery; Privacy; data mining; data utility; noise; privacy; security (ID#: 15-8741)
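The noise-addition idea in this abstract can be sketched in a few lines: perturb each numeric value with zero-mean Gaussian noise and track a simple utility-loss proxy as the noise level grows. This is an illustrative sketch, not the authors' implementation; `add_noise` and `mean_absolute_error` are hypothetical helper names, and mean absolute perturbation is only one possible utility measure.

```python
import random

def add_noise(records, sigma, seed=0):
    """Return a sanitized copy of numeric records with Gaussian noise added."""
    rng = random.Random(seed)
    return [[v + rng.gauss(0.0, sigma) for v in row] for row in records]

def mean_absolute_error(original, noisy):
    """Utility-loss proxy: average absolute perturbation per value."""
    total, count = 0.0, 0
    for row_o, row_n in zip(original, noisy):
        for a, b in zip(row_o, row_n):
            total += abs(a - b)
            count += 1
    return total / count

data = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
for sigma in (0.1, 1.0, 5.0):
    noisy = add_noise(data, sigma)
    print(sigma, round(mean_absolute_error(data, noisy), 3))
```

Increasing `sigma` raises privacy (values are further from the truth) while the utility proxy degrades, which is the tradeoff the paper quantifies.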



Li, Bo; Vorobeychik, Yevgeniy; Li, Muqun; Malin, Bradley, "Iterative Classification for Sanitizing Large-Scale Datasets," in Data Mining (ICDM), 2015 IEEE International Conference on, pp. 841-846, 14-17 Nov. 2015

doi: 10.1109/ICDM.2015.11

Abstract: Cheap ubiquitous computing enables the collection of massive amounts of personal data in a wide variety of domains. Many organizations aim to share such data while obscuring features that could disclose entities or other sensitive information. Much of the data now collected exhibits weak structure (e.g., natural language text) and machine learning approaches have been developed to identify and remove sensitive entities in such data. Learning-based approaches are never perfect and relying upon them to sanitize data can leak sensitive information as a consequence. However, a small amount of risk is permissible in practice, and, thus, our goal is to balance the value of data published and the risk of an adversary discovering leaked sensitive information. We model data sanitization as a game between 1) a publisher who chooses a set of classifiers to apply to data and publishes only instances predicted to be non-sensitive and 2) an attacker who combines machine learning and manual inspection to uncover leaked sensitive entities (e.g., personal names). We introduce an iterative greedy algorithm for the publisher that provably executes no more than a linear number of iterations, and ensures a low utility for a resource-limited adversary. Moreover, using several real world natural language corpora, we illustrate that our greedy algorithm leaves virtually no automatically identifiable sensitive instances for a state-of-the-art learning algorithm, while sharing over 93% of the original data, and completes after at most 5 iterations.

Keywords: Data models; Inspection; Manuals; Natural languages; Predictive models; Publishing; Yttrium; Privacy preserving; game theory; weak structured data sanitization (ID#: 15-8742)
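The publisher's iterative loop described above can be illustrated with a toy sketch: repeatedly run a detector over the candidate publication, redact anything it flags, and stop at a fixed point. This is a deliberately simplified stand-in, with `looks_sensitive` playing the role of the learned classifier (here just exact and prefix matching against known sensitive names); the paper's actual algorithm and guarantees are far richer.

```python
def looks_sensitive(token, known_sensitive):
    # Stand-in for a learned classifier: flags exact matches plus
    # tokens sharing a 4-character prefix with a known sensitive entity.
    return any(token == s or token.startswith(s[:4]) for s in known_sensitive)

def iterative_sanitize(tokens, known_sensitive, max_iters=5):
    published = list(tokens)
    for _ in range(max_iters):
        flagged = [t for t in published if looks_sensitive(t, known_sensitive)]
        if not flagged:
            break  # fixed point: nothing left for the "attacker" to find
        published = [t for t in published if t not in flagged]
    return published

doc = ["alice", "visited", "boston", "with", "alicia"]
print(iterative_sanitize(doc, {"alice"}))
```

Note how "alicia" is also redacted via the prefix rule: the detector deliberately over-approximates, trading shared data volume for lower leakage risk, which mirrors the publisher/attacker balance in the abstract.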



Shanmugasundaram, G.; Ravivarman, S.; Thangavellu, P., "A Study on Removal Techniques of Cross-Site Scripting From Web Applications," in Computation of Power, Energy Information and Communication (ICCPEIC), 2015 International Conference on, pp. 0436-0442, 22-23 April 2015. doi: 10.1109/ICCPEIC.2015.7259498

Abstract: Cross site scripting (XSS) vulnerability is among the top 10 web application vulnerabilities based on a 2013 survey by the Open Web Application Security Project [9]. An XSS attack occurs when a web-based application takes input from users through web pages without validating it. An attacker or hacker uses this to insert malicious scripts into web pages through such inputs, so the scripts can perform malicious actions when a client visits the vulnerable web pages. This study concentrates on various security measures for the removal of XSS from web applications (defensive coding techniques), and the issues of each defensive technique based on those measures are reported in this paper.

Keywords: Internet; security of data; Web application vulnerability; XSS attack; cross-site scripting; removal technique; Encoding; HTML; Java; Uniform resource locators; cross site scripting; data sanitization; data validation; defensive coding technique; output escaping; scripting languages; vulnerabilities (ID#: 15-8743)
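One of the defensive coding techniques named in the keywords, output escaping, is easy to demonstrate: entity-encode user input at render time so that any embedded markup is displayed rather than executed. The sketch below uses Python's standard-library `html.escape`; `render_comment` is a hypothetical template function, not part of any surveyed system.

```python
import html

def render_comment(user_input):
    # Escape on output so embedded markup is shown as inert text,
    # never parsed as HTML or executed as script.
    return "<p>" + html.escape(user_input, quote=True) + "</p>"

print(render_comment('<script>alert("xss")</script>'))
```

The angle brackets and quotes come out as `&lt;`, `&gt;`, and `&quot;`, so the browser renders the attack payload as visible text instead of running it.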



Adebayo, J.; Kagal, L., "A Privacy Protection Procedure for Large Scale Individual Level Data," in Intelligence and Security Informatics (ISI), 2015 IEEE International Conference on, pp. 120-125, 27-29 May 2015. doi: 10.1109/ISI.2015.7165950

Abstract: We present a transformation procedure for large scale individual level data that produces output data in which no linear combinations of the resulting attributes can yield the original sensitive attributes from the transformed data. In doing this, our procedure eliminates all linear information regarding a sensitive attribute from the input data. The algorithm combines principal components analysis of the data set with orthogonal projection onto the subspace containing the sensitive attribute(s). The algorithm presented is motivated by applications where there is a need to drastically `sanitize' a data set of all information relating to sensitive attribute(s) before analysis of the data using a data mining algorithm. Sensitive attribute removal (sanitization) is often needed to prevent disparate impact and discrimination on the basis of race, gender, and sexual orientation in high stakes contexts such as determination of access to loans, credit, employment, and insurance. We show through experiments that our proposed algorithm outperforms other privacy preserving techniques by more than 20 percent in lowering the ability to reconstruct sensitive attributes from large scale data.

Keywords: data analysis; data mining; data privacy; principal component analysis; data mining algorithm; large scale individual level data; orthogonal projection; principal component analysis; privacy protection procedure; sanitization; sensitive attribute removal; Data privacy; Loans and mortgages; Noise; Prediction algorithms; Principal component analysis; Privacy; PCA; data mining; orthogonal projection; privacy preserving (ID#: 15-8744)
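The orthogonal-projection step at the heart of this abstract can be shown for a single sensitive attribute: subtract from each published column its component along the sensitive vector, so that no linear combination of published values recovers it. This is a one-attribute sketch under toy data (the paper combines the projection with PCA and handles sensitive subspaces); `project_out` and the example columns are illustrative.

```python
def project_out(column, sensitive):
    """Remove the component of `column` lying along `sensitive`, so the
    result is linearly uncorrelated with the sensitive attribute."""
    dot = sum(c * s for c, s in zip(column, sensitive))
    norm_sq = sum(s * s for s in sensitive)
    coef = dot / norm_sq
    return [c - coef * s for c, s in zip(column, sensitive)]

income = [50.0, 60.0, 80.0, 90.0]   # feature to publish
gender = [0.0, 1.0, 0.0, 1.0]       # sensitive attribute (toy encoding)
sanitized = project_out(income, gender)

# The sanitized column is orthogonal to the sensitive attribute:
residual = sum(x * s for x, s in zip(sanitized, gender))
print(sanitized, abs(residual) < 1e-9)
```

After projection, a linear model fed the sanitized column gets zero signal about the sensitive attribute, which is exactly the "no linear combinations can yield the original sensitive attributes" property claimed in the abstract.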



Yuan Hong; Vaidya, J.; Haibing Lu; Karras, P.; Goel, S., "Collaborative Search Log Sanitization: Toward Differential Privacy and Boosted Utility," in Dependable and Secure Computing, IEEE Transactions on, vol. 12, no. 5, pp. 504-518, Sept.-Oct. 1 2015. doi: 10.1109/TDSC.2014.2369034

Abstract: Severe privacy leakage in the AOL search log incident has attracted considerable worldwide attention. However, all the web users' daily search intents and behavior are collected in such data, which can be invaluable for researchers, data analysts and law enforcement personnel to conduct social behavior study [14], criminal investigation [5] and epidemics detection [10]. Thus, an important and challenging research problem is how to sanitize search logs with strong privacy guarantee and sufficiently retained utility. Existing approaches in search log sanitization are capable of only protecting the privacy under a rigorous standard [24] or maintaining good output utility [25]. To the best of our knowledge, there is little work that has fully resolved this tradeoff in the context of search logs, meeting a high standard of both requirements. In this paper, we propose a sanitization framework to tackle the above issue in a distributed manner. More specifically, our framework enables different parties to collaboratively generate search logs with boosted utility while satisfying Differential Privacy. In this scenario, two privacy-preserving objectives arise: first, the collaborative sanitization should satisfy differential privacy; second, the collaborative parties cannot learn any private information from each other. We present an efficient protocol - Collaborative sEarch Log Sanitization (CELS) - to meet both privacy requirements. Besides security/privacy and cost analysis, we demonstrate the utility and efficiency of our approach with real data sets.

Keywords: Internet; collaborative filtering; data privacy; protocols; security of data; AOL search log incident; CELS protocol; Collaborative sEarch Log Sanitization; Web user behavior; Web user daily search intent; boosted utility; collaborative search log generation; cost analysis; criminal investigation; data analysts; differential privacy; epidemics detection; law enforcement personnel; privacy guarantee; privacy leakage; privacy protection; privacy requirements; privacy-preserving objectives; private information; security; social behavior study; Collaboration; Data privacy; Diabetes; Equations; Google; Histograms; Privacy; Differential Privacy; Optimization; Sampling; Search Log; Search log; Secure Multiparty Computation; differential privacy; optimization; sampling; secure multiparty computation (ID#: 15-8745)
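A minimal building block for differentially private search-log release is the Laplace mechanism applied to a query-count histogram: each user-query pair changes one bin by at most 1, so sensitivity is 1 and adding Laplace(1/ε) noise per bin satisfies ε-differential privacy. This sketch shows only that single-party primitive; the paper's contribution is the multi-party protocol built on top of it, and the helper names here are illustrative.

```python
import math
import random

def laplace_noise(scale, rng):
    # Inverse-CDF sampling of Laplace(0, scale).
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def dp_histogram(query_counts, epsilon, seed=0):
    """Release query counts with Laplace(1/epsilon) noise per bin
    (sensitivity 1: one record changes one count by at most 1)."""
    rng = random.Random(seed)
    return {q: c + laplace_noise(1.0 / epsilon, rng)
            for q, c in query_counts.items()}

counts = {"flu symptoms": 120, "pizza near me": 300}
print(dp_histogram(counts, epsilon=1.0))
```

Smaller ε means larger noise scale and stronger privacy; the utility question the paper addresses is how collaborating parties can keep that noise from swamping the signal.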



Lwin Khin Shar; Briand, L.C.; Hee Beng Kuan Tan, "Web Application Vulnerability Prediction Using Hybrid Program Analysis and Machine Learning," in Dependable and Secure Computing, IEEE Transactions on, vol. 12, no. 6, pp. 688-707, Nov.-Dec. 1 2015. doi: 10.1109/TDSC.2014.2373377

Abstract: Due to limited time and resources, web software engineers need support in identifying vulnerable code. A practical approach to predicting vulnerable code would enable them to prioritize security auditing efforts. In this paper, we propose using a set of hybrid (static+dynamic) code attributes that characterize input validation and input sanitization code patterns and are expected to be significant indicators of web application vulnerabilities. Because static and dynamic program analyses complement each other, both techniques are used to extract the proposed attributes in an accurate and scalable way. Current vulnerability prediction techniques rely on the availability of data labeled with vulnerability information for training. For many real world applications, past vulnerability data is often not available or at least not complete. Hence, to address both situations where labeled past data is fully available or not, we apply both supervised and semi-supervised learning when building vulnerability predictors based on hybrid code attributes. Given that semi-supervised learning is entirely unexplored in this domain, we describe how to use this learning scheme effectively for vulnerability prediction. We performed empirical case studies on seven open source projects where we built and evaluated supervised and semi-supervised models. When cross validated with fully available labeled data, the supervised models achieve an average of 77 percent recall and 5 percent probability of false alarm for predicting SQL injection, cross site scripting, remote code execution and file inclusion vulnerabilities. With a low amount of labeled data, when compared to the supervised model, the semi-supervised model showed an average improvement of 24 percent higher recall and 3 percent lower probability of false alarm, thus suggesting semi-supervised learning may be a preferable solution for many real world applications where vulnerability data is missing.

Keywords: Internet; learning (artificial intelligence); program diagnostics; security of data; SQL injection; Web application vulnerability prediction; cross site scripting; dynamic program analyses; false alarm probability; file inclusion vulnerabilities; hybrid program analysis; hybrid static+dynamic code attributes; input sanitization code patterns; input validation code patterns; machine learning; open source projects; remote code execution; security auditing; semisupervised learning; static program analyses; vulnerability prediction techniques; vulnerability predictors; vulnerable code identification; vulnerable code prediction; Computer security; Data models; HTML; Predictive models; Semisupervised learning; Servers; Software protection; Vulnerability prediction; empirical study; input validation and sanitization; program analysis; security measures (ID#: 15-8746)



Yamaguchi, F.; Maier, A.; Gascon, H.; Rieck, K., "Automatic Inference of Search Patterns for Taint-Style Vulnerabilities," in Security and Privacy (SP), 2015 IEEE Symposium on, pp. 797-812, 17-21 May 2015. doi: 10.1109/SP.2015.54

Abstract: Taint-style vulnerabilities are a persistent problem in software development, as the recently discovered "Heartbleed" vulnerability strikingly illustrates. In this class of vulnerabilities, attacker-controlled data is passed unsanitized from an input source to a sensitive sink. While simple instances of this vulnerability class can be detected automatically, more subtle defects involving data flow across several functions or project-specific APIs are mainly discovered by manual auditing. Different techniques have been proposed to accelerate this process by searching for typical patterns of vulnerable code. However, all of these approaches require a security expert to manually model and specify appropriate patterns in practice. In this paper, we propose a method for automatically inferring search patterns for taint-style vulnerabilities in C code. Given a security-sensitive sink, such as a memory function, our method automatically identifies corresponding source-sink systems and constructs patterns that model the data flow and sanitization in these systems. The inferred patterns are expressed as traversals in a code property graph and enable efficiently searching for unsanitized data flows -- across several functions as well as with project-specific APIs. We demonstrate the efficacy of this approach in different experiments with 5 open-source projects. The inferred search patterns reduce the amount of code to inspect for finding known vulnerabilities by 94.9% and also enable us to uncover 8 previously unknown vulnerabilities.

Keywords: application program interfaces; data flow analysis; public domain software; security of data; software engineering; C code; attacker-controlled data; automatic inference; code property graph; data flow; data security; inferred search pattern; memory function; open-source project; project-specific API; search pattern; security-sensitive sink; sensitive sink; software development; source-sink system; taint-style vulnerability; Databases; Libraries; Payloads; Programming; security; Software; Syntactics; Clustering; Graph Databases; Vulnerabilities (ID#: 15-8747)
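The source-to-sink flow this abstract describes is the core of any taint analysis: mark data from input sources as tainted, propagate taint through assignments, clear it at sanitizers, and report tainted data reaching sinks. The toy interpreter below works over a hypothetical `(target, operation, args)` statement form; it is a conceptual illustration of the vulnerability class, not the paper's graph-traversal method.

```python
def taint_analysis(statements, sources, sinks, sanitizers):
    """Forward taint propagation over (target, op, args) statements;
    reports sink calls reached by unsanitized source data."""
    tainted, findings = set(), []
    for target, op, args in statements:
        if op in sources:
            tainted.add(target)              # fresh attacker-controlled value
        elif op in sanitizers:
            tainted.discard(target)          # sanitizer output is clean
        elif op in sinks:
            if any(a in tainted for a in args):
                findings.append((op, args))  # unsanitized flow into a sink
        elif any(a in tainted for a in args):
            tainted.add(target)              # taint propagates through ops
    return findings

prog = [
    ("name", "read_input", []),        # source
    ("query", "concat", ["name"]),     # taint propagates
    (None, "sql_exec", ["query"]),     # unsanitized flow into sink
]
print(taint_analysis(prog, {"read_input"}, {"sql_exec"}, {"escape"}))
```

Inserting an `("clean", "escape", ["name"])` step and executing the sink on `clean` instead would produce no findings, which is exactly the sanitization pattern the inferred traversals check for.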



Jinkun Pan; Xiaoguang Mao; Weishi Li, "Analyst-Oriented Taint Analysis by Taint Path Slicing and Aggregation," in Software Engineering and Service Science (ICSESS), 2015 6th IEEE International Conference on, pp. 145-148, 23-25 Sept. 2015. doi: 10.1109/ICSESS.2015.7339024

Abstract: Taint analysis determines whether values from untrusted or private sources may flow into security-sensitive or public sinks, and can discover many common security vulnerabilities in both Web and mobile applications. Static taint analysis detects suspicious data flows without running the application and achieves good coverage. However, most existing static taint analysis tools focus only on discovering taint paths from sources to sinks and do not address the requirements of analysts for sanitization checking and exploration. Sanitization can render a taint path harmless, but in many cases it must be checked or explored by analysts manually, and this process is very costly. During our preliminary study, we found that many statements along taint paths are not relevant to the sanitization, and that there are many redundancies among taint paths with the same source or sink. Based on these two observations, we have designed and implemented the taint path slicing and aggregation algorithms, aiming at mitigating the workload of the analysts and helping them get a better comprehension of the taint behaviors of target applications. Experimental evaluations on real-world applications show that our proposed algorithms can reduce the taint paths effectively and efficiently.

Keywords: program slicing; security of data; Web application; aggregation algorithms; analyst-oriented taint analysis; exploration; mobile application; public sink; sanitization check; security vulnerabilities; security-sensitive sink; static taint analysis tools; taint path slicing; Algorithm design and analysis; Androids; Filtering; Humanoid robots; Mobile applications; Redundancy; Security; analyst; taint analysis; taint path (ID#: 15-8748)



Jingyu Hua; Yue Gao; Sheng Zhong, "Differentially Private Publication of General Time-Serial Trajectory Data," in Computer Communications (INFOCOM), 2015 IEEE Conference on, pp. 549-557, 26 April-1 May 2015. doi: 10.1109/INFOCOM.2015.7218422

Abstract: Trajectory data, i.e., human mobility traces, is extremely valuable for a wide range of mobile applications. However, publishing raw trajectories without special sanitization poses serious threats to individual privacy. Recently, researchers have begun to leverage differential privacy to solve this challenge. Nevertheless, existing mechanisms make an implicit assumption that the trajectories contain a lot of identical prefixes or n-grams, which is not true in many applications. This paper aims to remove this assumption and propose a differentially private publishing mechanism for more general time-series trajectories. One natural solution is to generalize the trajectories, i.e., merge the locations at the same time. However, trivial merging schemes may breach differential privacy. We, thus, propose the first differentially-private generalization algorithm for trajectories, which leverages a carefully-designed exponential mechanism to probabilistically merge nodes based on trajectory distances. Afterwards, we propose another efficient algorithm to release trajectories after generalization in a differentially private manner. Our experiments with real-life trajectory data show that the proposed mechanism maintains high data utility and is scalable to large trajectory datasets.

Keywords: data privacy; time series; differential privacy; differentially private publication; differentially-private generalization algorithm; human mobility traces; time-serial trajectory data; time-series trajectories; Computers; Conferences; Data Publishing; Differential Privacy; Trajectory (ID#: 15-8749)
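The exponential mechanism mentioned in this abstract selects a candidate with probability proportional to exp(ε·score/(2·Δ)), where Δ is the score's sensitivity, so better-scoring merges are favored but no choice is ever deterministic. The sketch below shows the generic mechanism applied to a toy "which node pair to merge" choice using negated distances as scores; the scores, candidates, and helper name are illustrative, not the paper's actual scoring function.

```python
import math
import random

def exponential_mechanism(candidates, score, epsilon, sensitivity, rng):
    """Sample a candidate with probability proportional to
    exp(epsilon * score / (2 * sensitivity))."""
    weights = [math.exp(epsilon * score(c) / (2 * sensitivity)) for c in candidates]
    total = sum(weights)
    r = rng.random() * total
    acc = 0.0
    for c, w in zip(candidates, weights):
        acc += w
        if r <= acc:
            return c
    return candidates[-1]

# Toy use: privately choose which pair of trajectory nodes to merge,
# preferring close pairs (scores are negated distances, so higher = closer).
pairs = {("a", "b"): -1.0, ("a", "c"): -5.0, ("b", "c"): -2.0}
rng = random.Random(0)
choice = exponential_mechanism(list(pairs), lambda p: pairs[p],
                               epsilon=1.0, sensitivity=1.0, rng=rng)
print(choice)
```

Because every candidate retains nonzero probability, an observer cannot infer from the chosen merge which individual trajectories were present, which is what makes the merging step compatible with differential privacy.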



Farea, A.; Karci, A., "Applications of Association Rules Hiding Heuristic Approaches," in Signal Processing and Communications Applications Conference (SIU), 2015 23rd, pp. 2650-2653, 16-19 May 2015. doi: 10.1109/SIU.2015.7130434

Abstract: Data mining allows large database owners to extract useful knowledge that could not be deduced with traditional approaches like statistics. However, the results sometimes reveal sensitive knowledge or breach individual privacy. The term sanitization is given to the process of changing the original database into another one from which we can mine without exposing sensitive knowledge. In this paper, we give a detailed explanation of some heuristic approaches for this purpose. We applied them to a number of publicly available datasets and examined the results.

Keywords: data mining; data privacy; association rules hiding heuristic; data mining; database sanitization; Data mining; Itemsets; Data Mining; association rule; confidence; frequent pattern; itemset; sanitization; support; transaction (ID#: 15-8750)
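A typical heuristic in this family hides a sensitive itemset by deleting one of its items from supporting transactions until the itemset's support drops below the mining threshold, so the association rule can no longer be discovered. The sketch below is one simple variant with an arbitrary victim-item choice; real heuristics in the literature pick victims and transactions more carefully to minimize side effects on non-sensitive rules.

```python
def support(transactions, itemset):
    """Fraction of transactions containing every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def hide_itemset(transactions, sensitive, min_support):
    """Greedy heuristic: delete one item of the sensitive itemset from
    supporting transactions until its support falls below the threshold."""
    txns = [set(t) for t in transactions]          # work on a copy
    victim = next(iter(sensitive))                 # simplest victim choice
    for t in txns:
        if support(txns, sensitive) < min_support:
            break                                  # itemset is now hidden
        if sensitive <= t:
            t.discard(victim)
    return txns

db = [{"bread", "milk"}, {"bread", "milk", "eggs"}, {"bread"}, {"milk"}]
sanitized = hide_itemset(db, {"bread", "milk"}, min_support=0.3)
print(support(sanitized, {"bread", "milk"}))
```

The tradeoff the paper examines empirically is exactly this: each deletion hides the sensitive rule a bit more but also distorts the support of legitimate itemsets sharing the victim item.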



Jung-Woo Sohn; Jungwoo Ryoo, "Securing Web Applications with Better "Patches": An Architectural Approach for Systematic Input Validation with Security Patterns," in Availability, Reliability and Security (ARES), 2015 10th International Conference on, pp. 486-492, 24-27 Aug. 2015. doi: 10.1109/ARES.2015.106

Abstract: Some of the most rampant problems in software security originate from improper input validation. This is partly due to ad hoc approaches taken by software developers when dealing with user inputs. Therefore, it is a crucial research question in software security to ask how to effectively apply well-known input validation and sanitization techniques against security attacks exploiting the user input-related weaknesses found in software. This paper examines the current ways of how input validation is conducted in major open-source projects and attempts to confirm the main source of the problem as these ad hoc responses to the input validation-related attacks such as SQL injection and cross-site scripting (XSS) attacks through a case study. In addition, we propose a more systematic software security approach by promoting the adoption of proactive, architectural design-based solutions to move away from the current practice of chronic vulnerability-centric and reactive approaches.

Keywords: Internet; security of data; software architecture; SQL injection attack; Web application security; XSS attack; ad hoc approaches; architectural approach; architectural design-based solution; chronic vulnerability-centric approach; cross-site scripting attack; input validation-related attacks; proactive-based solution; reactive approach; sanitization techniques; security patterns; systematic input validation; systematic software security approach; user input-related weaknesses; architectural patterns; improper input validation; intercepting validator; software security (ID#: 15-8751)
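The "intercepting validator" pattern named in the keywords centralizes input validation at an architectural choke point instead of scattering ad hoc checks across pages. The sketch below is a minimal, hypothetical rendition of that pattern: every request parameter must match a registered whitelist rule before reaching application code, and unknown or non-matching parameters are rejected outright.

```python
import re

class InterceptingValidator:
    """Centralized validator applied to every request parameter before it
    reaches application code, replacing ad hoc per-page checks."""

    def __init__(self):
        self.rules = {}

    def add_rule(self, param, pattern):
        """Register a whitelist regex a parameter must fully match."""
        self.rules[param] = re.compile(pattern)

    def validate(self, params):
        for name, value in params.items():
            rule = self.rules.get(name)
            if rule is None or not rule.fullmatch(value):
                raise ValueError(f"rejected parameter: {name}")
        return params

v = InterceptingValidator()
v.add_rule("user_id", r"\d{1,10}")
v.add_rule("email", r"[^@\s]+@[^@\s]+\.[a-zA-Z]{2,}")
print(v.validate({"user_id": "42", "email": "a@b.com"}))
```

Because the validator sits in front of all handlers, a payload such as `"42; DROP TABLE users"` is rejected once, centrally, rather than depending on each page remembering to check, which is the proactive, design-level stance the paper advocates.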



Riboni, D.; Villani, A.; Vitali, D.; Bettini, C.; Mancini, L.V., "Obfuscation of Sensitive Data for Incremental Release of Network Flows," in Networking, IEEE/ACM Transactions on, vol. 23, no. 2, pp. 672-686, April 2015. doi: 10.1109/TNET.2014.2309011

Abstract: Large datasets of real network flows acquired from the Internet are an invaluable resource for the research community. Applications include network modeling and simulation, identification of security attacks, and validation of research results. Unfortunately, network flows carry extremely sensitive information, and this discourages the publication of those datasets. Indeed, existing techniques for network flow sanitization are vulnerable to different kinds of attacks, and solutions proposed for microdata anonymity cannot be directly applied to network traces. In our previous research, we proposed an obfuscation technique for network flows, providing formal confidentiality guarantees under realistic assumptions about the adversary's knowledge. In this paper, we identify the threats posed by the incremental release of network flows, we propose a novel defense algorithm, and we formally prove the achieved confidentiality guarantees. An extensive experimental evaluation of the algorithm for incremental obfuscation, carried out with billions of real Internet flows, shows that our obfuscation technique preserves the utility of flows for network traffic analysis.

Keywords: Internet; security of data; Internet; adversary knowledge; datasets; microdata anonymity; network flows incremental release; network traces; network traffic analysis; obfuscation technique; real network flows; research community; security attacks; sensitive data obfuscation; Data privacy; Encryption; IP networks; Knowledge engineering; Privacy; Uncertainty; Data sharing; network flow analysis; privacy; security (ID#: 15-8752)
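A common building block in network-flow sanitization is keyed, consistent pseudonymization of addresses: the same real host always maps to the same opaque token, preserving flow structure for traffic analysis without revealing identities. The sketch below uses an HMAC for this; note that the abstract argues such simple schemes are vulnerable to attack on their own, which is precisely what motivates the paper's stronger obfuscation with formal guarantees. Function names and the key are illustrative.

```python
import hashlib
import hmac

def pseudonymize_ip(ip, key):
    """Keyed, consistent pseudonym: identical addresses map to identical
    tokens, so flow relationships survive while hosts stay hidden."""
    digest = hmac.new(key, ip.encode(), hashlib.sha256).hexdigest()
    return "host-" + digest[:8]

key = b"release-2015-secret"
flows = [("10.0.0.1", "10.0.0.2", 443), ("10.0.0.1", "10.0.0.3", 80)]
obfuscated = [(pseudonymize_ip(src, key), pseudonymize_ip(dst, key), port)
              for src, dst, port in flows]
print(obfuscated)
```

Consistency is what keeps the data useful (both flows still visibly share a source), but it is also the property an adversary with fingerprinting knowledge can exploit, hence the need for the paper's defense against incremental-release attacks.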



Panja, B.; Gennarelli, T.; Meharia, P., "Handling Cross Site Scripting Attacks using Cache Check to Reduce Webpage Rendering Time with Elimination of Sanitization and Filtering in Light Weight Mobile Web Browser," in Mobile and Secure Services (MOBISECSERV), 2015 First Conference on, pp. 1-7, 20-21 Feb. 2015. doi: 10.1109/MOBISECSERV.2015.7072878

Abstract: In this paper we propose a new approach to prevent and detect potential cross-site scripting attacks. Our method, called Buffer Based Cache Check, will utilize both the server side and the client side to detect and prevent XSS attacks, and will require modification of both in order to function correctly. With Cache Check, instead of the server supplying a complete whitelist of all the known trusted scripts to the mobile browser every time a page is requested, the server will instead store a cache containing a validated "trusted" instance from the last time the page was rendered, which can be checked against the requested page for inconsistencies. We believe that with our proposed method rendering times in mobile browsers will be significantly reduced, as part of the checking is done by the server, leaving less checking for the mobile browser, which is slower than the server. With our method the entire checking process isn't dumped onto the mobile browser, and as a result the mobile browser should be able to render pages faster, as it is only checking for "untrusted" content, whereas with other approaches every single line of code is checked by the mobile browser, which increases rendering times.

Keywords: cache storage; client-server systems; mobile computing; online front-ends; security of data; trusted computing; Web page rendering time; XSS attacks; buffer based cache check; client-side; cross-site scripting attacks; filtering; light weight mobile Web browser; sanitization; server-side; trusted instance; untrusted content; Browsers; Filtering; Mobile communication; Radio access networks; Rendering (computer graphics); Security; Servers; Cross site scripting; cache check; mobile browser; webpage rendering (ID#: 15-8753)
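The cache-check idea can be illustrated with content fingerprints: the server caches hashes of the page's trusted scripts from the last validated render, and the client only inspects scripts whose hash is missing from that cache. This is a conceptual sketch with hypothetical helper names, not the authors' Buffer Based Cache Check implementation.

```python
import hashlib

def script_fingerprints(scripts):
    """Server-side cache: SHA-256 hashes of the page's trusted scripts
    from the last validated render."""
    return {hashlib.sha256(s.encode()).hexdigest() for s in scripts}

def untrusted_scripts(rendered_scripts, cached_fingerprints):
    """Client-side check: only scripts absent from the cache need
    inspection, instead of re-checking every line of the page."""
    return [s for s in rendered_scripts
            if hashlib.sha256(s.encode()).hexdigest() not in cached_fingerprints]

cache = script_fingerprints(["analytics();", "menu.init();"])
page = ["analytics();", "menu.init();", "stealCookies();"]
print(untrusted_scripts(page, cache))
```

Only the injected `stealCookies();` falls through to inspection; the two known scripts are skipped, which is where the claimed rendering-time savings on the slow mobile side come from.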



Reffett, C.; Fleck, D., "Securing Applications with Dyninst," in Technologies for Homeland Security (HST), 2015 IEEE International Symposium on, pp. 1-6, 14-16 April 2015. doi: 10.1109/THS.2015.7225297

Abstract: While significant bodies of work exist for sandboxing potentially malicious software and for sanitizing input, there has been little investigation into using binary editing software to perform either of these tasks. However, because binary editors do not require source code and can modify the software, they can generate secure versions of arbitrary binaries and provide better control over the software than existing approaches. In this paper, we explore the application of the binary editing library Dyninst to both the sandboxing and sanitization problems. We also create a prototype of a more advanced graphical tool to perform these tasks. Finally, we lay the groundwork for more complex and functional tools to solve these problems.

Keywords: program diagnostics; security of data; software libraries; Dyninst; arbitrary binaries; binary editing library; binary editing software; binary editors; graphical tool; input sanitization; malicious software; sandboxing; sanitization problems; secure versions; securing applications; Graphical user interfaces; Instruments; Libraries; Memory management; Monitoring; Runtime; Software; binary instrumentation; dyninst; input sanitization; sandboxing (ID#: 15-8754)



Articles listed on these pages have been found on publicly available internet pages and are cited with links to those pages. Some of the information included herein has been reprinted with permission from the authors or data repositories. Direct any requests via Email to for removal of the links or modifications to specific citations. Please include the ID# of the specific citation in your correspondence.