Visible to the public Biblio

Filters: Keyword is Pipelines  [Clear All Filters]
2021-06-24
Angermeir, Florian, Voggenreiter, Markus, Moyón, Fabiola, Mendez, Daniel.  2021.  Enterprise-Driven Open Source Software: A Case Study on Security Automation. 2021 IEEE/ACM 43rd International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP). :278—287.
Agile and DevOps are widely adopted by the industry. Hence, integrating security activities with industrial practices, such as continuous integration (CI) pipelines, is necessary to detect security flaws and adhere to regulators’ demands early. In this paper, we analyze automated security activities in CI pipelines of enterprise-driven open source software (OSS). This shall allow us, in the long-run, to better understand the extent to which security activities are (or should be) part of automated pipelines. In particular, we mine publicly available OSS repositories and survey a sample of project maintainers to better understand the role that security activities and their related tools play in their CI pipelines. To increase transparency and allow other researchers to replicate our study (and to take different perspectives), we further disclose our research artefacts.Our results indicate that security activities in enterprise-driven OSS projects are scarce and protection coverage is rather low. Only 6.83% of the analyzed 8,243 projects apply security automation in their CI pipelines, even though maintainers consider security to be rather important. This alerts industry to keep the focus on vulnerabilities of 3rd Party software and it opens space for other improvements of practice which we outline in this manuscript.
2021-05-25
Javidi, Giti, Sheybani, Ehsan.  2018.  K-12 Cybersecurity Education, Research, and Outreach. 2018 IEEE Frontiers in Education Conference (FIE). :1—5.
This research-to-practice work-in-progress addresses a new approach to cybersecurity education. The cyber security skills shortage is reaching prevalent proportions. The consensus in the STEM community is that the problem begins at k-12 schools with too few students interested in STEM subjects. One way to ensure a larger pipeline in cybersecurity is to train more high school teachers to not only teach cybersecurity in their schools or integrate cybersecurity concepts in their classrooms but also to promote IT security as an attractive career path. The proposed research will result in developing a unique and novel curriculum and scalable program in the area of cybersecurity and a set of powerful tools for a fun learning experience in cybersecurity education. In this project, we are focusing on the potential to advance research agendas in cybersecurity and train the future generation with cybersecurity skills and answer fundamental research questions that still exist in the blended learning methodologies for cybersecurity education and assessment. Leadership and entrepreneurship skills are also added to the mix to prepare students for real-world problems. Delivery methods, timing, format, pacing and outcomes alignment will all be assessed to provide a baseline for future research and additional synergy and integration with existing cybersecurity programs to expand or leverage for new cybersecurity and STEM educational research. This is a new model for cybersecurity education, leadership, and entrepreneurship and there is a possibility of a significant leap towards a more advanced cybersecurity educational methodology using this model. The project will also provide a prototype for innovation coupled with character-building and ethical leadership.
2021-05-05
Rathod, Jash, Joshi, Chaitali, Khochare, Janavi, Kazi, Faruk.  2020.  Interpreting a Black-Box Model used for SCADA Attack detection in Gas Pipelines Control System. 2020 IEEE 17th India Council International Conference (INDICON). :1—7.
Various Machine Learning techniques are considered to be "black-boxes" because of their limited interpretability and explainability. This cannot be afforded, especially in the domain of Cyber-Physical Systems, where there can be huge losses of infrastructure of industries and Governments. Supervisory Control And Data Acquisition (SCADA) systems need to detect and be protected from cyber-attacks. Thus, we need to adopt approaches that make the system secure, can explain predictions made by model, and interpret the model in a human-understandable format. Recently, Autoencoders have shown great success in attack detection in SCADA systems. Numerous interpretable machine learning techniques are developed to help us explain and interpret models. The work presented here is a novel approach to use techniques like Local Interpretable Model-Agnostic Explanations (LIME) and Layer-wise Relevance Propagation (LRP) for interpretation of Autoencoder networks trained on a Gas Pipelines Control System to detect attacks in the system.
Hallaji, Ehsan, Razavi-Far, Roozbeh, Saif, Mehrdad.  2020.  Detection of Malicious SCADA Communications via Multi-Subspace Feature Selection. 2020 International Joint Conference on Neural Networks (IJCNN). :1—8.
Security maintenance of Supervisory Control and Data Acquisition (SCADA) systems has been a point of interest during recent years. Numerous research works have been dedicated to the design of intrusion detection systems for securing SCADA communications. Nevertheless, these data-driven techniques are usually dependant on the quality of the monitored data. In this work, we propose a novel feature selection approach, called MSFS, to tackle undesirable quality of data caused by feature redundancy. In contrast to most feature selection techniques, the proposed method models each class in a different subspace, where it is optimally discriminated. This has been accomplished by resorting to ensemble learning, which enables the usage of multiple feature sets in the same feature space. The proposed method is then utilized to perform intrusion detection in smaller subspaces, which brings about efficiency and accuracy. Moreover, a comparative study is performed on a number of advanced feature selection algorithms. Furthermore, a dataset obtained from the SCADA system of a gas pipeline is employed to enable a realistic simulation. The results indicate the proposed approach extensively improves the detection performance in terms of classification accuracy and standard deviation.
2021-05-03
Shen, Shen, Tedrake, Russ.  2020.  Sampling Quotient-Ring Sum-of-Squares Programs for Scalable Verification of Nonlinear Systems. 2020 59th IEEE Conference on Decision and Control (CDC). :2535–2542.
This paper presents a novel method, combining new formulations and sampling, to improve the scalability of sum-of-squares (SOS) programming-based system verification. Region-of-attraction approximation problems are considered for polynomial, polynomial with generalized Lur'e uncertainty, and rational trigonometric multi-rigid-body systems. Our method starts by identifying that Lagrange multipliers, traditionally heavily used for S-procedures, are a major culprit of creating bloated SOS programs. In light of this, we exploit inherent system properties-continuity, convexity, and implicit algebraic structure-and reformulate the problems as quotient-ring SOS programs, thereby eliminating all the multipliers. These new programs are smaller, sparser, less constrained, yet less conservative. Their computation is further improved by leveraging a recent result on sampling algebraic varieties. Remarkably, solution correctness is guaranteed with just a finite (in practice, very small) number of samples. Altogether, the proposed method can verify systems well beyond the reach of existing SOS-based approaches (32 states); on smaller problems where a baseline is available, it computes tighter solution 2-3 orders of magnitude faster.
2021-03-22
Hikawa, H..  2020.  Nested Pipeline Hardware Self-Organizing Map for High Dimensional Vectors. 2020 27th IEEE International Conference on Electronics, Circuits and Systems (ICECS). :1–4.
This paper proposes a hardware Self-Organizing Map (SOM) for high dimensional vectors. The proposed SOM is based on nested architecture with pipeline processing. Due to homogeneous modular structure, the nested architecture provides high expandability. The original nested SOM was designed to handle low-dimensional vectors with fully parallel computation, and it yielded very high performance. In this paper, the architecture is extended to handle much higher dimensional vectors by using sequential computation, which requires multiple clocks to process a single vector. To increase the performance, the proposed architecture employs pipeline computation, in which search of winner neuron and weight vector update are carried out simultaneously. Operable clock frequency for the system was 60 MHz, and its throughput reached 15012 million connection updates per second (MCUPS).
2021-03-15
Brauckmann, A., Goens, A., Castrillon, J..  2020.  ComPy-Learn: A toolbox for exploring machine learning representations for compilers. 2020 Forum for Specification and Design Languages (FDL). :1–4.
Deep Learning methods have not only shown to improve software performance in compiler heuristics, but also e.g. to improve security in vulnerability prediction or to boost developer productivity in software engineering tools. A key to the success of such methods across these use cases is the expressiveness of the representation used to abstract from the program code. Recent work has shown that different such representations have unique advantages in terms of performance. However, determining the best-performing one for a given task is often not obvious and requires empirical evaluation. Therefore, we present ComPy-Learn, a toolbox for conveniently defining, extracting, and exploring representations of program code. With syntax-level language information from the Clang compiler frontend and low-level information from the LLVM compiler backend, the tool supports the construction of linear and graph representations and enables an efficient search for the best-performing representation and model for tasks on program code.
2021-03-04
Sejr, J. H., Zimek, A., Schneider-Kamp, P..  2020.  Explainable Detection of Zero Day Web Attacks. 2020 3rd International Conference on Data Intelligence and Security (ICDIS). :71—78.

The detection of malicious HTTP(S) requests is a pressing concern in cyber security, in particular given the proliferation of HTTP-based (micro-)service architectures. In addition to rule-based systems for known attacks, anomaly detection has been shown to be a promising approach for unknown (zero-day) attacks. This article extends existing work by integrating outlier explanations for individual requests into an end-to-end pipeline. These end-to-end explanations reflect the internal working of the pipeline. Empirically, we show that found explanations coincide with manually labelled explanations for identified outliers, allowing security professionals to quickly identify and understand malicious requests.

2021-02-08
Wang Xiao, Mi Hong, Wang Wei.  2010.  Inner edge detection of PET bottle opening based on the Balloon Snake. 2010 2nd International Conference on Advanced Computer Control. 4:56—59.

Edge detection of bottle opening is a primary section to the machine vision based bottle opening detection system. This paper, taking advantage of the Balloon Snake, on the PET (Polyethylene Terephthalate) images sampled at rotating bottle-blowing machine producing pipelines, extracts the opening. It first uses the grayscale weighting average method to calculate the centroid as the initial position of Snake and then based on the energy minimal theory, it extracts the opening. Experiments show that compared with the conventional edge detection and center location methods, Balloon Snake is robust and can easily step over the weak noise points. Edge extracted thorough Balloon Snake is more integral and continuous which provides a guarantee to correctly judge the opening.

2020-12-11
Kumar, S., Vasthimal, D. K..  2019.  Raw Cardinality Information Discovery for Big Datasets. 2019 IEEE 5th Intl Conference on Big Data Security on Cloud (BigDataSecurity), IEEE Intl Conference on High Performance and Smart Computing, (HPSC) and IEEE Intl Conference on Intelligent Data and Security (IDS). :200—205.
Real-time discovery of all different types of unique attributes within unstructured data is a challenging problem to solve when dealing with multiple petabytes of unstructured data volume everyday. Popular discovery solutions such as the creation of offline jobs to uniquely identify attributes or running aggregation queries on raw data sets limits real time discovery use-cases and often results into poor resource utilization. The discovery information must be treated as a parallel problem to just storing raw data sets efficiently onto back-end big data systems. Solving the discovery problem by creating a parallel discovery data store infrastructure has multiple benefits as it allows such to channel the actual search queries against the raw data set in much more funneled manner instead of being widespread across the entire data sets. Such focused search queries and data separation are far more performant and requires less compute and memory footprint.
2020-12-01
Chen, S., Hu, W., Li, Z..  2019.  High Performance Data Encryption with AES Implementation on FPGA. 2019 IEEE 5th Intl Conference on Big Data Security on Cloud (BigDataSecurity), IEEE Intl Conference on High Performance and Smart Computing, (HPSC) and IEEE Intl Conference on Intelligent Data and Security (IDS). :149—153.

Nowadays big data has getting more and more attention in both the academic and the industrial research. With the development of big data, people pay more attention to data security. A significant feature of big data is the large size of the data. In order to improve the encryption speed of the large size of data, this paper uses the deep pipeline and full expansion technology to implement the AES encryption algorithm on FPGA. Achieved throughput of 31.30 Gbps with a minimum latency of 0.134 us. This design can quickly encrypt large amounts of data and provide technical support for the development of big data.

2020-11-23
Sutton, A., Samavi, R., Doyle, T. E., Koff, D..  2018.  Digitized Trust in Human-in-the-Loop Health Research. 2018 16th Annual Conference on Privacy, Security and Trust (PST). :1–10.
In this paper, we propose an architecture that utilizes blockchain technology for enabling verifiable trust in collaborative health research environments. The architecture supports the human-in-the-loop paradigm for health research by establishing trust between participants, including human researchers and AI systems, by making all data transformations transparent and verifiable by all participants. We define the trustworthiness of the system and provide an analysis of the architecture in terms of trust requirements. We then evaluate our architecture by analyzing its resiliency to common security threats and through an experimental realization.
2020-10-05
Hahn, Sebastian, Reineke, Jan.  2018.  Design and Analysis of SIC: A Provably Timing-Predictable Pipelined Processor Core. 2018 IEEE Real-Time Systems Symposium (RTSS). :469—481.

We introduce the strictly in-order core (SIC), a timing-predictable pipelined processor core. SIC is provably timing compositional and free of timing anomalies. This enables precise and efficient worst-case execution time (WCET) and multi-core timing analysis. SIC's key underlying property is the monotonicity of its transition relation w.r.t. a natural partial order on its microarchitectural states. This monotonicity is achieved by carefully eliminating some of the dependencies between consecutive instructions from a standard in-order pipeline design. SIC preserves most of the benefits of pipelining: it is only about 6-7% slower than a conventional pipelined processor. Its timing predictability enables orders-of-magnitude faster WCET and multi-core timing analysis than conventional designs.

2020-08-24
Quinn, Ren, Holguin, Nico, Poster, Ben, Roach, Corey, Merwe, Jacobus Kobus Van der.  2019.  WASPP: Workflow Automation for Security Policy Procedures. 2019 15th International Conference on Network and Service Management (CNSM). :1–5.

Every day, university networks are bombarded with attempts to steal the sensitive data of the various disparate domains and organizations they serve. For this reason, universities form teams of information security specialists called a Security Operations Center (SOC) to manage the complex operations involved in monitoring and mitigating such attacks. When a suspicious event is identified, members of the SOC are tasked to understand the nature of the event in order to respond to any damage the attack might have caused. This process is defined by administrative policies which are often very high-level and rarely systematically defined. This impedes the implementation of generalized and automated event response solutions, leading to specific ad hoc solutions based primarily on human intuition and experience as well as immediate administrative priorities. These solutions are often fragile, highly specific, and more difficult to reuse in other scenarios.

2020-08-03
Li, Guanyu, Zhang, Menghao, Liu, Chang, Kong, Xiao, Chen, Ang, Gu, Guofei, Duan, Haixin.  2019.  NETHCF: Enabling Line-rate and Adaptive Spoofed IP Traffic Filtering. 2019 IEEE 27th International Conference on Network Protocols (ICNP). :1–12.
In this paper, we design NETHCF, a line-rate in-network system for filtering spoofed traffic. NETHCF leverages the opportunity provided by programmable switches to design a novel defense against spoofed IP traffic, and it is highly efficient and adaptive. One key challenge stems from the restrictions of the computational model and memory resources of programmable switches. We address this by decomposing the HCF system into two complementary components-one component for the data plane and another for the control plane. We also aggregate the IP-to-Hop-Count (IP2HC) mapping table for efficient memory usage, and design adaptive mechanisms to handle end-to-end routing changes, IP popularity changes, and network activity dynamics. We have built a prototype on a hardware Tofino switch, and our evaluation demonstrates that NETHCF can achieve line-rate and adaptive traffic filtering with low overheads.
2020-03-30
Miao, Hui, Deshpande, Amol.  2019.  Understanding Data Science Lifecycle Provenance via Graph Segmentation and Summarization. 2019 IEEE 35th International Conference on Data Engineering (ICDE). :1710–1713.
Increasingly modern data science platforms today have non-intrusive and extensible provenance ingestion mechanisms to collect rich provenance and context information, handle modifications to the same file using distinguishable versions, and use graph data models (e.g., property graphs) and query languages (e.g., Cypher) to represent and manipulate the stored provenance/context information. Due to the schema-later nature of the metadata, multiple versions of the same files, and unfamiliar artifacts introduced by team members, the resulting "provenance graphs" are quite verbose and evolving; further, it is very difficult for the users to compose queries and utilize this valuable information just using standard graph query model. In this paper, we propose two high-level graph query operators to address the verboseness and evolving nature of such provenance graphs. First, we introduce a graph segmentation operator, which queries the retrospective provenance between a set of source vertices and a set of destination vertices via flexible boundary criteria to help users get insight about the derivation relationships among those vertices. We show the semantics of such a query in terms of a context-free grammar, and develop efficient algorithms that run orders of magnitude faster than state-of-the-art. Second, we propose a graph summarization operator that combines similar segments together to query prospective provenance of the underlying project. The operator allows tuning the summary by ignoring vertex details and characterizing local structures, and ensures the provenance meaning using path constraints. We show the optimal summary problem is PSPACE-complete and develop effective approximation algorithms. We implement the operators on top of Neo4j, evaluate our query techniques extensively, and show the effectiveness and efficiency of the proposed methods.
2020-02-26
Inaba, Koutaro, Yoneda, Tomohiro, Kanamoto, Toshiki, Kurokawa, Atsushi, Imai, Masashi.  2019.  Hardware Trojan Insertion and Detection in Asynchronous Circuits. 2019 25th IEEE International Symposium on Asynchronous Circuits and Systems (ASYNC). :134–143.

Hardware Trojan threats caused by malicious designers and untrusted manufacturers have become one of serious issues in modern VLSI systems. In this paper, we show some experimental results to insert hardware Trojans into asynchronous circuits. As a result, the overhead of hardware Trojan insertion in asynchronous circuits may be small for malicious designers who have enough knowledge about the asynchronous circuits. In addition, we also show several Trojan detection methods using deep learning schemes which have been proposed to detect synchronous hardware Trojan in the netlist level. We apply them to asynchronous hardware Trojan circuits and show their results. They have a great potential to detect a hardware Trojan in asynchronous circuits.

2020-02-18
Yu, Bong-yeol, Yang, Gyeongsik, Jin, Heesang, Yoo, Chuck.  2019.  White Visor: Support of White-Box Switch in SDN-Based Network Hypervisor. 2019 International Conference on Information Networking (ICOIN). :242–247.

Network virtualization is a fundamental technology for datacenters and upcoming wireless communications (e.g., 5G). It takes advantage of software-defined networking (SDN) that provides efficient network management by converting networking fabrics into SDN-capable devices. Moreover, white-box switches, which provide flexible and fast packet processing, are broadly deployed in commercial datacenters. A white-box switch requires a specific and restricted packet processing pipeline; however, to date, there has been no SDN-based network hypervisor that can support the pipeline of white-box switches. Therefore, in this paper, we propose WhiteVisor: a network hypervisor which can support the physical network composed of white-box switches. WhiteVisor converts a flow rule from the virtual network into a packet processing pipeline compatible with the white-box switch. We implement the prototype herein and show its feasibility and effectiveness with pipeline conversion and overhead.

2019-10-14
Li, W., Ma, Y., Yang, Q., Li, M..  2018.  Hardware-Based Adversary-Controlled States Tracking. 2018 IEEE 4th International Conference on Computer and Communications (ICCC). :1366–1370.

Return Oriented Programming is one of the most important software security challenges nowadays. It exploits memory vulnerabilities to control the state of the program and hijacks its control flow. Existing defenses usually focus on how to protect the control flow or face the challenge of how to maintain the taint markings for memory data. In this paper, we directly focus on the adversary-controlled states, simplify the classic dynamic taint analysis method to only track registers and propose Hardware-based Adversary-controlled States Tracking (HAST). HAST dynamically tracks registers that may be controlled by the adversary to detect ROP attack. It is transparent to user application and makes few modifications to existing hardware. Our evaluation demonstrates that HAST will introduce almost no performance overhead and can effectively detect ROP attacks without false positives on the tested common Linux applications.

2019-09-30
Liu, B., He, L., Zhang, H., Sfarra, S., Fernandes, H., Perilli, S., Ren, J..  2019.  Research on stress detection technology of long-distance pipeline applying non-magnetic saturation. IET Science, Measurement Technology. 13:168–174.

In order to study the stress detection method on long-distance oil and gas pipeline, the distribution characteristics of the surface remanence signals in the stress concentration regions must be known. They were studied by using the magnetic domain model in the non-magnetic saturation state. The finite element method was used herein with the aim to analyse the static and mechanical characteristics of a ferromagnetic specimen. The variation law of remanence signal in stress concentration regions was simulated. The results show that a residue signal in the stress concentration region exists. In addition, a one-to-one correspondence in the non-magnetic saturation environment is evident. In the case of magnetic saturation, the remanence signal of the stress concentration region is covered and the signal cannot be recognised.

2019-09-26
Khatchadourian, R., Tang, Y., Bagherzadeh, M., Ahmed, S..  2019.  Safe Automated Refactoring for Intelligent Parallelization of Java 8 Streams. 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE). :619-630.

Streaming APIs are becoming more pervasive in mainstream Object-Oriented programming languages. For example, the Stream API introduced in Java 8 allows for functional-like, MapReduce-style operations in processing both finite and infinite data structures. However, using this API efficiently involves subtle considerations like determining when it is best for stream operations to run in parallel, when running operations in parallel can be less efficient, and when it is safe to run in parallel due to possible lambda expression side-effects. In this paper, we present an automated refactoring approach that assists developers in writing efficient stream code in a semantics-preserving fashion. The approach, based on a novel data ordering and typestate analysis, consists of preconditions for automatically determining when it is safe and possibly advantageous to convert sequential streams to parallel and unorder or de-parallelize already parallel streams. The approach was implemented as a plug-in to the Eclipse IDE, uses the WALA and SAFE analysis frameworks, and was evaluated on 11 Java projects consisting of ?642K lines of code. We found that 57 of 157 candidate streams (36.31%) were refactorable, and an average speedup of 3.49 on performance tests was observed. The results indicate that the approach is useful in optimizing stream code to their full potential.

2019-07-01
Perez, R. Lopez, Adamsky, F., Soua, R., Engel, T..  2018.  Machine Learning for Reliable Network Attack Detection in SCADA Systems. 2018 17th IEEE International Conference On Trust, Security And Privacy In Computing And Communications/ 12th IEEE International Conference On Big Data Science And Engineering (TrustCom/BigDataSE). :633–638.

Critical Infrastructures (CIs) use Supervisory Control And Data Acquisition (SCADA) systems for remote control and monitoring. Sophisticated security measures are needed to address malicious intrusions, which are steadily increasing in number and variety due to the massive spread of connectivity and standardisation of open SCADA protocols. Traditional Intrusion Detection Systems (IDSs) cannot detect attacks that are not already present in their databases. Therefore, in this paper, we assess Machine Learning (ML) for intrusion detection in SCADA systems using a real data set collected from a gas pipeline system and provided by the Mississippi State University (MSU). The contribution of this paper is two-fold: 1) The evaluation of four techniques for missing data estimation and two techniques for data normalization, 2) The performances of Support Vector Machine (SVM), and Random Forest (RF) are assessed in terms of accuracy, precision, recall and F1score for intrusion detection. Two cases are differentiated: binary and categorical classifications. Our experiments reveal that RF detect intrusions effectively, with an F1score of respectively \textbackslashtextgreater 99%.

2018-11-19
Chelaramani, S., Jha, A., Namboodiri, A. M..  2018.  Cross-Modal Style Transfer. 2018 25th IEEE International Conference on Image Processing (ICIP). :2157–2161.

We, humans, have the ability to easily imagine scenes that depict sentences such as ``Today is a beautiful sunny day'' or ``There is a Christmas feel, in the air''. While it is hard to precisely describe what one person may imagine, the essential high-level themes associated with such sentences largely remains the same. The ability to synthesize novel images that depict the feel of a sentence is very useful in a variety of applications such as education, advertisement, and entertainment. While existing papers tackle this problem given a style image, we aim to provide a far more intuitive and easy to use solution that synthesizes novel renditions of an existing image, conditioned on a given sentence. We present a method for cross-modal style transfer between an English sentence and an image, to produce a new image that imbibes the essential theme of the sentence. We do this by modifying the style transfer mechanism used in image style transfer to incorporate a style component derived from the given sentence. We demonstrate promising results using the YFCC100m dataset.

Langfinger, M., Schneider, M., Stricker, D., Schotten, H. D..  2017.  Addressing Security Challenges in Industrial Augmented Reality Systems. 2017 IEEE 15th International Conference on Industrial Informatics (INDIN). :299–304.

In context of Industry 4.0 Augmented Reality (AR) is frequently mentioned as the upcoming interface technology for human-machine communication and collaboration. Many prototypes have already arisen in both the consumer market and in the industrial sector. According to numerous experts it will take only few years until AR will reach the maturity level to be deployed in productive applications. Especially for industrial usage it is required to assess security risks and challenges this new technology implicates. Thereby we focus on plant operators, Original Equipment Manufacturers (OEMs) and component vendors as stakeholders. Starting from several industrial AR use cases and the structure of contemporary AR applications, in this paper we identify security assets worthy of protection and derive the corresponding security goals. Afterwards we elaborate the threats industrial AR applications are exposed to and develop an edge computing architecture for future AR applications which encompasses various measures to reduce security risks for our stakeholders.

2018-05-09
Dering, M. L., Tucker, C. S..  2017.  Generative Adversarial Networks for Increasing the Veracity of Big Data. 2017 IEEE International Conference on Big Data (Big Data). :2595–2602.

This work describes how automated data generation integrates in a big data pipeline. A lack of veracity in big data can cause models that are inaccurate, or biased by trends in the training data. This can lead to issues as a pipeline matures that are difficult to overcome. This work describes the use of a Generative Adversarial Network to generate sketch data, such as those that might be used in a human verification task. These generated sketches are verified as recognizable using a crowd-sourcing methodology, and finds that the generated sketches were correctly recognized 43.8% of the time, in contrast to human drawn sketches which were 87.7% accurate. This method is scalable and can be used to generate realistic data in many domains and bootstrap a dataset used for training a model prior to deployment.