Biblio

Nathan Malkin, Primal Wijesekera, Serge Egelman, David Wagner.  2018.  Use Case: Passively Listening Personal Assistants. Symposium on Applications of Contextual Integrity. :26-27.
Wijesekera, Primal.  2018.  Contextual permission models for better privacy protection. Electronic Theses and Dissertations (ETDs) 2008+.

Despite corporate cyber intrusions attracting all the attention, privacy breaches that we, as ordinary users, should be worried about occur every day without any scrutiny. Smartphones, a household item, have inadvertently become a major enabler of privacy breaches. Smartphone platforms use permission systems to regulate access to sensitive resources. These permission systems, however, lack the ability to understand users’ privacy expectations, leaving a significant gap between how permission models behave and how users would want the platform to protect their sensitive data. This dissertation provides an in-depth analysis of how users make privacy decisions in the context of smartphones and how platforms can accommodate users’ privacy requirements systematically. We first performed a 36-person field study to quantify how often applications access protected resources when users are not expecting it. We found that when the application requesting the permission is running invisibly to the user, users are more likely to deny it access to protected resources. At least 80% of our participants would have preferred to prevent at least one permission request. To explore the feasibility of predicting users’ privacy decisions based on their past decisions, we performed a longitudinal 131-person field study. Based on the data, we built a classifier to make privacy decisions on the user’s behalf by detecting when the context has changed and inferring privacy preferences based on the user’s past decisions. We showed that our approach can accurately predict users’ privacy decisions 96.8% of the time, which is an 80% reduction in error rate compared to current systems. Based on these findings, we developed a custom Android version with a contextually aware permission model. The new model guards resources based on the user’s past decisions under similar contextual circumstances. We performed a 38-person field study to measure the efficiency and usability of the new permission model. Based on exit interviews and 5M data points, we found that the new system is effective in reducing the potential violations by 75%. Despite being significantly more restrictive than the default permission systems, participants did not find the new model to cause any usability issues in terms of application functionality.
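The two headline figures in this abstract (96.8% accuracy and an 80% reduction in error rate) can be related with a line of arithmetic. The sketch below assumes the 80% figure is a relative reduction of the error rate; the function name is illustrative and the implied baseline is a derived value, not a number quoted in the abstract.

```python
def implied_baseline_error(new_accuracy: float, relative_reduction: float) -> float:
    """Error rate of the comparison system implied by the reported figures.

    Assumes new_error = baseline_error * (1 - relative_reduction).
    """
    new_error = 1.0 - new_accuracy
    return new_error / (1.0 - relative_reduction)


# Figures quoted in the abstract: 96.8% accuracy, 80% fewer errors.
baseline = implied_baseline_error(0.968, 0.80)
print(f"implied baseline error rate: {baseline:.1%}")
```

Under that assumption, a 3.2% error rate after an 80% relative reduction corresponds to a baseline error rate of about 16%.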

Reyes, Irwin, Wijesekera, Primal, Reardon, Joel, Elazari, Amit, Razaghpanah, Abbas, Vallina-Rodriguez, Narseo, Egelman, Serge.  2018.  “Won’t Somebody Think of the Children?” Examining COPPA Compliance at Scale. Proceedings on Privacy Enhancing Technologies. 2018:63-83.

We present a scalable dynamic analysis framework that allows for the automatic evaluation of the privacy behaviors of Android apps. We use our system to analyze mobile apps’ compliance with the Children’s Online Privacy Protection Act (COPPA), one of the few stringent privacy laws in the U.S. Based on our automated analysis of 5,855 of the most popular free children’s apps, we found that a majority are potentially in violation of COPPA, mainly due to their use of third-party SDKs. While many of these SDKs offer configuration options to respect COPPA by disabling tracking and behavioral advertising, our data suggest that a majority of apps either do not make use of these options or incorrectly propagate them across mediation SDKs. Worse, we observed that 19% of children’s apps collect identifiers or other personally identifiable information (PII) via SDKs whose terms of service outright prohibit their use in child-directed apps. Finally, we show that efforts by Google to limit tracking through the use of a resettable advertising ID have had little success: of the 3,454 apps that share the resettable ID with advertisers, 66% transmit other, non-resettable, persistent identifiers as well, negating any intended privacy-preserving properties of the advertising ID.

Heigl, Michael, Schramm, Martin, Fiala, Dalibor.  2019.  A Lightweight Quantum-Safe Security Concept for Wireless Sensor Network Communication. 2019 IEEE International Conference on Pervasive Computing and Communications Workshops (PerCom Workshops). :906–911.
The ubiquitous internetworking of devices in all areas of life is boosted by various trends, for instance the Internet of Things. Promising technologies that can be used for such future environments come from Wireless Sensor Networks. They ensure connectivity between distributed, tiny and simple sensor nodes, as well as between sensor nodes and base stations, in order to monitor physical or environmental conditions such as vibrations, temperature or motion. Security will play an increasingly important role in the coming decades, in which attacking strategies are becoming more and more sophisticated. Contemporary cryptographic mechanisms face a great threat from quantum computers in the near future and, together with Intrusion Detection Systems, are hardly applicable on sensors due to strict resource constraints. Thus, in this work a future-proof lightweight and resource-aware security concept for sensor networks with a processing stage permeated filtering mechanism is proposed. A special focus in the concept's evaluation lies on the novel Magic Number filter to mitigate a special kind of Denial-of-Service attack performed on CC1350 LaunchPad ARM Cortex-M3 microcontroller boards.
Vasiliu, Yevhen, Limar, Igor, Gancarczyk, Tomasz, Karpinski, Mikolaj.  2019.  New Quantum Secret Sharing Protocol Using Entangled Qutrits. 2019 10th IEEE International Conference on Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications (IDAACS). 1:324–329.
A new quantum secret sharing protocol based on the ping-pong protocol of quantum secure direct communication is proposed. Pairs of entangled qutrits are used in the protocol, which allows an increase in the information capacity compared with protocols based on entangled qubits. Detection of channel eavesdropping is performed at random moments in time, which makes it possible to avoid using a significant amount of quantum memory. The security of the proposed protocol against attacks is considered. A method for additional amplification of the security against an eavesdropping attack in communication channels is proposed for the developed protocol.
Hu, Zhengbing, Vasiliu, Yevhen, Smirnov, Oleksii, Sydorenko, Viktoriia, Polishchuk, Yuliia.  2019.  Abstract Model of Eavesdropper and Overview on Attacks in Quantum Cryptography Systems. 2019 10th IEEE International Conference on Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications (IDAACS). 1:399–405.
In today's world, it is almost impossible to find a sphere of human life in which information technologies are not used. On the one hand, this simplifies human life: virtually everyone carries a mini-computer in their pocket, which allows many operations that used to take a lot of time to be performed in minutes. In addition, IT has simplified and rapidly developed areas such as medicine, banking, document circulation, the military, and many other infrastructures of the state. Nevertheless, even today, privacy remains a major problem in many information transactions. One of the most important directions for ensuring information confidentiality in open communication networks has been and remains its protection by cryptographic methods. Although it is known that traditional cryptography methods give reasons to doubt their reliability, quantum cryptography has proven itself as a more reliable information security technology. As it is quite a new direction, there is no sufficiently complete classification of attacks on quantum cryptography methods; in view of this, a new extended classification of attacks on quantum protocols and quantum cryptosystems is proposed in this work. The classification takes into account the newest attacks (which use device loopholes) on quantum key distribution equipment. These attacks have been named "quantum hacking". Such a classification may be useful for choosing a commercially available quantum key distribution system. An abstract model of an eavesdropper in quantum systems was also created; it makes it possible to determine a set of measures of various kinds that need to be further implemented to provide reliable security with the help of specific quantum systems.
Brito, J. P., López, D. R., Aguado, A., Abellán, C., López, V., Pastor-Perales, A., la Iglesia, F. de, Martín, V.  2019.  Quantum Services Architecture in Softwarized Infrastructures. 2019 21st International Conference on Transparent Optical Networks (ICTON). :1–4.
Quantum computing is posing new threats to our security infrastructure. This has triggered a new research field on quantum-safe methods, and those that rely on the application of quantum principles are commonly referred to as quantum cryptography. The most mature development in the field of quantum cryptography is called Quantum Key Distribution (QKD). QKD is a key exchange primitive that can replace existing mechanisms that may become obsolete in the near future. Although QKD has reached a high level of maturity, there is still a long path to a mass market implementation. QKD must overcome issues such as miniaturization, network integration and the reduction of production costs to make the technology affordable. In this direction, we foresee that QKD systems will evolve following the same path as other networking technologies, where systems will run on specific network cards, integrable in commodity chassis. This work describes part of our activity in the EU H2020 project CiViQ, in which quantum technologies, such as QKD systems or quantum random number generators (QRNG), will become a single network element that we define as a Quantum Switch. This allows for quantum resources (keys or random numbers) to be provided as a service, while the different components are integrated to cooperate in providing the most random and secure bit streams. Furthermore, with the purpose of bringing our proposal closer to current networking technology, this work also proposes an abstraction logic for making our Quantum Switch suitable to become part of software-defined networking (SDN) architectures. The model fits in the architecture of the SDN quantum node, which is under standardization by the European Telecommunications Standards Institute. It permits operating an entire quantum network using a logically centralized SDN controller, with quantum switches generating and forwarding key material and random numbers across the entire network. This scheme, demonstrated for the first time at the Madrid Quantum Network, will allow for a faster and seamless integration of quantum technologies in the telecommunications infrastructure.
Dreher, Patrick, Ramasami, Madhuvanti.  2019.  Prototype Container-Based Platform for Extreme Quantum Computing Algorithm Development. 2019 IEEE High Performance Extreme Computing Conference (HPEC). :1–7.
Recent advances in the development of the first generation of quantum computing devices have provided researchers with computational platforms to explore new ideas and reformulate conventional computational codes suitable for a quantum computer. Developers can now implement these reformulations on both quantum simulators and hardware platforms through a cloud computing software environment. For example, the IBM Q Experience provides the direct access to their quantum simulators and quantum computing hardware platforms. However these current access options may not be an optimal environment for developers needing to download and modify the source codes and libraries. This paper focuses on the construction of a Docker container environment with Qiskit source codes and libraries running on a local cloud computing system that can directly access the IBM Q Experience. This prototype container based system allows single user and small project groups to do rapid prototype development, testing and implementation of extreme capability algorithms with more agility and flexibility than can be provided through the IBM Q Experience website. This prototype environment also provides an excellent teaching environment for labs and project assignments within graduate courses in cloud computing and quantum computing. The paper also discusses computer security challenges for expanding this prototype container system to larger groups of quantum computing researchers.
Diamanti, Eleni.  2019.  Demonstrating Quantum Advantage in Security and Efficiency with Practical Photonic Systems. 2019 21st International Conference on Transparent Optical Networks (ICTON). :1–2.
We discuss the current landscape in quantum communication and cryptography, and focus in particular on recent photonic implementations, using encoding in discrete or continuous properties of light, of central quantum network protocols, enabling secret key distribution, verification of entangled resources and transactions of quantum money, with maximal security guarantees. We also describe current challenges in this field and our efforts towards the miniaturization of the developed photonic systems, their integration into telecommunication network infrastructures, including with satellite links, as well as the practical demonstration of novel protocols featuring a quantum advantage in communication efficiency for a wide range of useful tasks in a network environment. These advances enrich the resources and applications of the emerging quantum networks that will play a central role in the context of future quantum-safe communications.
Mao, Huajian, Chi, Chenyang, Yu, Jinghui, Yang, Peixiang, Qian, Cheng, Zhao, Dongsheng.  2019.  QRStream: A Secure and Convenient Method for Text Healthcare Data Transferring. 2019 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). :3458–3462.
With increasing health awareness, users are becoming more and more interested in their daily health information and in the results of healthcare activities from healthcare organizations. They try to collect these data together for better usage. Traditionally, healthcare data is delivered on paper by the healthcare organizations, which is not easy or convenient for data usage and management. Users have to transcribe these data from paper to a digital version, which is likely to introduce mistakes into the data. A secure and convenient method for transferring electronic health data between users and healthcare organizations is therefore needed. However, because of security and privacy problems, almost no healthcare organization provides a stable and full service for health data delivery. In this paper, we propose a secure and convenient method, QRStream, which splits the original health data and loads it onto a stream of QR code frames for the data transfer. The results show that QRStream can transfer text health data smoothly with acceptable performance, for example, transferring 10K of data in 10 seconds.
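The splitting step described in this abstract can be illustrated with a small sketch. The abstract does not specify QRStream's frame format, so the `index/total|payload` header, the 100-character payload size, and both function names below are assumptions made for illustration; rendering each frame as an actual QR code is left out.

```python
from typing import Dict, Iterable, List


def split_into_frames(text: str, payload_size: int = 100) -> List[str]:
    """Split text into numbered frames; each frame would become one QR code."""
    chunks = [text[i:i + payload_size] for i in range(0, len(text), payload_size)] or [""]
    total = len(chunks)
    # Hypothetical header "index/total|" lets the receiver reorder frames.
    return [f"{index}/{total}|{chunk}" for index, chunk in enumerate(chunks)]


def reassemble(frames: Iterable[str]) -> str:
    """Rebuild the text from frames received in any order."""
    parts: Dict[int, str] = {}
    total = None
    for frame in frames:
        header, _, chunk = frame.partition("|")
        index, _, count = header.partition("/")
        parts[int(index)] = chunk
        total = int(count)
    if total is None or len(parts) != total:
        raise ValueError("missing frames")
    return "".join(parts[i] for i in range(total))


record = "blood pressure 120/80; heart rate 64; " * 10
frames = split_into_frames(record)
print(len(frames), reassemble(reversed(frames)) == record)
```

Numbering the frames means a scanner that misses a frame on one pass of the stream can pick it up on a later pass.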
Li, Jian, Zhang, Zelin, Li, Shengyu, Benton, Ryan, Huang, Yulong, Kasukurthi, Mohan Vamsi, Li, Dongqi, Lin, Jingwei, Borchert, Glen M., Tan, Shaobo et al.  2019.  Reversible Data Hiding Based Key Region Protection Method in Medical Images. 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). :1526–1530.
The transmission of medical image data in an open network environment is subject to privacy issues including patient privacy and data leakage. In the past, image encryption and information-hiding technology have been used to solve such security problems. But these methodologies, in general, suffered from difficulties in retrieving original images. We present in this paper an algorithm to protect key regions in medical images. First, coefficient of variation is used to locate the key regions, a.k.a. the lesion areas, of an image; other areas are then processed in blocks and analyzed for texture complexity. Next, our reversible data-hiding algorithm is used to embed the contents from the lesion areas into a high-texture area, and the Arnold transformation is performed to protect the original lesion information. In addition to this, we use the ciphertext of the basic information about the image and the decryption parameter to generate the Quick Response (QR) Code to replace the original key regions. Consequently, only authorized customers can obtain the encryption key to extract information from encrypted images. Experimental results show that our algorithm can not only restore the original image without information loss, but also safely transfer the medical image copyright and patient-sensitive information.
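The Arnold transformation mentioned in this abstract is a reversible pixel-scrambling map on a square block. The sketch below assumes the classic form (x, y) -> (x + y, x + 2y) mod N; the abstract does not state which variant, parameters or iteration count the authors use, so these are illustrative choices.

```python
def arnold(block, iterations=1):
    """Scramble a square block with the classic Arnold cat map."""
    n = len(block)
    for _ in range(iterations):
        out = [[0] * n for _ in range(n)]
        for x in range(n):
            for y in range(n):
                out[(x + y) % n][(x + 2 * y) % n] = block[x][y]
        block = out
    return block


def arnold_inverse(block, iterations=1):
    """Undo arnold() using the inverse map (x, y) -> (2x - y, y - x) mod N."""
    n = len(block)
    for _ in range(iterations):
        out = [[0] * n for _ in range(n)]
        for x in range(n):
            for y in range(n):
                out[(2 * x - y) % n][(y - x) % n] = block[x][y]
        block = out
    return block


region = [[4 * row + col for col in range(4)] for row in range(4)]
scrambled = arnold(region, iterations=3)
print(arnold_inverse(scrambled, iterations=3) == region)
```

Because the map is a bijection on the pixel grid, no information is lost, which is consistent with the reversibility the abstract emphasizes.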
Huang, Jinjing, Cheng, Shaoyin, Lou, Songhao, Jiang, Fan.  2019.  Image steganography using texture features and GANs. 2019 International Joint Conference on Neural Networks (IJCNN). :1–8.
As steganography is the main practice of hidden writing, many deep neural networks are proposed to conceal secret information into images, whose invisibility and security are unsatisfactory. In this paper, we present an encoder-decoder framework with an adversarial discriminator to conceal messages or images into natural images. The message is embedded into QR code first which significantly improves the fault-tolerance. Considering the mean squared error (MSE) is not conducive to perfectly learn the invisible perturbations of cover images, we introduce a texture-based loss that is helpful to hide information into the complex texture regions of an image, improving the invisibility of hidden information. In addition, we design a truncated layer to cope with stego image distortions caused by data type conversion and a moment layer to train our model with varisized images. Finally, our experiments demonstrate that the proposed model improves the security and visual quality of stego images.
Mashaly, Maggie, El Saied, Ahmed, Alexan, Wassim, Khalifa, Abeer S..  2019.  A Multiple Layer Security Scheme Utilizing Information Matrices. 2019 Signal Processing: Algorithms, Architectures, Arrangements, and Applications (SPA). :284–289.
This paper proposes a double-layer message security scheme that is implemented in two stages. First, the secret data is encrypted using the AES algorithm with a 256-bit key. Second, least significant bit (LSB) embedding is carried out, by hiding the secret message into an image of an information matrix. A number of performance evaluation metrics are discussed and computed for the proposed scheme. The obtained results are compared to other schemes in literature and show the superiority of the proposed scheme.
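The second stage of the scheme, LSB embedding, can be sketched as follows. The sketch assumes the cover image is a flat list of 8-bit pixel values and treats the payload as already encrypted bytes; the AES-256 stage is omitted, and the function names and bit order are illustrative rather than taken from the paper.

```python
def embed_lsb(pixels, payload: bytes):
    """Hide payload bits in the least significant bit of successive pixels."""
    bits = [(byte >> shift) & 1 for byte in payload for shift in range(7, -1, -1)]
    if len(bits) > len(pixels):
        raise ValueError("cover image too small for payload")
    stego = list(pixels)
    for i, bit in enumerate(bits):
        stego[i] = (stego[i] & ~1) | bit  # clear the LSB, then set it to the bit
    return stego


def extract_lsb(pixels, n_bytes: int) -> bytes:
    """Read n_bytes back from the least significant bits, MSB first."""
    out = bytearray()
    for b in range(n_bytes):
        value = 0
        for pixel in pixels[8 * b: 8 * b + 8]:
            value = (value << 1) | (pixel & 1)
        out.append(value)
    return bytes(out)


cover = list(range(256))              # stand-in for image pixel values
ciphertext = b"\x8f\x01\x42"          # stand-in for AES output
stego = embed_lsb(cover, ciphertext)
print(extract_lsb(stego, len(ciphertext)) == ciphertext)
```

Each pixel changes by at most one intensity level, which is why LSB embedding is hard to see in the stego image.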
Ximenes, Agostinho Marques, Sukaridhoto, Sritrusta, Sudarsono, Amang, Ulil Albaab, Mochammad Rifki, Basri, Hasan, Hidayat Yani, Muhammad Aksa, Chang Choon, Chew, Islam, Ezharul.  2019.  Implementation QR Code Biometric Authentication for Online Payment. 2019 International Electronics Symposium (IES). :676–682.
According to Indonesian statistics, the level of society rose in 2019. Based on these data, banks promote simple payment transactions in the market. Banks have so far relied on debit or credit cards for transactions, but this requires further, very expensive investment in infrastructure. For that reason, banks need another solution with low-cost infrastructure. QR Code Biometric Authentication for online payment is one solution that meets this need. The application is used for payment at online merchants. Transaction permission in this study relies on biometric encryption and decryption and on QR code scanning to improve the security of communication and transaction data. Test results for the Biometric Cloud Authentication Platform implementation show that AES 256 agents can be implemented for face biometric encryption and decryption. QR code scanning combined with face verification for transaction permission achieves an accuracy rate of 95% for a sample of 10 people, and the transaction process takes 53.21 seconds per transaction over a sample of 100 transactions.
Khan, Abdul Ghaffar, Zahid, Amjad Hussain, Hussain, Muzammil, Riaz, Usama.  2019.  Security Of Cryptocurrency Using Hardware Wallet And QR Code. 2019 International Conference on Innovative Computing (ICIC). :1–10.
Today, privacy and security are key requirements for any organization. Digital online transactions of money or coins also need a certain level of security, not only while the transaction is being broadcast but also before it is sent. In this research paper we propose and implement a cryptocurrency (Bitcoin) wallet for the Android operating system, using a QR code-based Android application and secure private key storage (cold wallet). Two Android applications have been implemented: one is called the cold wallet and the other the hot wallet. The cold wallet (offline) stores and generates the private key addresses for secure transaction confirmation, and the hot wallet is used to send bitcoin to the network. The hot wallet application lets the user view the history of performed transactions, compose and send a new bitcoin transaction, receive bitcoin, sign it and send it to the network. Cross QR code scanning between the hot and cold wallets for identification, validation and authentication of the user makes the scheme secure.
Ahamed, Md. Salahuddin, Asiful Mustafa, Hossen.  2019.  A Secure QR Code System for Sharing Personal Confidential Information. 2019 International Conference on Computer, Communication, Chemical, Materials and Electronic Engineering (IC4ME2). :1–4.
Securing and hiding personal confidential information has become a challenge in these modern days. Due to the lack of security and confidentiality, forgery of confidential information can cause a big margin loss to a person. Personal confidential information needs to be securely shared and hidden with the expected recipient and he should be able to verify the information by checking its authenticity. QR codes are being used increasingly to share data for different purposes. In information communication, QR code is important because of its high data capacity. However, most existing QR code systems use insecure data format and encryption is rarely used. A user can use Secure QR Code (SQRC) technology to keep information secured and hidden. In this paper, we propose a novel SQRC system which will allow sharing authentic personal confidential information by means of QR code verification using RSA digital signature algorithm and also allow authorizing the information by means of QR code validation using RSA public key cryptographic algorithm. We implemented the proposed SQRC system and showed that the system is effective for sharing personal confidential information securely.
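The sign-then-verify idea behind the proposed system can be shown with textbook RSA. This is a toy sketch only: the key is built from two well-known Mersenne primes, there is no padding, and the function names are hypothetical, so it says nothing about the authors' actual SQRC format and must not be used for real data.

```python
import hashlib

# Toy key material (illustrative assumption): two known Mersenne primes.
P = 2**61 - 1
Q = 2**89 - 1
N = P * Q
E = 65537
D = pow(E, -1, (P - 1) * (Q - 1))  # private exponent; needs Python 3.8+


def digest_as_int(payload: bytes) -> int:
    """SHA-256 digest of the QR payload, reduced modulo the toy modulus."""
    return int.from_bytes(hashlib.sha256(payload).digest(), "big") % N


def sign(payload: bytes) -> int:
    """Signature the sender would place in the QR code next to the payload."""
    return pow(digest_as_int(payload), D, N)


def verify(payload: bytes, signature: int) -> bool:
    """Check with the public key (N, E) that the payload was not altered."""
    return pow(signature, E, N) == digest_as_int(payload)


payload = b"name=Alice;id=1234"
signature = sign(payload)
print(verify(payload, signature), verify(b"name=Mallory;id=1234", signature))
```

A production system would use a vetted cryptography library with a standard padding scheme instead of raw modular exponentiation.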
Jin, Yong, Tomoishi, Masahiko.  2019.  Encrypted QR Code Based Optical Challenge-Response Authentication by Mobile Devices for Mounting Concealed File System. 2019 IEEE 43rd Annual Computer Software and Applications Conference (COMPSAC). 2:676–681.
Nowadays mobile devices have become the main terminals people use for social activities, so carrying business data and private information on them has become normal. Accordingly, the risk of data-related cyber attacks has become one of the most critical security concerns. The main purpose of this work is to mitigate the risk of data breaches and damage caused by malware and the loss of mobile devices. In this paper, we propose an encrypted QR code based optical challenge-response authentication by mobile devices for mounting concealed file systems. The concealed file system is basically invisible to users unless it is successfully mounted. The proposed authentication scheme applies cryptography and QR code technologies to a challenge-response scheme in order to secure the concealed file system. The key contribution of this work is to clarify the possibility of a mounting authentication scheme involving two mobile devices that uses a special optical communication channel (QR code exchanges) and can be realized without any network access. We implemented a prototype system and, based on the preliminary feature evaluation results, confirmed that encrypted QR code based optical challenge-response is possible between a laptop and a smartphone and that it can be applied to authentication for mounting concealed file systems.
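The challenge-response pattern underlying this scheme can be sketched without any QR handling. The sketch below swaps in HMAC-SHA256 over a random nonce with a pre-shared key; the paper's own cryptographic construction and message layout are not given in the abstract, so the function names and the choice of HMAC are assumptions.

```python
import hashlib
import hmac
import secrets


def make_challenge() -> bytes:
    """Fresh random nonce; the laptop would display it as a QR code."""
    return secrets.token_bytes(16)


def respond(shared_key: bytes, challenge: bytes) -> bytes:
    """Response computed on the phone; it would be shown back as a QR code."""
    return hmac.new(shared_key, challenge, hashlib.sha256).digest()


def verify(shared_key: bytes, challenge: bytes, response: bytes) -> bool:
    """The laptop mounts the concealed file system only if this returns True."""
    return hmac.compare_digest(respond(shared_key, challenge), response)


key = secrets.token_bytes(32)
challenge = make_challenge()
print(verify(key, challenge, respond(key, challenge)))
```

Using a fresh nonce per attempt means a photographed response cannot be replayed against a later challenge.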
Verma, Rajat Singh, Chandavarkar, B. R., Nazareth, Pradeep.  2019.  Mitigation of hard-coded credentials related attacks using QR code and secured web service for IoT. 2019 10th International Conference on Computing, Communication and Networking Technologies (ICCCNT). :1–5.
Hard-coded credentials such as clear text log-in id and password provided by the IoT manufacturers and unsecured ways of remotely accessing IoT devices are the major security concerns of industry and academia. Limited memory, power, and processing capabilities of IoT devices further worsen the situations in improving the security of IoT devices. In such scenarios, a lightweight security algorithm up to some extent can minimize the risk. This paper proposes one such approach using Quick Response (QR) code to mitigate hard-coded credentials related attacks such as Mirai malware, wreak havoc, etc. The QR code based approach provides non-clear text unpredictable login id and password. Further, this paper also proposes a secured way of remotely accessing IoT devices through modified https. The proposed algorithms are implemented and verified using Raspberry Pi 3 model B.
Abdolahi, Mahssa, Jiang, Hao, Kaminska, Bozena.  2019.  Robust data retrieval from high-security structural colour QR codes via histogram equalization and decorrelation stretching. 2019 IEEE 10th Annual Ubiquitous Computing, Electronics Mobile Communication Conference (UEMCON). :0340–0346.
In this work, robust readout of the data (232 English characters) stored in high-security structural colour QR codes, was achieved by using multiple image processing techniques, specifically, histogram equalization and decorrelation stretching. The decoded structural colour QR codes are generic diffractive RGB-pixelated periodic nanocones selectively activated by laser exposure to obtain the particular design of interest. The samples were imaged according to the criteria determined by the diffraction grating equation for the lighting and viewing angles given the red, green, and blue periodicities of the grating. However, illumination variations all through the samples, cross-module and cross-channel interference effects result in acquiring images with dissimilar lighting conditions which cannot be directly retrieved by the decoding script and need significant preprocessing. According to the intensity plots, even if the intensity values are very close (above 200) at some typical regions of the images with different lighting conditions, their inconsistencies (below 100) at the pixels of one representative region may lead to the requirement for using different methods for recovering the data from all red, green, and blue channels. In many cases, a successful data readout could be achieved by downscaling the images to 300-pixel dimensions (along with bilinear interpolation resampling), histogram equalization (HE), linear spatial low-pass mean filtering, and gamma function, each used either independently or with other complementary processes. The majority of images, however, could be fully decoded using decorrelation stretching (DS) either as a standalone or combinational process for obtaining a more distinctive colour definition.
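Histogram equalization, one of the preprocessing steps named in this abstract, remaps intensities through the cumulative histogram so that they span the full range. The sketch below is the standard single-channel 8-bit formulation on a flat list of pixel values; it is not the authors' decoding script, and decorrelation stretching is not covered.

```python
def equalize(pixels):
    """Histogram-equalize a flat list of 8-bit intensities (0..255)."""
    if not pixels:
        return []
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1

    # Cumulative distribution of intensities.
    cdf, running = [], 0
    for count in histogram:
        running += count
        cdf.append(running)

    total = len(pixels)
    cdf_min = next(c for c in cdf if c > 0)
    if total == cdf_min:  # constant image: nothing to stretch
        return list(pixels)

    lookup = [max(0, round((c - cdf_min) * 255 / (total - cdf_min))) for c in cdf]
    return [lookup[value] for value in pixels]


dim_channel = [100, 100, 110, 120]
print(equalize(dim_channel))
```

For a colour QR code, a transform like this would be applied per channel before thresholding the modules.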
Bharati, Aparna, Moreira, Daniel, Brogan, Joel, Hale, Patricia, Bowyer, Kevin, Flynn, Patrick, Rocha, Anderson, Scheirer, Walter.  2019.  Beyond Pixels: Image Provenance Analysis Leveraging Metadata. 2019 IEEE Winter Conference on Applications of Computer Vision (WACV). :1692–1702.
Creative works, whether paintings or memes, follow unique journeys that result in their final form. Understanding these journeys, a process known as "provenance analysis," provides rich insights into the use, motivation, and authenticity underlying any given work. The application of this type of study to the expanse of unregulated content on the Internet is what we consider in this paper. Provenance analysis provides a snapshot of the chronology and validity of content as it is uploaded, re-uploaded, and modified over time. Although still in its infancy, automated provenance analysis for online multimedia is already being applied to different types of content. Most current works seek to build provenance graphs based on the shared content between images or videos. This can be a computationally expensive task, especially when considering the vast influx of content that the Internet sees every day. Utilizing non-content-based information, such as timestamps, geotags, and camera IDs can help provide important insights into the path a particular image or video has traveled during its time on the Internet without large computational overhead. This paper tests the scope and applicability of metadata-based inferences for provenance graph construction in two different scenarios: digital image forensics and cultural analytics.
Scherzinger, Stefanie, Seifert, Christin, Wiese, Lena.  2019.  The Best of Both Worlds: Challenges in Linking Provenance and Explainability in Distributed Machine Learning. 2019 IEEE 39th International Conference on Distributed Computing Systems (ICDCS). :1620–1629.
Machine learning experts prefer to think of their input as a single, homogeneous, and consistent data set. However, when analyzing large volumes of data, the entire data set may not be manageable on a single server, but must be stored on a distributed file system instead. Moreover, with the pressing demand to deliver explainable models, the experts may no longer focus on the machine learning algorithms in isolation, but must take into account the distributed nature of the data stored, as well as the impact of any data pre-processing steps upstream in their data analysis pipeline. In this paper, we make the point that even basic transformations during data preparation can impact the model learned, and that this is exacerbated in a distributed setting. We then sketch our vision of end-to-end explainability of the model learned, taking the pre-processing into account. In particular, we point out the potentials of linking the contributions of research on data provenance with the efforts on explainability in machine learning. In doing so, we highlight pitfalls we may experience in a distributed system on the way to generating more holistic explanations for our machine learning models.
Thida, Aye, Shwe, Thanda.  2020.  Process Provenance-based Trust Management in Collaborative Fog Environment. 2020 IEEE Conference on Computer Applications (ICCA). :1–5.
With the increasing popularity and adoption of IoT technology, fog computing has been used as an advancement to cloud computing. Although trust management issues in cloud have been addressed, there are still very few studies in a fog area. Trust is needed for collaborating among fog nodes and trust can further improve the reliability by assisting in selecting the fog nodes to collaborate. To address this issue, we present a provenance based trust mechanism that traces the behavior of the process among fog nodes. Our approach adopts the completion rate and failure rate as the process provenance in trust scores of computing workload, especially obvious measures of trustworthiness. Simulation results demonstrate that the proposed system can effectively be used for collaboration in a fog environment.
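A trust score built from completion and failure rates could look like the sketch below. The abstract does not give the scoring formula, so the formula, the neutral prior for unseen nodes, and all names here are purely illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class NodeRecord:
    """Process provenance kept for one fog node (hypothetical fields)."""
    name: str
    assigned: int   # tasks handed to the node
    completed: int  # tasks finished successfully
    failed: int     # tasks that failed (the rest are still pending)


def trust_score(record: NodeRecord) -> float:
    """Map completion and failure rates to a score in [0, 1] (assumed formula)."""
    if record.assigned == 0:
        return 0.5  # neutral prior for a node with no history
    completion_rate = record.completed / record.assigned
    failure_rate = record.failed / record.assigned
    return (1.0 + completion_rate - failure_rate) / 2.0


nodes = [
    NodeRecord("fog-a", assigned=10, completed=9, failed=1),
    NodeRecord("fog-b", assigned=10, completed=5, failed=5),
    NodeRecord("fog-c", assigned=0, completed=0, failed=0),
]
best = max(nodes, key=trust_score)
print(best.name, round(trust_score(best), 2))
```

A collaborating node would then offload work to the highest-scoring peer, which matches the selection role the abstract describes for trust.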
Souza, Renan, Azevedo, Leonardo, Lourenço, Vítor, Soares, Elton, Thiago, Raphael, Brandão, Rafael, Civitarese, Daniel, Brazil, Emilio, Moreno, Marcio, Valduriez, Patrick et al.  2019.  Provenance Data in the Machine Learning Lifecycle in Computational Science and Engineering. 2019 IEEE/ACM Workflows in Support of Large-Scale Science (WORKS). :1–10.
Machine Learning (ML) has become essential in several industries. In Computational Science and Engineering (CSE), the complexity of the ML lifecycle comes from the large variety of data, scientists' expertise, tools, and workflows. If data are not tracked properly during the lifecycle, it becomes infeasible to recreate an ML model from scratch or to explain to stakeholders how it was created. The main limitation of provenance tracking solutions is that they cannot cope with provenance capture and integration of domain and ML data processed in the multiple workflows in the lifecycle, while keeping the provenance capture overhead low. To handle this problem, in this paper we contribute a detailed characterization of provenance data in the ML lifecycle in CSE; a new provenance data representation, called PROV-ML, built on top of W3C PROV and ML Schema; and extensions to a system that tracks provenance from multiple workflows to address the characteristics of ML and CSE, and to allow for provenance queries with a standard vocabulary. We show a practical use in a real case in the O&G industry, along with its evaluation using 239,616 CUDA cores in parallel.