Provenance

SoS Newsletter


Provenance refers to information about the origin and activities of system data and processes. With the growth of shared services and systems, including social media, cloud computing, and service-oriented architectures, finding tamperproof methods for tracking files is a major challenge. Research into the security of software of unknown provenance (SOUP) is also included. The works cited here were presented between January and August 2014.
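To make the idea of tamper-resistant tracking concrete, here is a minimal sketch of one widely used technique, a hash-chained provenance log. It is not taken from any of the works cited below; the record fields (`actor`, `action`, `target`, `prev`) and the function names are assumptions chosen for illustration only.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first record (assumed convention)


def record_hash(body: dict) -> str:
    """Hash a record body deterministically (sorted keys, compact separators)."""
    encoded = json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def append_event(log: list, actor: str, action: str, target: str) -> dict:
    """Append a provenance event that commits to the previous record's hash."""
    prev = log[-1]["hash"] if log else GENESIS
    body = {"actor": actor, "action": action, "target": target, "prev": prev}
    record = dict(body, hash=record_hash(body))
    log.append(record)
    return record


def verify(log: list) -> bool:
    """Return True only if every record matches its hash and links to its predecessor."""
    prev = GENESIS
    for record in log:
        body = {k: v for k, v in record.items() if k != "hash"}
        if body.get("prev") != prev or record_hash(body) != record.get("hash"):
            return False
        prev = record["hash"]
    return True


log = []
append_event(log, "alice", "create", "report.csv")
append_event(log, "bob", "modify", "report.csv")
print(verify(log))  # True: the chain is intact

log[0]["actor"] = "mallory"  # tamper with an earlier record
print(verify(log))  # False: the stored hash no longer matches the record
```

A chain like this is tamper-evident rather than tamperproof: anyone able to rewrite the entire log can recompute every hash, so a practical deployment would also need to anchor the latest hash somewhere the writer cannot alter, for example with a signature or a trusted timestamp.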

  • Rezvani, M.; Ignjatovic, A.; Bertino, E.; Jha, S., "Provenance-aware Security Risk Analysis For Hosts And Network Flows," Network Operations and Management Symposium (NOMS), 2014 IEEE, vol., no., pp. 1, 8, 5-9 May 2014. doi: 10.1109/NOMS.2014.6838250 Detection of high risk network flows and high risk hosts is becoming ever more important and more challenging. In order to selectively apply deep packet inspection (DPI) one has to isolate in real time high risk network activities within a huge number of monitored network flows. To help address this problem, we propose an iterative methodology for a simultaneous assessment of risk scores for both hosts and network flows. The proposed approach measures the risk scores of hosts and flows in an interdependent manner; thus, the risk score of a flow influences the risk score of its source and destination hosts, and also the risk score of a host is evaluated by taking into account the risk scores of flows initiated by or terminated at the host. Our experimental results show that such an approach is not only effective in detecting high risk hosts and flows but, when deployed in high throughput networks, is also more efficient than PageRank based algorithms.
    Keywords: computer network security; risk analysis; deep packet inspection; high risk hosts; high risk network flows; provenance aware security risk analysis; risk score; Computational modeling; Educational institutions; Iterative methods; Monitoring; Ports (Computers); Risk management; Security (ID#:14-3023)
  • Beserra Sousa, R.; Cintra Cugler, D.; Gonzales Malaverri, J.E.; Bauzer Medeiros, C., "A Provenance-Based Approach To Manage Long Term Preservation Of Scientific Data," Data Engineering Workshops (ICDEW), 2014 IEEE 30th International Conference on , vol., no., pp.162,133, March 31 2014-April 4 2014. doi: 10.1109/ICDEW.2014.6818316 Long term preservation of scientific data goes beyond the data, and extends to metadata preservation and curation. While several researchers emphasize curation processes, our work is geared towards assessing the quality of scientific (meta)data. The rationale behind this strategy is that scientific data are often accessible via metadata - and thus ensuring metadata quality is a means to provide long term accessibility. This paper discusses our quality assessment architecture, presenting a case study on animal sound recording metadata. Our case study is an example of the importance of periodically assessing (meta)data quality, since knowledge about the world may evolve, and quality decrease with time, hampering long term preservation.
    Keywords: data handling; meta data; animal sound recording metadata; long term scientific data preservation management; metadata curation process; metadata preservation; metadata quality; provenance-based approach; quality assessment architecture; Animals; Biodiversity; Computer architecture; Data models; Measurement; Quality assessment; Software (ID#:14-3024)
  • Rodes, B.D.; Knight, J.C., "Speculative Software Modification and its Use in Securing SOUP," Dependable Computing Conference (EDCC), 2014 Tenth European, vol., no., pp.210,221, 13-16 May 2014. doi: 10.1109/EDCC.2014.29 We present an engineering process model for generating software modifications that is designed to be used when either most or all development artifacts about the software, including the source code, are unavailable. This kind of software, commonly called Software Of Unknown Provenance (SOUP), raises many doubts about the existence and adequacy of desired dependability properties, for example security. These doubts motivate some users to apply modifications to enhance dependability properties of the software; however, without the necessary development artifacts, modifications are made in a state of uncertainty and risk. We investigate enhancing dependability through software modification in the presence of these risks as an engineering problem and introduce an engineering process for generating software modifications called Speculative Software Modification (SSM). We present the motivation and guiding principles of SSM, and a case study of SSM applied to protect software against buffer overflow attacks when only the binary is available.
    Keywords: security of data; software reliability; source code (software); SOUP security; SSM; software dependability property; software development artifacts; software engineering process model; software of unknown provenance; source code; speculative software modification; Complexity theory; Hardware; Maintenance engineering; Measurement; Security; Software; Uncertainty; Assurance Case; Security; Software Modification; Software Of Unknown Provenance (SOUP) (ID#:14-3025)
  • Dong Wang; Amin, M.T.; Shen Li; Abdelzaher, T.; Kaplan, L.; Siyu Gu; Chenji Pan; Liu, H.; Aggarwal, C.C.; Ganti, R.; Xinlei Wang; Mohapatra, P.; Szymanski, B.; Hieu Le, "Using Humans As Sensors: An Estimation-Theoretic Perspective," Information Processing in Sensor Networks, IPSN-14 Proceedings of the 13th International Symposium on , vol., no., pp.35,46, 15-17 April 2014. doi: 10.1109/IPSN.2014.6846739 The explosive growth in social network content suggests that the largest "sensor network" yet might be human. Extending the participatory sensing model, this paper explores the prospect of utilizing social networks as sensor networks, which gives rise to an interesting reliable sensing problem. In this problem, individuals are represented by sensors (data sources) who occasionally make observations about the physical world. These observations may be true or false, and hence are viewed as binary claims. The reliable sensing problem is to determine the correctness of reported observations. From a networked sensing standpoint, what makes this sensing problem formulation different is that, in the case of human participants, not only is the reliability of sources usually unknown but also the original data provenance may be uncertain. Individuals may report observations made by others as their own. The contribution of this paper lies in developing a model that considers the impact of such information sharing on the analytical foundations of reliable sensing, and embeds it into a tool called Apollo that uses Twitter as a "sensor network" for observing events in the physical world. Evaluation, using Twitter-based case-studies, shows good correspondence between observations deemed correct by Apollo and ground truth.
    Keywords: Internet; estimation theory; sensors; social networking (online); Apollo; Twitter-based case-studies; estimation-theoretic perspective; humans; information sharing; largest sensor network; networked sensing standpoint; participatory sensing model; reliable sensing problem; sensing problem formulation; sensors; social network content; Computer network reliability; Maximum likelihood estimation; Reliability; Sensors; Silicon; Twitter; data reliability; expectation maximization; humans as sensors; maximum likelihood estimation; social sensing; uncertain data provenance (ID#:14-3026)
  • He, L.; Yue, P.; Di, L.; Zhang, M.; Hu, L., "Adding Geospatial Data Provenance into SDI--A Service-Oriented Approach," Selected Topics in Applied Earth Observations and Remote Sensing, IEEE Journal of, vol. PP, no.99, pp.1, 11, August 2014. doi: 10.1109/JSTARS.2014.2340737 Geospatial data provenance records the derivation history of a geospatial data product. It is important in evaluating the quality of data products. In a Geospatial Web Service environment where data are often disseminated and processed widely and frequently in an unpredictable way, it is even more important in identifying original data sources, tracing workflows, updating or reproducing scientific results, and evaluating reliability and quality of geospatial data products. Geospatial data provenance has become a fundamental issue in establishing the spatial data infrastructure (SDI). This paper investigates how to support provenance awareness in SDI. It addresses key issues including provenance modeling, capturing, and sharing in a SDI enabled by interoperable geospatial services. A reference architecture for provenance tracking is proposed, which can accommodate geospatial feature provenance at different levels of granularity. Open standards from ISO, World Wide Web Consortium (W3C), and OGC are leveraged to facilitate the interoperability. At the feature type level, this paper proposes extensions of W3C PROV-XML for ISO 19115 lineage and "Parent Level" provenance registration in the geospatial catalog service. At the feature instance level, light-weight lineage information entities for feature provenance are proposed and managed by Web Feature Services. Experiments demonstrate the applicability of the approach for creating provenance awareness in an interoperable geospatial service-oriented environment.
    Keywords: Catalogs; Geospatial analysis; ISO standards; Interoperability; Remote sensing; Web services; Geoprocessing workflow; Geospatial Web Service; ISO 19115 lineage; World Wide Web Consortium (W3C) PROV; geospatial data provenance; spatial data infrastructure (ID#:14-3027)
  • Zerva, P.; Zschaler, S.; Miles, S., "A Provenance Model of Composite Services in Service-Oriented Environments," Service Oriented System Engineering (SOSE), 2014 IEEE 8th International Symposium on, pp.1, 12, 7-11 April 2014. doi: 10.1109/SOSE.2014.8 Provenance awareness adds a new dimension to the engineering of service-oriented systems, requiring them to be able to answer questions about the provenance of any data produced. This need is even more evident where atomic services are aggregated into added-value composite services to be delivered with certain non-functional characteristics. Prior work in the area of provenance for service-oriented systems has primarily focused on the collection and storage infrastructure required for answering provenance questions. In contrast, in this paper we study the structure of the data thus collected considering the service's infrastructure as a whole and how this affects provenance collection for answering different types of provenance questions. In particular, we define an extension of W3C's PROV ontological model with concepts that can be used to express the provenance of how services were discovered, selected, aggregated and executed. We demonstrate the conceptual adequacy of our model by reasoning over provenance instances for a composite service scenario.
    Keywords: data structures; ontologies (artificial intelligence); service-oriented architecture; W3C PROV ontological model; added-value composite services; atomic services; collection infrastructure; conceptual adequacy; data structure; nonfunctional characteristics; provenance awareness; service-oriented environments; storage infrastructure; Data models; Informatics; Ontologies; Protocols; Servers; Service-oriented architecture; ontology; provenance model; service composition; service-oriented systems (ID#:14-3028)
  • Imran, A.; Nahar, N.; Sakib, K., "Watchword-oriented and Time-Stamped Algorithms For Tamper-Proof Cloud Provenance Cognition," Informatics, Electronics & Vision (ICIEV), 2014 International Conference on, vol., no., pp.1,6, 23-24 May 2014. doi: 10.1109/ICIEV.2014.6850747 Provenance is derivative journal information about the origin and activities of system data and processes. For a highly dynamic system like the cloud, provenance can be accurately detected and securely used in cloud digital forensic investigation activities. This paper proposes a watchword-oriented provenance cognition algorithm for the cloud environment. Additionally, a time-stamp based buffer verifying algorithm is proposed for securing access to the detected cloud provenance. Performance analysis of the novel algorithms proposed here yields a desirable detection rate of 89.33% and miss rate of 8.66%. The securing algorithm successfully rejects 64% of malicious requests, yielding a cumulative frequency of 21.43 for MR.
    Keywords: cloud computing; digital forensics; formal verification; software performance evaluation; cloud digital forensic investigation activities; cloud security; derivative journal information; detection rate; miss rate; performance analysis; system data; system processes; tamper-proof cloud provenance cognition; time-stamp based buffer verifying algorithm; watchword oriented provenance cognition algorithm; Cloud computing; Cognition; Encryption; Informatics; Software as a service; Cloud computing; cloud security; empirical evaluation; provenance detection (ID#:14-3029)
  • Dong Dai; Yong Chen; Kimpe, D.; Ross, R., "Provenance-Based Prediction Scheme for Object Storage System in HPC," Cluster, Cloud and Grid Computing (CCGrid), 2014 14th IEEE/ACM International Symposium on, pp.550,551, 26-29 May 2014. doi: 10.1109/CCGrid.2014.27 The object-based storage model has recently been widely adopted in both industry and academia to support increasingly data-intensive applications in high-performance computing. However, the I/O prediction strategies that have proven effective in traditional parallel file systems have not been thoroughly studied under this new object-based storage model. Object storage introduces new challenges that prevent traditional prediction systems from working properly. In this paper, we propose a new I/O access prediction system based on provenance analysis on both applications and objects. We argue that the provenance, which contains metadata that describes the history of data, reveals the detailed information about applications and data sets, which can be used to capture the system status and provide accurate I/O prediction efficiently. Our current evaluations based on real-world trace data (Darshan datasets) simulation also confirm that the provenance-based prediction system is able to provide accurate predictions for object storage systems.
    Keywords: meta data; object-oriented databases; parallel processing; storage management; Darshan datasets simulation; HPC; I/O access prediction system; I/O prediction strategy; data intensive application; high-performance computing; metadata; object storage system; object-based storage model; parallel file system; provenance analysis; provenance-based prediction scheme; provenance-based prediction system; real-world trace data; Accuracy; Algorithm design and analysis; Buildings; Clustering algorithms; Computer architecture; History; Prediction algorithms; I/O Prediction; Object Storage; Provenance (ID#:14-3030)
  • De Souza, L.; Marcon Gomes Vaz, M.S.; Sfair Sunye, M., "Modular Development of Ontologies for Provenance in Detrending Time Series," Information Technology: New Generations (ITNG), 2014 11th International Conference on, vol., no., pp.567, 572, 7-9 April 2014. doi: 10.1109/ITNG.2014.106 Scientific knowledge in many areas is obtained from time series analysis, which is usually done in two phases: preprocessing and data analysis. Trend extraction (detrending) is an important step of the preprocessing phase, in which many detrending software packages using different statistical methods can be applied to the same time series to correct them. In this context, knowledge about the time series data helps the researcher choose appropriate statistical methods. Knowledge about how and how often the time series were corrected is also essential for choosing detrending methods that yield better results. This knowledge is not always explicit and easy to interpret. Provenance expressed with Web Ontology Language (OWL) ontologies helps the researcher gain knowledge about the data and the processes executed. Provenance information shows how data were detrended, improving decision making and contributing to the generation of scientific knowledge. The main contribution of this paper is the modular development of ontologies combined with the Open Provenance Model (OPM), which is extended to facilitate understanding of how detrending processes were executed on time series data, semantically enriching the preprocessing phase of time series analysis.
    Keywords: data analysis; decision making; knowledge representation languages; ontologies (artificial intelligence); time series; OPM; OWL ontologies; Web Ontology Language; decision making; detrending software; open provenance model; preprocessing phase; provenance information; scientific knowledge; scientific knowledge generation; time series data analysis; trend extraction; Analytical models; Market research; OWL; Ontologies; Semantics; Statistical analysis; Time series analysis; OWL; modules; provenance model; time series analysis; trend extraction (ID#:14-3031)
  • Hamadache, K.; Zerva, P., "Provenance of Feedback in Cloud Services," Service Oriented System Engineering (SOSE), 2014 IEEE 8th International Symposium on, vol., no., pp.23,34, 7-11 April 2014. doi: 10.1109/SOSE.2014.10 With the fast adoption of Services Computing, driven even further by the emergence of the Cloud, the need to ensure accountability for quality of service (QoS) for service-based systems/services has reached a critical level. This need has triggered extensive research in the fields of trust, reputation and provenance. Most research on trust and reputation has focused on their evaluation or computation. Research on provenance has tried to track how a service processed and produced data during its execution. Although some studies have investigated credibility models and mechanisms, only a few have looked into the way reputation information is produced. In this paper we propose an innovative design for the evaluation of feedback authenticity and credibility by considering the feedback's provenance. This consideration brings a new level of security and trust to Services Computing by countering malicious feedback and reducing the impact of irrelevant feedback.
    Keywords: cloud computing; trusted computing; QoS; cloud services; credibility models; feedback authenticity; feedback credibility; feedback provenance innovative design; malicious feedback; quality of service; reputation information; security; service-based systems/services; services computing; trust; Context; Hospitals; Monitoring; Ontologies; Quality of service; Reliability; Schedules; cloud computing; credibility; feedback; provenance; reputation (ID#:14-3032)
  • Dong Wang; Al Amin, M.T.; Abdelzaher, T.; Roth, D.; Voss, C.R.; Kaplan, L.M.; Tratz, S.; Laoudi, J.; Briesch, D., "Provenance-Assisted Classification in Social Networks," Selected Topics in Signal Processing, IEEE Journal of, vol.8, no.4, pp.624,637, Aug. 2014. doi: 10.1109/JSTSP.2014.2311586 Signal feature extraction and classification are two common tasks in the signal processing literature. This paper investigates the use of source identities as a common mechanism for enhancing the classification accuracy of social signals. We define social signals as outputs, such as microblog entries, geotags, or uploaded images, contributed by users in a social network. Many classification tasks can be defined on such outputs. For example, one may want to identify the dialect of a microblog contributed by an author, or classify information referred to in a user's tweet as true or false. While the design of such classifiers is application-specific, social signals share in common one key property: they are augmented by the explicit identity of the source. This motivates investigating whether or not knowing the source of each signal (in addition to exploiting signal features) allows the classification accuracy to be improved. We call it provenance-assisted classification. This paper answers the above question affirmatively, demonstrating how source identities can improve classification accuracy, and derives confidence bounds to quantify the accuracy of results. Evaluation is performed in two real-world contexts: (i) fact-finding that classifies microblog entries into true and false, and (ii) language classification of tweets issued by a set of possibly multi-lingual speakers. We also carry out extensive simulation experiments to further evaluate the performance of the proposed classification scheme over different problem dimensions. The results show that provenance features significantly improve classification accuracy of social signals, even when no information is known about the sources (besides their ID). This observation offers a general mechanism for enhancing classification results in social networks.
    Keywords: computational linguistics; feature extraction; maximum likelihood estimation; pattern classification; social networking (online); application-specific classifiers; maximum likelihood estimation; microblog; multilingual speakers; provenance-assisted classification; signal classification; signal feature extraction; signal processing; social network; social signals; tweet language classification; Accuracy; Equations; Mathematical model; Maximum likelihood estimation; Signal processing algorithms; Social network services; Social signals; classification; expectation maximization; maximum likelihood estimation; signal feature extraction; uncertain provenance (ID#:14-3033)
  • Gray, A.J.G., "Dataset Descriptions for Linked Data Systems," Internet Computing, IEEE, vol.18, no.4, pp.66,69, July-Aug. 2014. doi: 10.1109/MIC.2014.66 Linked data systems rely on the quality of, and linking between, their data sources. However, existing data is difficult to trace to its origin and provides no provenance for links. This article discusses the need for self-describing linked data.
    Keywords: data handling; data sources quality; dataset descriptions; linked data systems; self-describing linked data; Data systems; Electronic mail; Facsimile; Heating; Resource description framework; Vocabulary; data publishing; dataset descriptions; linked data; provenance (ID#:14-3034)
  • Jain, R.; Prabhakar, S., "Guaranteed Authenticity And Integrity Of Data From Untrusted Servers," Data Engineering (ICDE), 2014 IEEE 30th International Conference on, vol., no., pp.1282, 1285, March 31 2014-April 4 2014. doi: 10.1109/ICDE.2014.6816761 Data are often stored at untrusted database servers. The lack of trust arises naturally when the database server is owned by a third party, as in the case of cloud computing. It also arises if the server may have been compromised, or there is a malicious insider. Ensuring the trustworthiness of data retrieved from such untrusted database is of utmost importance. Trustworthiness of data is defined by faithful execution of valid and authorized transactions on the initial data. Earlier work on this problem is limited to cases where data are either not updated, or data are updated by a single trustworthy entity. However, for a truly dynamic database, multiple clients should be allowed to update data without having to route the updates through a central server. In this demonstration, we present a system to establish authenticity and integrity of data in a dynamic database where the clients can run transactions directly on the database server. Our system provides provable authenticity and integrity of data with absolutely no requirement for the server to be trustworthy. Our system also provides assured provenance of data. This demonstration is built using the solutions proposed in our previous work. Our system is built on top of Oracle with no modifications to the database internals. We show that the system can be easily adopted in existing databases without any internal changes to the database. We also demonstrate how our system can provide authentic provenance.
    Keywords: data integrity; database management systems; trusted computing; Oracle; cloud computing; data authenticity; data integrity; data provenance; data transactions; data trustworthiness; database internals; database servers; dynamic database; malicious insider; trustworthy entity; Cloud computing; Hardware; Indexes; Protocols; Servers (ID#:14-3035)


Articles listed on these pages have been found on publicly available internet pages and are cited with links to those pages. Some of the information included herein has been reprinted with permission from the authors or data repositories. Direct any requests via Email to SoS.Project (at) for removal of the links or modifications to specific citations. Please include the ID# of the specific citation in your correspondence.