Biblio

Filters: Author is Wu, Eugene
Krishnan, Sanjay, Franklin, Michael J., Goldberg, Ken, Wang, Jiannan, Wu, Eugene.  2016.  ActiveClean: An Interactive Data Cleaning Framework For Modern Machine Learning. Proceedings of the 2016 International Conference on Management of Data. :2117–2120.

Databases can be corrupted with various errors such as missing, incorrect, or inconsistent values. Increasingly, modern data analysis pipelines involve Machine Learning, and the effects of dirty data can be difficult to debug. Dirty data is often sparse, and naive sampling solutions are not suited for high-dimensional models. We propose ActiveClean, a progressive framework for training Machine Learning models with data cleaning. Our framework updates a model iteratively as the analyst cleans small batches of data, and includes numerous optimizations such as importance weighting and dirty data detection. We designed a visual interface to wrap around this framework and demonstrate ActiveClean for a video classification problem and a topic modeling problem.

Psallidas, Fotis, Wu, Eugene.  2018.  Demonstration of Smoke: A Deep Breath of Data-Intensive Lineage Applications. Proceedings of the 2018 International Conference on Management of Data. :1781–1784.

Data lineage is a fundamental type of information that describes the relationships between input and output data items in a workflow. As such, a large number of data-intensive applications with logic over the input-output relationships can be expressed declaratively in lineage terms. Unfortunately, many applications resort to hand-tuned implementations, either because lineage systems are not fast enough to meet their requirements or because lineage capabilities are not well known. Recently, we introduced a set of implementation design principles and associated techniques to optimize lineage-enabled database engines and realized them in our prototype database engine, namely, Smoke. In this demonstration, we showcase lineage as the building block across a variety of data-intensive applications, including tooltips and details on demand; crossfilter; and data profiling. In addition, we show how Smoke outperforms alternative lineage systems to meet or improve on existing hand-tuned implementations of these applications.

Psallidas, Fotis, Wu, Eugene.  2018.  Provenance for Interactive Visualizations. Proceedings of the Workshop on Human-In-the-Loop Data Analytics. :9:1–9:8.

We highlight the connections between data provenance and interactive visualizations. To do so, we first incrementally add interactions to a visualization and show how these interactions are readily expressible in terms of provenance. We then describe how an interactive visualization system that natively supports provenance can be easily extended with novel interactions.

Wang, Xiaolan, Meliou, Alexandra, Wu, Eugene.  2016.  QFix: Demonstrating Error Diagnosis in Query Histories. Proceedings of the 2016 International Conference on Management of Data. :2177–2180.

An increasing number of applications in all aspects of society rely on data. Despite the long line of research in data cleaning and repairs, data correctness has been an elusive goal. Errors in the data can be extremely disruptive, and are detrimental to the effectiveness and proper function of data-driven applications. Even when data is cleaned, new errors can be introduced by applications and users who interact with the data. Subsequent valid updates can obscure these errors and propagate them through the dataset, causing more discrepancies. Any discovered errors tend to be corrected superficially, on a case-by-case basis, further obscuring the true underlying cause and making detection of the remaining errors harder. In this demo proposal, we outline the design of QFix, a query-centric framework that derives explanations and repairs for discrepancies in relational data based on potential errors in the queries that operated on the data. This is a marked departure from traditional data-centric techniques that directly fix the data. We then describe how users will use QFix in a demonstration scenario. Participants will be able to select from a number of transactional benchmarks, introduce errors into the queries that are executed, and compare the fixes to the queries proposed by QFix as well as existing alternative algorithms such as decision trees.

Krishnan, Sanjay, Haas, Daniel, Franklin, Michael J., Wu, Eugene.  2016.  Towards Reliable Interactive Data Cleaning: A User Survey and Recommendations. Proceedings of the Workshop on Human-In-the-Loop Data Analytics. :9:1–9:5.

Data cleaning is frequently an iterative process tailored to the requirements of a specific analysis task. The design and implementation of iterative data cleaning tools present novel challenges, both technical and organizational, to the community. In this paper, we present results from a user survey (N = 29) of data analysts and infrastructure engineers from industry and academia. We highlight three important themes: (1) the iterative nature of data cleaning, (2) the lack of rigor in evaluating the correctness of data cleaning, and (3) the disconnect between the analysts who query the data and the infrastructure engineers who design the cleaning pipelines. We conclude by presenting a number of recommendations for future work in which we envision an interactive data cleaning system that accounts for the observed challenges.