Automatic Generation of Normalized Relational Schemas from Nested Key-Value Data

Title: Automatic Generation of Normalized Relational Schemas from Nested Key-Value Data
Publication Type: Conference Paper
Year of Publication: 2016
Authors: DiScala, Michael; Abadi, Daniel J.
Conference Name: Proceedings of the 2016 International Conference on Management of Data
Date Published: June 2016
Conference Location: New York, NY, USA
ISBN Number: 978-1-4503-3531-7
Keywords: Deduplication, denormalized data, entity extraction, functional dependencies, functional dependency mining, JSON, key-value data, normalization, pubcrawl170201, relational databases, schema extraction, schema generation, schema matching, semistructured data, semistructured-to-relational mappings

Self-describing key-value data formats such as JSON are becoming increasingly popular as application developers choose to avoid the rigidity imposed by the relational model. Database systems designed for these self-describing formats, such as MongoDB, encourage users to use denormalized, heavily nested data models so that relationships across records and other schema information need not be predefined or standardized. Such data models contribute to long-term development complexity, as their lack of explicit entity and relationship tracking burdens new developers unfamiliar with the dataset. Furthermore, the large amount of data repetition present in such data layouts can introduce update anomalies and poor scan performance, which reduce both the quality and performance of analytics over the data. In this paper we present an algorithm that automatically transforms the denormalized, nested data commonly found in NoSQL systems into traditional relational data that can be stored in a standard RDBMS. This process includes a schema generation algorithm that discovers relationships across the attributes of the denormalized datasets in order to organize those attributes into relational tables. It further includes a matching algorithm that discovers sets of attributes that represent overlapping entities and merges those sets together. These algorithms reduce data repetition, allow the use of data analysis tools targeted at relational data, accelerate scan-intensive algorithms over the data, and help users gain a semantic understanding of complex, nested datasets.
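
The abstract does not spell out the algorithm itself, so the following is only an illustrative sketch of the general idea behind functional dependency mining on nested key-value data, not the authors' method. It flattens nested records into dotted attribute names, tests every pair of attributes for a single-attribute functional dependency, and uses one discovered dependency to pull repeated customer data into its own deduplicated table. The helper names (`flatten`, `holds`, `single_attribute_fds`) and the sample records are hypothetical, and the sketch assumes every record has the same keys and only hashable leaf values.

```python
from itertools import permutations


def flatten(record, prefix=""):
    """Flatten nested dicts into one level, joining keys with dots."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, name + "."))
        else:
            flat[name] = value
    return flat


def holds(rows, lhs, rhs):
    """Return True if each value of lhs maps to exactly one value of rhs."""
    seen = {}
    for row in rows:
        # setdefault stores the first rhs value seen for this lhs value
        # and returns the stored one on later encounters.
        if seen.setdefault(row[lhs], row[rhs]) != row[rhs]:
            return False
    return True


def single_attribute_fds(rows):
    """List all dependencies A -> B that hold between single attributes."""
    attrs = sorted(rows[0])
    return [(a, b) for a, b in permutations(attrs, 2) if holds(rows, a, b)]


# Hypothetical denormalized documents: customer data repeats per order.
docs = [
    {"order_id": 1, "customer": {"id": 7, "name": "Ada"}},
    {"order_id": 2, "customer": {"id": 7, "name": "Ada"}},
    {"order_id": 3, "customer": {"id": 9, "name": "Lin"}},
]

rows = [flatten(d) for d in docs]
fds = single_attribute_fds(rows)

# customer.id -> customer.name holds, so the customer attributes can be
# moved to a separate table, leaving customer.id behind as a foreign key.
customers = {(r["customer.id"], r["customer.name"]) for r in rows}
orders = [(r["order_id"], r["customer.id"]) for r in rows]
```

In this toy dataset `customer.id` determines `customer.name` but not `order_id`, which is the signal that customers form a separate entity. A realistic approach would also need to handle arrays, missing keys, multi-attribute dependencies, and dependencies that hold only approximately.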

Citation Key: discala_automatic_2016