Raw Cardinality Information Discovery for Big Datasets

Title: Raw Cardinality Information Discovery for Big Datasets
Publication Type: Conference Paper
Year of Publication: 2019
Authors: Kumar, S., Vasthimal, D. K.
Conference Name: 2019 IEEE 5th Intl Conference on Big Data Security on Cloud (BigDataSecurity), IEEE Intl Conference on High Performance and Smart Computing, (HPSC) and IEEE Intl Conference on Intelligent Data and Security (IDS)
Keywords: back-end Big Data systems, Big Data, Big Data sets, Cardinality, cloud computing, compositionality, Conferences, Data analysis, data mining, data separation, elastic, elasticsearch, events, grafana, Hadoop, HDFS, Java, logs, Map Reduce, Measurement, metadata, Metadata Discovery Problem, Metrics, Microsoft Windows, Monitoring, parallel discovery data store infrastructure, Pipelines, pubcrawl, query processing, raw cardinality information discovery, resilience, Resiliency, rocksdb, Scalability, search queries, Topology
Abstract: Real-time discovery of all the different types of unique attributes within unstructured data is a challenging problem when dealing with multiple petabytes of unstructured data every day. Popular discovery solutions, such as creating offline jobs to uniquely identify attributes or running aggregation queries on raw data sets, limit real-time discovery use cases and often result in poor resource utilization. The discovery of this information must be treated as a problem parallel to that of simply storing raw data sets efficiently on back-end big data systems. Solving the discovery problem by creating a parallel discovery data store infrastructure has multiple benefits, as it allows the actual search queries against the raw data set to be channeled in a much more funneled manner instead of being spread across the entire data set. Such focused search queries and data separation are far more performant and require a smaller compute and memory footprint.
Citation Key: kumar_raw_2019