Visible to the public Overlap Set Similarity Joins with Theoretical Guarantees

TitleOverlap Set Similarity Joins with Theoretical Guarantees
Publication TypeConference Paper
Year of Publication2018
AuthorsDeng, Dong, Tao, Yufei, Li, Guoliang
Conference NameProceedings of the 2018 International Conference on Management of Data
ISBN Number978-1-4503-4703-7
Keywordscomposability, compositionality, overlap, pubcrawl, scalable, set, similarity join, sub-quadratic, theoretical cryptography, theoretical guarantee
AbstractThis paper studies the set similarity join problem with overlap constraints which, given two collections of sets and a constant c, finds all the set pairs in the datasets that share at least c common elements. This is a fundamental operation in many fields, such as information retrieval, data mining, and machine learning. The time complexity of all existing methods is O(n2) where n is the total size of all the sets. In this paper, we present a size-aware algorithm with the time complexity of O(n2-over 1 c k1 over 2c)=o(n2)+O(k), where k is the number of results. The size-aware algorithm divides all the sets into small and large ones based on their sizes and processes them separately. We can use existing methods to process the large sets and focus on the small sets in this paper. We develop several optimization heuristics for the small sets to improve the practical performance significantly. As the size boundary between the small sets and the large sets is crucial to the efficiency, we propose an effective size boundary selection algorithm to judiciously choose an appropriate size boundary, which works very well in practice. Experimental results on real-world datasets show that our methods achieve high performance and outperform the state-of-the-art approaches by up to an order of magnitude.
Citation Keydeng_overlap_2018