Performance challenges for heterogeneous distributed tensor decompositions

Title: Performance challenges for heterogeneous distributed tensor decompositions
Publication Type: Conference Paper
Year of Publication: 2017
Authors: Rolinger, T. B., Simon, T. A., Krieger, C. D.
Conference Name: 2017 IEEE High Performance Extreme Computing Conference (HPEC)
Date Published: Sep
ISBN Number: 978-1-5386-3472-1
Keywords: alternating least squares fitting, canonical decomposition, compositionality, CP-ALS, cuSPARSE library, decomposition, DeFacTo, distributed memory, distributed memory systems, GPU, graphics processing units, heterogeneous distributed tensor decompositions, large-scale data analytics, Least squares approximations, Libraries, math kernel library, Matrix decomposition, matrix multiplication, message passing, Metrics, MPI, multi-threading, multidimensional arrays, OpenMP threads, parallel factorization, parallel processing, parallel programming, parallelization, performance evaluation, pubcrawl, ReFacTo, Signal processing algorithms, Sparse matrices, sparse matrix-vector multiplications, SpMV, Tensile stress, tensor decomposition, tensors

Tensor decompositions, which are factorizations of multi-dimensional arrays, are becoming increasingly important in large-scale data analytics. A popular tensor decomposition algorithm is Canonical Decomposition/Parallel Factorization using alternating least squares fitting (CP-ALS). Tensors that model real-world applications are often very large and sparse, driving the need for high-performance implementations of decomposition algorithms, such as CP-ALS, that can take advantage of many types of compute resources. In this work, we present ReFacTo, a heterogeneous distributed tensor decomposition implementation based on DFacTo, an existing distributed-memory approach to CP-ALS. DFacTo reduces the critical routine of CP-ALS to a series of sparse matrix-vector multiplications (SpMVs). ReFacTo leverages GPUs within a cluster via MPI to perform these SpMVs and uses OpenMP threads to parallelize other routines. We evaluate the performance of ReFacTo when using NVIDIA's GPU-based cuSPARSE library and compare it to an alternative implementation that uses Intel's CPU-based Math Kernel Library (MKL) for the SpMV. Furthermore, we discuss the performance challenges of heterogeneous distributed tensor decompositions based on the results we observed. We find that on up to 32 nodes, the SpMV of ReFacTo is up to 6.8x faster with MKL than with cuSPARSE.
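
The reduction to SpMVs mentioned in the abstract can be illustrated with a small sketch. It is not the authors' code and uses no MPI, cuSPARSE, or MKL. It assumes a 3-way sparse tensor stored as a list of (i, j, k, value) nonzeros, and it assumes the critical routine is the per-column product m[i] = sum over j, k of X[i, j, k] * b[j] * c[k], computed here as two consecutive sparse matrix-vector products. The names `spmv`, `mttkrp_column`, and `mttkrp_column_direct` are hypothetical helpers introduced only for this illustration.

```python
from collections import defaultdict


def spmv(rows, x):
    """Sparse matrix-vector product.

    rows maps a row key to a list of (column, value) pairs;
    x is a dense vector indexed by column.
    """
    return {r: sum(v * x[c] for c, v in entries) for r, entries in rows.items()}


def mttkrp_column(nonzeros, b, c, num_rows):
    """One output column computed as two SpMVs (illustrative sketch only)."""
    # SpMV 1: a sparse matrix with one row per (i, k) pair and one
    # column per j, multiplied by the factor column b.
    fibers = defaultdict(list)
    for i, j, k, val in nonzeros:
        fibers[(i, k)].append((j, val))
    partial = spmv(fibers, b)

    # Reshape the result into a sparse matrix with rows i and columns k.
    reshaped = defaultdict(list)
    for (i, k), s in partial.items():
        reshaped[i].append((k, s))

    # SpMV 2: multiply the reshaped matrix by the factor column c.
    col = spmv(reshaped, c)
    return [col.get(i, 0.0) for i in range(num_rows)]


def mttkrp_column_direct(nonzeros, b, c, num_rows):
    """Reference: accumulate val * b[j] * c[k] for every nonzero."""
    out = [0.0] * num_rows
    for i, j, k, val in nonzeros:
        out[i] += val * b[j] * c[k]
    return out


# Tiny 2 x 2 x 2 example tensor with five nonzeros (made-up values).
X = [(0, 0, 0, 1.0), (0, 1, 1, 2.0), (1, 0, 1, 3.0), (1, 1, 0, 4.0), (1, 1, 1, 5.0)]
b = [1.0, 2.0]
c = [3.0, 4.0]
two_step = mttkrp_column(X, b, c, num_rows=2)
direct = mttkrp_column_direct(X, b, c, num_rows=2)
```

In this sketch both routes give the same column. In a heterogeneous setting of the kind the abstract describes, the two `spmv` calls are the parts that would be handed to a tuned library, while the reshaping and bookkeeping stay on the host.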

Citation Key: rolinger_performance_2017