High-Quality Fault-Resiliency in Fat-Tree Networks (Extended Abstract)

Title: High-Quality Fault-Resiliency in Fat-Tree Networks (Extended Abstract)
Publication Type: Conference Paper
Year of Publication: 2019
Authors: Gliksberg, J., Capra, A., Louvet, A., García, P. J., Sohier, D.
Conference Name: 2019 IEEE Symposium on High-Performance Interconnects (HOTI)
Date Published: August
Keywords: A2A congestion, all-to-all traffic patterns, Clustering algorithms, computer networks, congestion risk, coupled congestion control, coupling regular topologies, Degradation, Dmodc, equipment failure, fabric management, Fabrics, fast deterministic routing algorithm, fat tree, Fat-tree networks, Fault resiliency, fault tolerant computing, forwarding tables, high-quality fault-resiliency, high-quality routing tables, HPC, HPC systems, InfiniBand control software, interconnection networks, large-scale HPC clusters, massive topology degradation, modulo-based computation, near-optimal SP congestion risk, Network topology, OpenSM, optimisation, optimized routing algorithms, Parallel Generalized Fat-Trees, parallel processing, PGFT, pre-modulo division, pubcrawl, random degradation, random permutation, resilience, Resiliency, Routing, routing tables, RP congestion, Runtime, Scalability, shift permutation, static analysis, telecommunication network routing, telecommunication network topology, telecommunication traffic, Topology, trees (mathematics)
Abstract: Coupling regular topologies with optimized routing algorithms is key in pushing the performance of interconnection networks of HPC systems. In this paper we present Dmodc, a fast deterministic routing algorithm for Parallel Generalized Fat-Trees (PGFTs) which minimizes congestion risk even under massive topology degradation caused by equipment failure. It applies a modulo-based computation of forwarding tables among switches closer to the destination, using only knowledge of subtrees for pre-modulo division. Dmodc allows complete re-routing of topologies with tens of thousands of nodes in less than a second, which greatly helps centralized fabric management react to faults with high-quality routing tables and no impact to running applications in current and future very large-scale HPC clusters. We compare Dmodc against routing algorithms available in the InfiniBand control software (OpenSM) first for routing execution time to show feasibility at scale, and then for congestion risk under degradation to demonstrate robustness. The latter comparison is done using static analysis of routing tables under random permutation (RP), shift permutation (SP) and all-to-all (A2A) traffic patterns. Results for Dmodc show A2A and RP congestion risks similar under heavy degradation as the most stable algorithms compared, and near-optimal SP congestion risk up to 1% of random degradation.
Citation Key: gliksberg_high-quality_2019
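
Illustrative sketch (not from the paper): the abstract mentions modulo-based forwarding and static analysis of congestion risk under shift permutation. The Python below is a toy version of both ideas, and it is not Dmodc itself. It assumes a hypothetical two-level fat-tree in which every leaf switch hosts `k` nodes and has one uplink to each spine, picks the spine by destination modulo the number of surviving spines, and reports the largest number of flows sharing any link over all shift permutations. The names `route`, `shift_congestion_risk` and `alive_spines` are made up for this example.

```python
from collections import Counter


def route(src, dst, k, alive_spines):
    """Return the directed links a flow uses in a toy 2-level fat-tree.

    Assumption: node n sits under leaf n // k, and every leaf has one
    uplink to every spine listed in alive_spines.
    """
    src_leaf, dst_leaf = src // k, dst // k
    if src_leaf == dst_leaf:
        return []  # traffic stays under the same leaf switch
    # Destination-modulo choice among the spines that are still alive.
    spine = alive_spines[dst % len(alive_spines)]
    return [("up", src_leaf, spine), ("down", spine, dst_leaf)]


def shift_congestion_risk(num_leaves, k, alive_spines):
    """Worst per-link flow count over every shift permutation."""
    n = num_leaves * k
    worst = 0
    for shift in range(1, n):
        load = Counter()
        for src in range(n):
            for link in route(src, (src + shift) % n, k, alive_spines):
                load[link] += 1
        worst = max(worst, max(load.values(), default=0))
    return worst


healthy = shift_congestion_risk(4, 4, [0, 1, 2, 3])
degraded = shift_congestion_risk(4, 4, [0, 1, 3])  # spine 2 removed
print(healthy, degraded)
```

In this toy model the healthy fabric has a shift-permutation risk of 1 (no link is shared), while removing a spine forces at least two flows onto some link. The real algorithm works on general PGFTs and uses subtree knowledge that this sketch does not model.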