Hyperscalers are currently reporting frequent silent data corruptions (SDCs), also known as silent errors or corrupt execution errors (CEEs), in their cloud fleets, caused by silicon manufacturing defects. Remarkably, SDCs at-scale are exhibiting error occurrence rates on the order of one fault in a thousand devices. Meanwhile, hardware manufacturers strive for, and intend to achieve, one hundred and close to zero defective parts per million for the commercial and automotive domains, respectively. With hundreds of thousands of servers in a large-scale infrastructure, featuring millions of hardware devices (e.g., motherboards, CPUs, DIMMs, GPUs, hardware accelerators, NICs, HDDs, flash drives, interconnect modules), these unprecedented error rates mean there is a non-negligible probability that SDCs will propagate to and impact system-level applications. Unfortunately, state-of-the-art fault injection studies, and the software resiliency and fault tolerance approaches they support, assume that SDCs are a one-in-a-million occurrence. In general, naïvely scaling existing SDC detection and mitigation techniques to compensate for error rates that are at least an order of magnitude higher is not viable from a performance or efficiency perspective. With no existing solutions ready to substitute, the challenge of SDCs at-scale calls for innovation spanning the entire hardware-software stack to guarantee high assurance for the important tasks to which we entrust computers today.
The aim of the 2023 NSF Workshop on Silent Data Corruption is to facilitate cross-disciplinary interactions among participants across the technical disciplines of hardware circuits, computer architecture, systems (including networking, operating systems, and distributed systems), and theory, as well as participants from key industry and government stakeholders. The goals are to propose novel solutions for post-manufacturing hardware testing, runtime error detection, software resilience and fault tolerance, and system security, and to spearhead research on SDCs at-scale. Over the course of the two-day workshop, participants will contribute position papers, give live presentations, and participate in breakout discussions on the topics raised in the proposal for this workshop and beyond. The core deliverable to NSF will be a final report detailing key research questions and directions which, if addressed, offer to alleviate the challenges imposed by SDCs at-scale. As a secondary deliverable, we will work with industry participants to identify and provide access to infrastructure and data that can support the identified academic research efforts.
Sponsored by National Science Foundation Awards 2017863 and 2010810
Organizers
PROGRAM CO-CHAIRS
CAROLINE TRIPPEL is an Assistant Professor in the Computer Science and Electrical Engineering Departments at Stanford University. Prior to starting at Stanford, she spent nine months as a Research Scientist at Facebook in the FAIR SysML group. Her research interests are in the area of computer architecture, with a focus on promoting correctness and security as first-order computer systems design metrics (akin to performance and power). A central theme of Caroline's work is leveraging formal methods techniques to design and verify hardware systems in order to ensure that they can provide correctness and security guarantees for the applications they intend to support. She has recently been exploring the role of architecture in enabling privacy-preserving machine learning; the role of machine learning in hardware systems optimizations, particularly in the context of neural recommendation; and opportunities for improving datacenter and at-scale machine learning reliability. Her research has influenced the design of the RISC-V ISA memory consistency model, both via her formal analysis of its draft specification and her subsequent participation in the RISC-V Memory Model Task Group. Additionally, her work produced a novel methodology and tool that synthesized two new variants of the now-famous Meltdown and Spectre attacks. Caroline's research has been recognized with IEEE Top Picks distinctions, the 2020 ACM SIGARCH/IEEE CS TCCA Outstanding Dissertation Award, and the 2020 CGS/ProQuest® Distinguished Dissertation Award in Mathematics, Physical Sciences, & Engineering. She was also awarded an NVIDIA Graduate Fellowship (2017-2018) and selected to attend the 2018 MIT Rising Stars in EECS Workshop. She completed her PhD in Computer Science at Princeton University and her BS in Computer Engineering at Purdue University.
BARIS KASIKCI is a Morris Wellman Assistant Professor in the Electrical Engineering and Computer Science Department at the University of Michigan (since Sep. 2017). His research focuses on building efficient and trustworthy computer systems. He builds techniques to improve the efficiency of datacenter applications, provide systems support for heterogeneous computing platforms, analyze and fix failures, and improve the security of modern hardware. Building efficient and trustworthy systems requires a combination of approaches, so his work draws insights from a broad set of disciplines such as systems, computer architecture, and programming languages. Baris is the recipient of an NSF CAREER award, a Microsoft Research Faculty Fellowship, an Intel Rising Star Award, a VMware Early Career Faculty Grant, a Google Faculty Award, and multiple Google and Intel Awards. Baris received the 2016 Roger Needham PhD Award for the best PhD thesis in computer systems in Europe and the 2016 Patrick Denantes Memorial Prize for best PhD thesis in the Department of Information and Communication Sciences at EPFL. Previously, he was a researcher in the Systems and Networking Group at Microsoft Research Cambridge. He also held roles at Intel, VMware, and Siemens. More details can be found in his CV.