Visible to the public Biblio

Filters: Author is Abraham, Jacob A.  [Clear All Filters]
Abraham, Jacob A..  2019.  Resiliency Demands on Next Generation Critical Embedded Systems. 2019 IEEE 25th International Symposium on On-Line Testing and Robust System Design (IOLTS). :135–138.

Emerging intelligent systems have stringent constraints including cost and power consumption. When they are used in critical applications, resiliency becomes another key requirement. Much research into techniques for fault tolerance and dependability has been successfully applied to highly critical systems, such as those used in space, where cost is not an overriding constraint. Further, most resiliency techniques were focused on dealing with failures in the hardware and bugs in the software. The next generation of systems used in critical applications will also have to be tolerant to test escapes after manufacturing, soft errors and transients in the electronics, hardware bugs, hardware and software Trojans and viruses, as well as intrusions and other security attacks during operation. This paper will assess the impact of these threats on the results produced by a critical system, and proposed solutions to each of them. It is argued that run-time checks at the application-level are necessary to deal with errors in the results.

Cheng, Eric, Mirkhani, Shahrzad, Szafaryn, Lukasz G., Cher, Chen-Yong, Cho, Hyungmin, Skadron, Kevin, Stan, Mircea R., Lilja, Klas, Abraham, Jacob A., Bose, Pradip et al..  2016.  CLEAR: Cross-Layer Exploration for Architecting Resilience - Combining Hardware and Software Techniques to Tolerate Soft Errors in Processor Cores. Proceedings of the 53rd Annual Design Automation Conference. :68:1–68:6.

We present a first of its kind framework which overcomes a major challenge in the design of digital systems that are resilient to reliability failures: achieve desired resilience targets at minimal costs (energy, power, execution time, area) by combining resilience techniques across various layers of the system stack (circuit, logic, architecture, software, algorithm). This is also referred to as cross-layer resilience. In this paper, we focus on radiation-induced soft errors in processor cores. We address both single-event upsets (SEUs) and single-event multiple upsets (SEMUs) in terrestrial environments. Our framework automatically and systematically explores the large space of comprehensive resilience techniques and their combinations across various layers of the system stack (798 cross-layer combinations in this paper), derives cost-effective solutions that achieve resilience targets at minimal costs, and provides guidelines for the design of new resilience techniques. We demonstrate the practicality and effectiveness of our framework using two diverse designs: a simple, in-order processor core and a complex, out-of-order processor core. Our results demonstrate that a carefully optimized combination of circuit-level hardening, logic-level parity checking, and micro-architectural recovery provides a highly cost-effective soft error resilience solution for general-purpose processor cores. For example, a 50× improvement in silent data corruption rate is achieved at only 2.1% energy cost for an out-of-order core (6.1% for an in-order core) with no speed impact. However, selective circuit-level hardening alone, guided by a thorough analysis of the effects of soft errors on application benchmarks, provides a cost-effective soft error resilience solution as well (with \textasciitilde1% additional energy cost for a 50× improvement in silent data corruption rate).