Detecting silent data corruption for extreme-scale MPI applications

Abstract

Next-generation supercomputers are expected to have more components and, at the same time, consume several times less energy per operation. These trends are pushing supercomputer construction to the limits of miniaturization and energy-saving strategies. Consequently, the number of soft errors is expected to increase dramatically in the coming years. While mechanisms are in place to correct or at least detect some soft errors, a significant percentage of those errors pass unnoticed by the hardware. Such silent errors are extremely damaging because they can make applications silently produce wrong results. In this work we propose a technique that leverages certain properties of high-performance computing applications in order to detect silent errors at the application level. Our technique detects corruption based solely on the behavior of the application datasets and is application-agnostic. We propose multiple corruption detectors, and we couple them to work together in a fashion transparent to the user. We demonstrate that this strategy can detect over 80% of corruptions while incurring less than 1% overhead. We show that the false positive rate is less than 1% and that, when multi-bit corruptions are taken into account, the detection recall increases to over 95%.
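The abstract does not describe how the individual detectors work, so the following is only an illustrative sketch of what detection "based solely on the behavior of the application datasets" could look like, not the authors' method. It assumes a dataset that evolves smoothly over time steps, predicts each value by linear extrapolation from the two previous steps, and flags values that deviate from the prediction by more than a tolerance. The names `detect_outliers`, `flip_bit`, and `tolerance` are hypothetical.

```python
import struct


def flip_bit(value, bit):
    """Return `value` with one bit of its 64-bit IEEE 754 encoding flipped.

    Used here only to simulate a soft error in a double-precision number.
    """
    (as_int,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped


def detect_outliers(prev2, prev1, current, tolerance):
    """Return indices of `current` that deviate from a linear extrapolation.

    Assumption (not from the paper): each data point changes smoothly, so
    2 * prev1[i] - prev2[i] is a reasonable prediction for current[i].
    """
    flagged = []
    for i, (a, b, c) in enumerate(zip(prev2, prev1, current)):
        predicted = 2 * b - a
        if abs(c - predicted) > tolerance:
            flagged.append(i)
    return flagged


# Hypothetical smooth dataset sampled at three consecutive time steps.
step0 = [0.1 * i for i in range(8)]
step1 = [0.1 * i + 0.01 for i in range(8)]
step2 = [0.1 * i + 0.02 for i in range(8)]

# Simulate a soft error in a high-order exponent bit of one value.
corrupted = list(step2)
corrupted[3] = flip_bit(corrupted[3], 62)

print(detect_outliers(step0, step1, corrupted, tolerance=1e-6))
```

A flip in a low-order mantissa bit changes the value by far less than the tolerance and would go unflagged by this sketch, which is one intuition for why a detector of this kind cannot reach 100% recall.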

Cite

Bautista-Gomez, L., & Cappello, F. (2015). Detecting silent data corruption for extreme-scale MPI applications. In ACM International Conference Proceeding Series (Vol. 21-23-September-2015). Association for Computing Machinery. https://doi.org/10.1145/2802658.2802665