Fault-Tolerance Techniques for High-Performance Computing

This timely text/reference presents a comprehensive overview of fault tolerance techniques for high-performance computing (HPC). The text opens with a detailed introduction to the concepts of checkpoint protocols and scheduling algorithms, prediction, replication, silent error detection and correcti...


Corporate Author: SpringerLink (Online service)
Other Authors: Herault, Thomas (Editor), Robert, Yves (Editor)
Format: eBook
Language: English
Published: Cham: Springer International Publishing, 2015
Edition: 1st ed. 2015
Series: Computer Communications and Networks
Collection: Springer eBooks 2005- - Collection details see MPG.ReNa
Table of Contents:
  • Part I: General Overview
  • Fault-Tolerance Techniques for High-Performance Computing
  • Part II: Technical Contributions
  • Errors and Faults
  • Fault-Tolerant MPI
  • Using Replication for Resilience on Exascale Systems
  • Energy-Aware Checkpointing Strategies