An error-resilient redundant subspace correction method
Publication:2359624
Abstract: As we stride toward the exascale era, the increasing complexity of supercomputers means that hard and soft errors are causing more and more problems in high-performance scientific and engineering computation. To improve the reliability (increase the mean time to failure) of computing systems, considerable effort has been devoted to developing techniques to forecast, prevent, and recover from errors at different levels, including architecture, application, and algorithm. In this paper, we focus on algorithmic error-resilient iterative linear solvers and introduce a redundant subspace correction method. Using a general framework of redundant subspace corrections, we construct iterative methods that have the following properties: (1) they maintain convergence when an error occurs, assuming it is detectable; (2) they introduce low computational overhead when no error occurs; (3) they require only a small amount of local (point-to-point) communication compared to traditional methods and maintain good load balance; (4) they improve the mean time to failure. With the proposed method, we can improve the reliability of many scientific and engineering applications. Preliminary numerical experiments demonstrate the efficiency and effectiveness of the new subspace correction method.
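Property (1) of the abstract can be illustrated with a toy sketch. This is not the paper's algorithm: it is a plain successive subspace correction (block Gauss-Seidel) on a 1D Laplacian, with an assumed fault model in which each block has two hypothetical workers, every lost result is detected, and a block update is skipped only when both workers fail. The names `laplacian_1d`, `redundant_ssc`, and `fault_prob` are made up for this illustration.

```python
import numpy as np


def laplacian_1d(n):
    """Symmetric positive definite model matrix: tridiagonal (-1, 2, -1)."""
    return 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)


def redundant_ssc(A, b, blocks, sweeps, fault_prob, rng):
    """Successive subspace correction with two workers per block.

    Assumed fault model (illustrative only): each worker's result is lost
    with probability fault_prob, and every loss is detected. A block
    correction is applied if at least one worker survives; otherwise the
    block is skipped for this sweep.
    """
    x = np.zeros_like(b, dtype=float)
    skipped = 0
    for _ in range(sweeps):
        for idx in blocks:
            # One survival flag per redundant worker of this block.
            survivors = rng.random(2) >= fault_prob
            if not survivors.any():
                skipped += 1  # both copies lost: no correction this time
                continue
            # Exact local solve on the subspace spanned by the block.
            r = b[idx] - A[idx, :] @ x
            x[idx] += np.linalg.solve(A[np.ix_(idx, idx)], r)
    return x, skipped


if __name__ == "__main__":
    n = 24
    A = laplacian_1d(n)
    b = np.ones(n)
    blocks = np.array_split(np.arange(n), 4)
    x, skipped = redundant_ssc(A, b, blocks, 400, 0.3, np.random.default_rng(0))
    print("skipped corrections:", skipped)
    print("relative residual:", np.linalg.norm(A @ x - b) / np.linalg.norm(b))
```

For a symmetric positive definite matrix, each exact block correction cannot increase the error in the energy norm, so a skipped correction only delays progress. That is why the iteration in this sketch still converges when faults are detected and dropped, and why the redundant worker costs nothing extra in iterations when no fault occurs.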
Recommendations
- Randomized and fault-tolerant method of subspace corrections
- Stochastic subspace correction methods and fault tolerance
- Algorithm-based error-detection schemes for iterative solution of partial differential equations
- Resilience for massively parallel multigrid solvers
- Numerical recovery strategies for parallel resilient Krylov linear solvers
Cites work
- scientific article; zbMATH DE number 5711109
- scientific article; zbMATH DE number 1953446
- scientific article; zbMATH DE number 218047
- scientific article; zbMATH DE number 218074
- scientific article; zbMATH DE number 218075
- scientific article; zbMATH DE number 2113718
- A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs
- A Parallel Implementation of an Iterative Substructuring Algorithm for Problems in Three Dimensions
- Algorithm-Based Fault Tolerance for Matrix Operations
- Algorithmic Fault Tolerance Using the Lanczos Method
- Elliptic differential equations: theory and numerical treatment. Transl. from the German by Regine Fadiman and Patrick D. F. Ion
- Exaflop/s: the why and the how
- Finite Element Methods in Mechanics
- Iterative Methods by Space Decomposition and Subspace Correction
- Iterative solution of large sparse systems of equations. Transl. from the German
- Numerical analysis of fixed point algorithms in the presence of hardware faults
- Parallel Multilevel Preconditioners
- Recovery Patterns for Iterative Methods in a Parallel Unstable Environment
- The method of alternating projections and the method of subspace corrections in Hilbert space
Cited in (9)
- Algorithm-based error-detection schemes for iterative solution of partial differential equations
- Modifying the asynchronous Jacobi method for data corruption resilience
- Efficient measurement error correction with spatially misaligned data
- Stochastic subspace correction methods and fault tolerance
- Is the multigrid method fault tolerant? The two-grid case
- Resilience for massively parallel multigrid solvers
- Randomized and fault-tolerant method of subspace corrections
- The resiliency of multilevel methods on next-generation computing platforms: probabilistic model and its analysis
- Robust subspace correction methods for nearly singular systems