An error-resilient redundant subspace correction method

DOI: 10.1007/S00791-016-0270-6; zbMATH Open: 1364.65275; arXiv: 1309.0212; OpenAlex: W1750964885; MaRDI QID: Q2359624; FDO: Q2359624

Authors: Tao Cui, Jinchao Xu, Chensong Zhang

Publication date: 22 June 2017

Published in: Computing and Visualization in Science

Abstract: As we stride toward the exascale era, the increasing complexity of supercomputers means that hard and soft errors are causing more and more problems in high-performance scientific and engineering computation. To improve the reliability (increase the mean time to failure) of computing systems, considerable effort has been devoted to developing techniques to forecast, prevent, and recover from errors at different levels, including architecture, application, and algorithm. In this paper, we focus on algorithmic error-resilient iterative linear solvers and introduce a redundant subspace correction method. Using a general framework of redundant subspace corrections, we construct iterative methods that have the following properties: (1) they maintain convergence when an error occurs, assuming it is detectable; (2) they introduce low computational overhead when no error occurs; (3) they require only a small amount of local (point-to-point) communication compared to traditional methods and maintain good load balance; (4) they improve the mean time to failure. With the proposed method, we can improve the reliability of many scientific and engineering applications. Preliminary numerical experiments demonstrate the efficiency and effectiveness of the new subspace correction method.

Full work available at URL: https://arxiv.org/abs/1309.0212

zbMATH Keywords

fault-tolerance; Schwarz methods; subspace correction; error resilience

Mathematics Subject Classification ID

Multigrid methods; domain decomposition for boundary value problems involving PDEs (65N55)
Finite element, Rayleigh-Ritz and Galerkin methods for boundary value problems involving PDEs (65N30)

Cited In (9)
