Communication-optimal parallel and sequential QR and LU factorizations
From MaRDI portal
numerical examplesparallel algorithmparallel computationcomparison of methodsLU factorizationQR factorizationrectangular matrices``non-Strassen-like QRHouseholder QRScaLAPACK algorithms
Direct numerical methods for linear systems and matrix inversion (65F05) Parallel numerical computation (65Y05) Complexity and performance of numerical algorithms (65Y20) Factorization of matrices (15A23) Numerical solutions to overdetermined systems, pseudoinverses (65F20) Orthogonalization in numerical linear algebra (65F25)
Abstract: We present parallel and sequential dense QR factorization algorithms that are both optimal (up to polylogarithmic factors) in the amount of communication they perform, and just as stable as Householder QR. We prove optimality by extending known lower bounds on communication bandwidth for sequential and parallel matrix multiplication to provide latency lower bounds, and show these bounds apply to the LU and QR decompositions. We not only show that our QR algorithms attain these lower bounds (up to polylogarithmic factors), but that existing LAPACK and ScaLAPACK algorithms perform asymptotically more communication. We also point out recent LU algorithms in the literature that attain at least some of these lower bounds.
Recommendations
Cited in
(95)- An improved shifted CholeskyQR based on columns
- Analyzing the dissemination of news by model averaging and subsampling
- CholeskyQR with randomization and pivoting for tall matrices (CQRRPT)
- Simultaneous multidiagonalization for the CS decomposition
- Distributed-parallel proper orthogonal/dynamic mode decompositions of large flow data
- Mixed-Precision Cholesky QR Factorization and Its Case Studies on Multicore CPU with Multiple GPUs
- A parallel and streaming dynamic mode decomposition algorithm with finite precision error analysis for large data
- A block Cholesky‐LU‐based QR factorization for rectangular matrices
- Robust and accurate stopping criteria for adaptive randomized sampling in matrix-free hierarchically semiseparable construction
- Avoiding Communication in Primal and Dual Block Coordinate Descent Methods
- Sparse linear least-squares problems
- Efficient image reconstruction via regularized variable s-step conjugate gradient method for Sylvester matrix equations
- Probabilistic rounding error analysis of modified Gram-Schmidt
- Augmented flexible Krylov subspace methods with applications to Bayesian inverse problems
- Backward error analysis of the AllReduce algorithm for Householder QR decomposition
- Parallel Algorithms for Tensor Train Arithmetic
- Parallel \textit{QR} factorization of block-tridiagonal matrices
- Benefits from using mixed precision computations in the ELPA-AEO and ESSEX-II eigensolver projects
- Low Rank Approximation of a Sparse Matrix Based on LU Factorization with Column and Row Tournament Pivoting
- On sampling determinantal and Pfaffian point processes on a quantum computer
- A parallel algorithm for calculation of determinants and minors using arbitrary precision arithmetic
- Randomized numerical linear algebra: Foundations and algorithms
- GMRES with embedded ensemble propagation for the efficient solution of parametric linear systems in uncertainty quantification of computational models
- Considerations on the Implementation and Use of Anderson Acceleration on Distributed Memory and GPU-based Parallel Computers
- Linear-time CUR approximation of BEM matrices
- Adaptively restarted block Krylov subspace methods with low-synchronization skeletons
- Iteratively reweighted FGMRES and FLSQR for sparse reconstruction
- Detecting interactions in high-dimensional data using cross leverage scores
- Gram-Schmidt orthogonalization: 100 years and more
- TTDFT: a GPU accelerated Tucker tensor DFT code for large-scale Kohn-Sham DFT calculations
- Communication-efficient distributed statistical inference
- Randomized algorithms for distributed computation of principal component analysis and singular value decomposition
- Scaling LAPACK panel operations using parallel cache assignment
- A fast randomized algorithm for computing an approximate null space
- Randomized block Gram-Schmidt process for the solution of linear systems and eigenvalue problems
- Numerical algorithms for high-performance computational science
- Communication avoiding rank revealing QR factorization with column pivoting
- Linear algebra software for large-scale accelerated multicore computing
- Developing variable \(s\)-step CGNE and CGNR algorithms for non-symmetric linear systems
- Task-based parallel programming for scalable matrix product algorithms
- Finding solution of linear systems via new forms of BiCG, BiCGstab and CGS algorithms
- A LAPACK implementation of the dynamic mode decomposition
- Hermitian dynamic mode decomposition -- numerical analysis and software solution
- Block Modified Gram--Schmidt Algorithms and Their Analysis
- Accelerating the reduction to upper Hessenberg, tridiagonal, and bidiagonal forms through hybrid GPU-based computing
- Nonnegative diagonals and high performance on low-profile matrices from Householder QR
- Parallel \(\mathcal {H}\)-matrix arithmetic on distributed-memory systems
- Scaling up parallel computation of tiled QR factorizations by a distributed scheduling runtime system and analytical modeling
- Introduction to communication avoiding algorithms for direct methods of factorization in linear algebra
- Performance of the low-rank TT-SVD for large dense tensors on modern multicore CPUs
- A randomized blocked algorithm for efficiently computing rank-revealing factorizations of matrices
- On the loss of orthogonality in low-synchronization variants of reorthogonalized block classical Gram-Schmidt
- A distributed and incremental SVD algorithm for agglomerative data analysis on large networks
- Active control of the flow past a circular cylinder using online dynamic mode decomposition
- A novel parallel algorithm based on the Gram-Schmidt method for tridiagonal linear systems of equations
- Adapting regularized low-rank models for parallel architectures
- Fast multipole preconditioners for sparse matrices arising from elliptic equations
- FaIMS: a fast algorithm for the inverse medium problem with multiple frequencies and multiple sources for the scalar Helmholtz equation
- Data Driven Modal Decompositions: Analysis and Enhancements
- Incremental model order reduction of smoothed-particle hydrodynamic simulations
- Increasing the performance of the Jacobi-Davidson method by blocking
- Randomized Projection for Rank-Revealing Matrix Factorizations and Low-Rank Approximations
- A numerically stable communication-avoiding \(s\)-step GMRES algorithm
- A new modified variable s-step BiCGSTAB method with regularization for solving shifted linear systems
- Implementing Multifrontal Sparse Solvers for Multicore Architectures with Sequential Task Flow Runtime Systems
- High-performance implementation of Chebyshev filter diagonalization for interior eigenvalue computations
- Communication-optimal parallel and sequential Cholesky decomposition
- Communication lower bounds and optimal algorithms for numerical linear algebra
- Scalable linear solvers based on enlarged Krylov subspaces with dynamic reduction of search directions
- Randomised subspace system identification: complexities and error bounds
- Distributed generalized linear models: a privacy-preserving approach
- GPU parameter tuning for tall and skinny dense linear least squares problems
- Householder Orthogonalization with a Nonstandard Inner Product
- Shifted Cholesky QR for computing the QR factorization of ill-conditioned matrices
- Performance Analysis of the Householder-Type Parallel Tall-Skinny QR Factorizations Toward Automatic Algorithm Selection
- LU factorization with panel rank revealing pivoting and its communication avoiding version
- Rounding error analysis of mixed precision block Householder QR algorithms
- Towards dense linear algebra for hybrid GPU accelerated manycore systems
- Random projections for Bayesian regression
- Analysis of randomized Householder-Cholesky QR factorization with multisketching
- A novel adaptive low-rank matrix approximation method for image compression and reconstruction
- Communication-avoiding symmetric-indefinite factorization
- Randomized QR with column pivoting
- Varying the \(s\) in your \(s\)-step GMRES
- CALU: A communication optimal LU factorization algorithm
- GMRES algorithms over 35 years
- Block Gram-Schmidt algorithms and their stability properties
- Communication Avoiding ILU0 Preconditioner
- Mixed Precision Iterative Refinement with Sparse Approximate Inverse Preconditioning
- Cholesky and Gram-Schmidt orthogonalization for tall-and-skinny QR factorizations on graphics processors
- A distributed block Chebyshev-Davidson algorithm for parallel spectral clustering
- The singular value decomposition: anatomy of optimizing an algorithm for extreme scale
- The red-blue pebble game on trees and DAGs with large input
- Variable s-step technique for planar algorithms in solving indefinite linear systems
- Enlarged Krylov subspace conjugate gradient methods for reducing communication
This page was built for publication: Communication-optimal parallel and sequential QR and LU factorizations
Report a bug (only for logged in users!)Click here to report a bug for this page (MaRDI item Q2882786)