Anatomy of high-performance matrix multiplication
From MaRDI portal
Publication:3549230
DOI: 10.1145/1356052.1356053; zbMath: 1190.65064; OpenAlex: W2073061372; Wikidata: Q56455012; Scholia: Q56455012; MaRDI QID: Q3549230
Kazushige Goto, Robert A. van de Geijn
Publication date: 21 December 2008
Published in: ACM Transactions on Mathematical Software
Full work available at URL: https://doi.org/10.1145/1356052.1356053
Related Items
Scientific computations on multi-core systems using different programming frameworks ⋮ Towards an efficient use of the BLAS library for multilinear tensor contractions ⋮ Multidimensional Array Data Management ⋮ Oscillatory Convection in Rotating Spherical Shells: Low Prandtl Number and Non-Slip Boundary Conditions ⋮ Penalized splines for smooth representation of high-dimensional Monte Carlo datasets ⋮ Fast verified solutions of linear systems ⋮ A comparison of high-order time integrators for thermal convection in rotating spherical shells ⋮ The matrix reloaded: multiplication strategies in FrodoKEM ⋮ Heterogeneous computing on mixed unstructured grids with pyfr ⋮ PARFES: A method for solving finite element linear equations on multi-core computers ⋮ The evaluation of American options in a stochastic volatility model with jumps: an efficient finite element approach ⋮ Parallel Matrix Multiplication: A Systematic Journey ⋮ An efficient implementation of two-component relativistic density functional theory with torque-free auxiliary variables ⋮ High dimensional tori and chaotic and intermittent transients in magnetohydrodynamic Couette flows ⋮ Architecture-based and target-oriented algorithm optimization of high-order methods via complete-search tensor contraction ⋮ High-Performance Tensor Contraction without Transposition ⋮ A high-performance implementation of atomistic spin dynamics simulations on x86 CPUs ⋮ Numerical stability of algorithms at extreme scale and low precisions ⋮ Upper and lower I/O bounds for pebbling \(r\)-pyramids ⋮ Implementing High-Performance Complex Matrix Multiplication via the 1M Method ⋮ Deriving dense linear algebra libraries ⋮ GMRES with embedded ensemble propagation for the efficient solution of parametric linear systems in uncertainty quantification of computational models ⋮ Automatic generation of fast algorithms for matrix–vector multiplication ⋮ Blocked algorithms for the reduction to Hessenberg-triangular form revisited ⋮ Householder QR Factorization With Randomization for Column Pivoting (HQRRP) ⋮ Parallel direct solver for solving systems of linear equations resulting from finite element method on multi-core desktops and workstations ⋮ Restructuring the Tridiagonal and Bidiagonal QR Algorithms for Performance ⋮ Dominant speed factors of active set methods for fast MPC ⋮ Strassen's Algorithm for Tensor Contraction ⋮ A Componentwise Splitting Method for Pricing American Options Under the Bates Model ⋮ Direct reconstruction method for discontinuous Galerkin methods on higher-order mixed-curved meshes III. Code optimization via tensor contraction ⋮ Modulated rotating waves in the magnetised spherical Couette system ⋮ Safe feature elimination for non-negativity constrained convex optimization ⋮ BLIS: A Framework for Rapidly Instantiating BLAS Functionality ⋮ Continuation and stability of rotating waves in the magnetized spherical Couette system: secondary transitions and multistability ⋮ Computing the Gradient in Optimization Algorithms for the CP Decomposition in Constant Memory through Tensor Blocking ⋮ Unnamed Item ⋮ Analytical Modeling Is Enough for High-Performance BLIS
Uses Software