Reliable generation of high-performance matrix algebra
From MaRDI portal
Publication:2828141
Abstract: Scientific programmers often turn to vendor-tuned Basic Linear Algebra Subprograms (BLAS) to obtain portable high performance. However, many numerical algorithms require several BLAS calls in sequence, and those successive calls result in suboptimal performance. The entire sequence needs to be optimized in concert. Instead of vendor-tuned BLAS, a programmer could start with source code in Fortran or C (e.g., based on the Netlib BLAS) and use a state-of-the-art optimizing compiler. However, our experiments show that optimizing compilers often attain only one-quarter the performance of hand-optimized code. In this paper we present a domain-specific compiler for matrix algebra, the Build to Order BLAS (BTO), that reliably achieves high performance using a scalable search algorithm for choosing the best combination of loop fusion, array contraction, and multithreading for data parallelism. The BTO compiler generates code that is between 16% slower and 39% faster than hand-optimized code.
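To make the abstract's terms concrete, the following hand-written C sketch illustrates loop fusion and array contraction on two short sequences of BLAS-style operations. It is not output of the BTO compiler, and all function names are hypothetical. The first pair computes q = A p and s = A^T r, either as two separate passes over A or as one fused pass. The second pair computes the dot product of (alpha x + y) with z, either through a temporary vector w or with w contracted to a scalar. Matrices are assumed to be dense and row-major; multithreading is omitted.

```c
#include <stddef.h>

/* Unfused: q = A * p. A is m x n, row-major. Streams A once. */
void gemv_rowmajor(size_t m, size_t n, const double *A,
                   const double *p, double *q) {
    for (size_t i = 0; i < m; ++i) {
        double acc = 0.0;
        for (size_t j = 0; j < n; ++j)
            acc += A[i * n + j] * p[j];
        q[i] = acc;
    }
}

/* Unfused: s = A^T * r. Streams A a second time. */
void gemv_t_rowmajor(size_t m, size_t n, const double *A,
                     const double *r, double *s) {
    for (size_t j = 0; j < n; ++j)
        s[j] = 0.0;
    for (size_t i = 0; i < m; ++i)
        for (size_t j = 0; j < n; ++j)
            s[j] += A[i * n + j] * r[i];
}

/* Loop fusion: both products in a single pass, so each element of A
 * is loaded from memory once instead of twice. */
void fused_gemv_pair(size_t m, size_t n, const double *A,
                     const double *p, const double *r,
                     double *q, double *s) {
    for (size_t j = 0; j < n; ++j)
        s[j] = 0.0;
    for (size_t i = 0; i < m; ++i) {
        double acc = 0.0;
        const double ri = r[i];
        for (size_t j = 0; j < n; ++j) {
            const double a = A[i * n + j];
            acc += a * p[j];   /* contribution to q[i] */
            s[j] += a * ri;    /* contribution to s[j] */
        }
        q[i] = acc;
    }
}

/* Unfused: w = alpha * x + y is stored in a temporary array of
 * length n, then reduced against z in a second loop. */
double axpy_dot_unfused(size_t n, double alpha, const double *x,
                        const double *y, const double *z, double *w) {
    for (size_t i = 0; i < n; ++i)
        w[i] = alpha * x[i] + y[i];
    double d = 0.0;
    for (size_t i = 0; i < n; ++i)
        d += w[i] * z[i];
    return d;
}

/* Fusion plus array contraction: the temporary vector w shrinks to
 * the scalar wi, which can live in a register. */
double axpy_dot_fused(size_t n, double alpha, const double *x,
                      const double *y, const double *z) {
    double d = 0.0;
    for (size_t i = 0; i < n; ++i) {
        const double wi = alpha * x[i] + y[i];
        d += wi * z[i];
    }
    return d;
}
```

In both cases the fused variant performs the same arithmetic as the unfused one but moves less data through memory, which is the kind of saving the abstract attributes to optimizing a sequence of calls in concert. Whether fusion pays off in a given case depends on sizes and hardware, which is why the abstract describes a search over combinations rather than a fixed rule.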
Cites work
- scientific article; zbMATH DE number 1728268
- scientific article; zbMATH DE number 1131224
- scientific article; zbMATH DE number 2086384
- scientific article; zbMATH DE number 1424342
- A set of level 3 basic linear algebra subprograms
- An extended set of FORTRAN basic linear algebra subprograms
- An updated set of basic linear algebra subprograms (BLAS)
- Basic Linear Algebra Subprograms for Fortran Usage
- Cache efficient bidiagonalization using BLAS 2.5 operators
- FLAME
- Families of Algorithms for Reducing a Matrix to Condensed Form
- Languages and Compilers for Parallel Computing
Cited in (6)
- scientific article; zbMATH DE number 1424342
- Automatic generation of fast algorithms for matrix–vector multiplication
- BLASFEO: Basic linear algebra subroutines for embedded optimization
- The BLAS API of BLASFEO: optimizing performance for small matrices
- Analytical modeling is enough for high-performance BLIS
- TTC: a high-performance compiler for tensor transpositions