KBLAS: an optimized library for dense matrix-vector multiplication on GPU accelerators

Publication:5270751

DOI: 10.1145/2818311 | zbMATH Open: 1369.65042 | arXiv: 1410.1726 | OpenAlex: W1839773802 | Wikidata: Q113310224 | Scholia: Q113310224 | MaRDI QID: Q5270751 | FDO: Q5270751


Authors: Ahmad Abdelfattah, Hatem Ltaief, D. E. Keyes


Publication date: 30 June 2017

Published in: ACM Transactions on Mathematical Software

Abstract: KBLAS is a new open-source, high-performance library that provides optimized kernels for a subset of Level 2 BLAS functionality on CUDA-enabled GPUs. Since the performance of dense matrix-vector multiplication is hindered by the overhead of memory accesses, a double-buffering optimization technique is employed to overlap data motion with computation. After identifying a proper set of tuning parameters, KBLAS runs efficiently on GPU architectures across different generations, avoiding the time-consuming step of code rewriting while remaining compliant with the standard BLAS API. A further optimization technique ensures coalesced memory access when dealing with submatrices, which is especially important in the context of high-level dense linear algebra algorithms. KBLAS kernels in all four precisions have been extended to multi-GPU environments, which required introducing new APIs to ease the user experience on these challenging systems. KBLAS outperforms existing state-of-the-art implementations on all matrix sizes, achieves asymptotically up to 50% and 60% speedups on single-GPU and multi-GPU systems, respectively, and validates our performance model. A subset of KBLAS high-performance kernels has been integrated into NVIDIA's standard BLAS implementation (cuBLAS), starting with version 6.0, for wider dissemination.
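
The kernel-level idea highlighted in the abstract is double buffering: staging the next tile of the input vector while the current tile is being consumed, so memory traffic overlaps with computation. The CUDA sketch below is a minimal illustration of that idea only; it is not the KBLAS implementation, and the kernel name, TILE width, and launch shape are assumptions made here for the example (KBLAS tunes such parameters per GPU architecture).

    // A minimal double-buffered DGEMV sketch (y = alpha*A*x + beta*y, column-major A).
    // Illustrative only: NOT the KBLAS kernel. TILE and all names are assumptions.
    #include <cuda_runtime.h>

    #define TILE 64   // hypothetical tile width along the columns of A

    __global__ void dgemv_double_buffered(int m, int n, double alpha,
                                          const double* __restrict__ A, int lda,
                                          const double* __restrict__ x,
                                          double beta, double* __restrict__ y)
    {
        __shared__ double xbuf[2][TILE];                 // two buffers for tiles of x
        const int row = blockIdx.x * blockDim.x + threadIdx.x;
        const int ntiles = (n + TILE - 1) / TILE;

        // Preload the first tile of x into buffer 0.
        for (int j = threadIdx.x; j < TILE; j += blockDim.x)
            xbuf[0][j] = (j < n) ? x[j] : 0.0;
        __syncthreads();

        double acc = 0.0;
        for (int t = 0; t < ntiles; ++t) {
            const int cur = t & 1, nxt = cur ^ 1;

            // Prefetch the next tile of x while the current one is consumed below.
            if (t + 1 < ntiles)
                for (int j = threadIdx.x; j < TILE; j += blockDim.x) {
                    const int col = (t + 1) * TILE + j;
                    xbuf[nxt][j] = (col < n) ? x[col] : 0.0;
                }

            // Accumulate this row's partial dot product over the current tile.
            if (row < m) {
                const int base = t * TILE;
                const int width = min(TILE, n - base);
                for (int j = 0; j < width; ++j)
                    acc += A[row + (size_t)(base + j) * lda] * xbuf[cur][j];
            }
            __syncthreads();   // next tile fully loaded before it becomes "current"
        }
        if (row < m)
            y[row] = alpha * acc + beta * y[row];
    }

Launching one thread per row (for example, dgemv_double_buffered<<<(m + 127) / 128, 128>>>(...)) keeps reads of each column of A coalesced across a warp, loosely echoing the coalesced-access concern raised in the abstract; the actual KBLAS kernels and the variant merged into cuBLAS expose the standard BLAS gemv interface on the host side, as the abstract states.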


Full work available at URL: https://arxiv.org/abs/1410.1726








Cites Work


Cited In (4)

Uses Software





