CUBLAS
From MaRDI portal
Software:18949
swMATH: 6880, MaRDI QID: Q18949, FDO: Q18949
Author name not available
Cited In (80)
- HPMaX: heterogeneous parallel matrix multiplication using CPUs and GPUs
- A Scalable High Performant Cholesky Factorization for Multicore with GPU Accelerators
- Parallel reduction of four matrices to condensed form for a generalized matrix eigenvalue algorithm
- A GPU application for high-order compact finite difference scheme
- Evaluation of gas sales agreements with indexation using tree and least-squares Monte Carlo methods on graphics processing units
- A Runtime System for Programming Out-of-Core Matrix Algorithms-by-Tiles on Multithreaded Architectures
- Exposing fine-grained parallelism in algebraic multigrid methods
- Higher order finite elements in space and time for anisotropic simulations with variational integrators. Application of an efficient GPU implementation
- Fast and robust flow simulations in discrete fracture networks with GPGPUs
- Efficient determination of the Markovian time-evolution towards a steady-state of a complex open quantum system
- GPU-accelerated algorithms for many-particle continuous-time quantum walks
- Megapixel topology optimization on a graphics processing unit
- Auto-tuned Krylov methods on cluster of graphics processing unit
- Introduction to high performance scientific computing
- Solving time-fractional reaction-diffusion systems through a tensor-based parallel algorithm
- A new efficient and accurate spline algorithm for the matrix exponential computation
- GPU optimization of large-scale eigenvalue solver
- Efficient \(L_0\) resampling of point sets
- Discrete particle swarm optimization for constructing uniform design on irregular regions
- Fast Taylor polynomial evaluation for the computation of the matrix cosine
- Algorithms for efficient reproducible floating point summation
- GPU-accelerated preconditioned GMRES method for two-dimensional Maxwell's equations
- Compressed hierarchical Schur algorithm for frequency-domain analysis of photonic structures
- An efficient and accurate algorithm for computing the matrix cosine based on new Hermite approximations
- MPI-CUDA sparse matrix-vector multiplication for the conjugate gradient method with an approximate inverse preconditioner
- A parallel computing method using blocked format with optimal partitioning for SpMV on GPU
- Low synchronization Gram–Schmidt and generalized minimal residual algorithms
- Efficient and accurate algorithms for computing matrix trigonometric functions
- Performance models and workload distribution algorithms for optimizing a hybrid CPU-GPU multifrontal solver
- Accelerating GPU kernels for dense linear algebra
- GPU-based block-wise nonlocal means denoising for 3D ultrasound images
- 3D data denoising via nonlocal means filter by using parallel GPU strategies
- Accelerated dimension-independent adaptive metropolis
- GPU accelerated computational homogenization based on a variational approach in a reduced basis framework
- GPU-accelerated Bernstein-Bézier discontinuous Galerkin methods for wave problems
- A GPU-accelerated hybridizable discontinuous Galerkin method for linear elasticity
- GPGPU-based parallel computing applied in the FEM using the conjugate gradient algorithm: a review
- Title not available
- FloatX
- New Hermite series expansion for computing the matrix hyperbolic cosine
- Development of a parallel CUDA algorithm for solving 3D guiding center problems
- Graphics processing units and high-dimensional optimization
- Updating incomplete factorization preconditioners for model order reduction
- Rapid re-meshing and re-solution of three-dimensional boundary element problems for interactive stress analysis
- Optimal size of the block in block GMRES on GPUs: computational model and experiments
- On the GPGPU parallelization issues of finite element approximate inverse preconditioning
- Parallel and Heterogeneous \(m\)-Hessenberg-Triangular-Triangular Reduction
- Finite element integration on GPGPUs
- A heterogeneous parallel LU factorization algorithm based on a basic column block uniform allocation strategy
- Redesigning triangular dense matrix computations on GPUs
- Portable implementation model for CFD simulations. Application to hybrid CPU/GPU supercomputers
- Efficient GPU-based implementations of simplex type algorithms
- Acceleration of early-photon fluorescence molecular tomography with graphics processing units
- Towards a parallel component in a GPU-CUDA environment: a case study with the L-BFGS Harwell routine
- GPU accelerated intensities MPI (GAIN-MPI): a new method of computing Einstein-\(A\) coefficients
- AmgX: a library for GPU accelerated algebraic multigrid and preconditioned iterative methods
- Solving a large-scale thermal radiation problem using an interoperable executive library framework on petascale supercomputers
- Parallel solver for shifted systems in a hybrid CPU-GPU framework
- Cucheb: a GPU implementation of the filtered Lanczos procedure
- The eigenvalues slicing library (EVSL): algorithms, implementation, and software
- High-performance statistical computing in the computing environments of the 2020s
- A \(\mu\)-mode BLAS approach for multidimensional tensor-structured problems
- Highly efficient GPU eigensolver for three-dimensional photonic crystal band structures with any Bravais lattice
- Title not available
- A Dynamic Pattern Factored Sparse Approximate Inverse Preconditioner on Graphics Processing Units
- Accelerating the explicitly restarted Arnoldi method with GPUs using an autotuned matrix vector product
- Parallel Prony's Method with Multivariate Matrix Pencil Approach and Its Numerical Aspects
- Strassen's algorithm reloaded on GPUs
- A Framework for Error-Bounded Approximate Computing, with an Application to Dot Products
- Batch Matrix Exponentiation
- CUDA-based scientific computing. Tools and selected applications
- Matrix Multiplication in Multiword Arithmetic: Error Analysis and Application to GPU Tensor Cores
- KBLAS: an optimized library for dense matrix-vector multiplication on GPU accelerators
- A Fast Parallel SVM Algorithm for Massive Classification Tasks
- A Set of Batched Basic Linear Algebra Subprograms and LAPACK Routines
- A flexible CUDA LU-based solver for small, batched linear systems
- A fast dense triangular solve in CUDA
- Performance and numerical accuracy evaluation of heterogeneous multicore systems for Krylov orthogonal basis computation
- An error correction solver for linear systems: evaluation of mixed precision implementations
- SCELib4.0: the new program version for computing molecular properties in the single center approach