Multicore-optimized wavefront diamond blocking for optimizing stencil updates
From MaRDI portal
Publication:5264147
Abstract: The importance of stencil-based algorithms in computational science has focused attention on optimized parallel implementations for multilevel cache-based processors. Temporal blocking schemes leverage the large bandwidth and low latency of caches to accelerate stencil updates and approach theoretical peak performance. A key ingredient is the reduction of data traffic across slow data paths, especially the main memory interface. In this work we combine the ideas of multi-core wavefront temporal blocking and diamond tiling to arrive at stencil update schemes that show large reductions in memory pressure compared to existing approaches. The resulting schemes show performance advantages in bandwidth-starved situations, which are exacerbated by the high bytes per lattice update case of variable coefficients. Our thread groups concept provides a controllable trade-off between concurrency and memory usage, shifting the pressure between the memory interface and the CPU. We present performance results on a contemporary Intel processor.
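As a rough illustration of the diamond tiling idea mentioned in the abstract, the following minimal sketch applies it to a 1D three-point Jacobi stencil with fixed boundary values. It is not the scheme from the publication, which targets 3D stencils and adds multi-core wavefront blocking and thread groups. The function names (`naive_sweeps`, `diamond_tiled`), the averaging stencil, and the tile parameter `width` are assumptions made for this example only.

```python
def naive_sweeps(u0, steps):
    """Reference: sweep the whole grid once per time step."""
    cur = list(u0)
    for _ in range(steps):
        nxt = list(cur)  # boundary points 0 and n-1 stay fixed
        for x in range(1, len(cur) - 1):
            nxt[x] = (cur[x - 1] + cur[x] + cur[x + 1]) / 3.0
        cur = nxt
    return cur


def diamond_tiled(u0, steps, width):
    """Same updates, reordered into diamond-shaped space-time tiles.

    A point (t, x) means "value at x after t steps". With a = t + x and
    b = t - x, each of its three inputs at level t - 1 has a and b no
    larger than its own, so tile (A, B) = (a // width, b // width) only
    needs tiles (A', B') with A' <= A and B' <= B.
    """
    n = len(u0)
    buf = [list(u0), list(u0)]  # buf[t % 2] holds time level t
    a_tile_max = (steps + n - 2) // width
    b_tile_min = (3 - n) // width  # floor division, may be negative
    b_tile_max = (steps - 1) // width

    # Tiles with the same s = A + B do not depend on each other, so one
    # such row of diamonds could be distributed over threads.
    for s in range(b_tile_min, a_tile_max + b_tile_max + 1):
        for A in range(0, a_tile_max + 1):
            B = s - A
            if B < b_tile_min or B > b_tile_max:
                continue
            # Time levels covered by this diamond, clipped to [1, steps].
            t_lo = max(1, (s * width + 1) // 2)
            t_hi = min(steps, ((s + 2) * width - 2) // 2)
            for t in range(t_lo, t_hi + 1):
                # Interior points of level t that lie inside the diamond.
                x_lo = max(1, A * width - t, t - (B + 1) * width + 1)
                x_hi = min(n - 2, (A + 1) * width - 1 - t, t - B * width)
                src, dst = buf[(t - 1) % 2], buf[t % 2]
                for x in range(x_lo, x_hi + 1):
                    dst[x] = (src[x - 1] + src[x] + src[x + 1]) / 3.0
    return buf[steps % 2]
```

Each diamond advances a small block of grid points through several time levels before moving on, which is what allows the block to stay in cache in a compiled implementation. The sketch only demonstrates that this reordering reproduces the result of plain sweeps; it makes no performance claim.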
Recommendations
- Optimization and Performance Modeling of Stencil Computations on Modern Microprocessors
- Algorithm 942
- Modeling the performance of geometric multigrid stencils on multicore computer architectures
- Introducing a parallel cache oblivious blocking approach for the lattice Boltzmann method
- Locally recursive non-locally asynchronous algorithms for stencil computation
Cited in (6)
- Optimization and Performance Modeling of Stencil Computations on Modern Microprocessors
- Locally recursive non-locally asynchronous algorithms for stencil computation
- A new memory mapping mechanism for GPGPUs' stencil computation
- Designing a 3D parallel memory-aware lattice Boltzmann algorithm on manycore systems
- Accelerating solutions of one-dimensional unsteady PDEs with GPU-based swept time-space decomposition
- Accelerating stencil computation on GPGPU by novel mapping method between the global memory and the shared memory