[Openmp-commits] [PATCH] D32321: [OpenMP] Optimized default kernel launch parameters in CUDA plugin
Arpith Jacob via Phabricator via Openmp-commits
openmp-commits at lists.llvm.org
Fri Apr 21 17:12:05 PDT 2017
arpith-jacob added a comment.
Hi Jonas,
The numbers are based on my testing of the Rodinia benchmarks on a K40m.
We don't have a working compiler on Pascal yet (many of the omptests fail there), so I have not benchmarked on that GPU. Our compiler exposes a bug in the CUDA toolkit that is being fixed. It is possible that 1024 threads perform better on Pascal (these are, of course, heuristics), so we should extend the functionality here once Pascal support is added to the compiler/runtime.
================
Comment at: libomptarget/plugins/cuda/src/rtl.cpp:622-624
+ } else {
+ cudaBlocksPerGrid = loop_tripcount;
+ }
----------------
Hahnfeld wrote:
> So each block executes one iteration? What is left for the threads in each block?
Correct.
This case is for the 'teams distribute' construct. The assumption is that there is a nested parallel construct in which the threads within the block participate. Example:
  #pragma omp target teams distribute
  for(...) {
    #pragma omp parallel for reduction(..)
    for(..) {}
  }
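To make the pattern above concrete, here is a minimal compilable sketch. The function name row_sums, the row-major matrix layout, and the map clauses are illustrative assumptions of mine and are not part of the patch. Each outer iteration (one row) is handed to a team, and the threads of that team cooperate on the inner reduction:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical example: sum each row of a row-major rows x cols matrix.
// The outer loop is distributed across teams (one iteration per team when
// the number of teams equals the trip count); the inner loop is shared by
// the threads of the team via a nested parallel for with a reduction.
std::vector<double> row_sums(const std::vector<double> &m, int rows, int cols) {
  assert(rows >= 0 && cols >= 0);
  assert(m.size() == static_cast<std::size_t>(rows) * static_cast<std::size_t>(cols));

  std::vector<double> out(rows, 0.0);
  const double *in = m.data();
  double *res = out.data();

  #pragma omp target teams distribute map(to: in[0:rows*cols]) map(from: res[0:rows])
  for (int i = 0; i < rows; ++i) {
    double sum = 0.0;  // private to the team executing this iteration
    #pragma omp parallel for reduction(+:sum)
    for (int j = 0; j < cols; ++j)
      sum += in[i * cols + j];
    res[i] = sum;
  }
  return out;
}
```

Without an offloading-capable compiler the pragmas are either ignored or fall back to host execution, so the function should compute the same result either way; only the mapping of iterations to blocks and threads differs.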
Repository:
rL LLVM
https://reviews.llvm.org/D32321