[Openmp-commits] [PATCH] D32321: [OpenMP] Optimized default kernel launch parameters in CUDA plugin

Fri Apr 21 17:12:05 PDT 2017

arpith-jacob added a comment.

Hi Jonas,

The numbers are based on my testing of the Rodinia benchmark suite on a K40m.

We don't have a working compiler on Pascal yet (many of the omptests fail there), so I have not benchmarked on that GPU.  Our compiler exposes a bug in the CUDA toolkit that is being fixed.  It is possible that 1024 threads perform better on Pascal (these are, of course, heuristics), so we should extend the functionality here once Pascal support is added to the compiler/runtime.

================
Comment at: libomptarget/plugins/cuda/src/rtl.cpp:622-624
+      } else {
+        cudaBlocksPerGrid = loop_tripcount;
+      }
----------------
Hahnfeld wrote:
> So each block executes one iteration? What is left for the threads in each block?
Correct.

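This case is for the 'teams distribute' construct.  The assumption is that the loop body contains a nested parallel construct in which the threads within the block participate, so each block takes one iteration of the distribute loop and its threads share the inner loop.  Example: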

#pragma omp target teams distribute
for(...) {
  #pragma omp parallel for reduction(..)
  for(..) {}
}
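
For reference, here is a self-contained sketch of the same pattern. The function name row_sums, the row-major matrix layout, and the map clauses are assumptions made for illustration; they are not taken from the patch or from Rodinia. Each outer iteration (one row) goes to one team, and the threads of that team cooperate on the inner reduction. Without offloading support, or with OpenMP disabled, the pragmas fall back to the host or are ignored and the result is the same.

```cpp
#include <vector>

// Hypothetical example: sum each row of a rows x cols matrix (row-major).
// Outer loop: distributed across teams, one iteration per team.
// Inner loop: shared among the threads of that team via a reduction.
std::vector<double> row_sums(const std::vector<double> &m, int rows, int cols) {
  std::vector<double> out(rows, 0.0);
  const double *in = m.data();
  double *res = out.data();

  #pragma omp target teams distribute map(to: in[0:rows*cols]) map(from: res[0:rows])
  for (int r = 0; r < rows; ++r) {
    double sum = 0.0;  // private to the team executing this iteration

    #pragma omp parallel for reduction(+:sum)
    for (int c = 0; c < cols; ++c)
      sum += in[r * cols + c];

    res[r] = sum;  // written once per outer iteration
  }
  return out;
}
```

With a launch of loop_tripcount blocks, a call such as row_sums(m, rows, cols) would, under this assumption, use one block per row.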

Repository:
  rL LLVM

https://reviews.llvm.org/D32321