[Openmp-commits] [PATCH] D32321: [OpenMP] Optimized default kernel launch parameters in CUDA plugin

Thu Apr 20 23:15:54 PDT 2017

Hahnfeld added a comment.

Does this change actually reduce runtime? The last time I tested clang-ykt on Pascal GPUs, 1024 threads per block gave the best performance...

================
Comment at: libomptarget/plugins/cuda/src/rtl.cpp:594-598
   // Add master warp if necessary
   if (KernelInfo->ExecutionMode == GENERIC) {
     cudaThreadsPerBlock += DeviceInfo.WarpSize[device_id];
     DP("Adding master warp: +%d threads\n", DeviceInfo.WarpSize[device_id]);
   }
----------------
Just move this code under `if (thread_limit > 0)`?
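
One possible reading of this suggestion, as a minimal sketch: the master warp is added only on the path where the user requested a thread limit. The function `threadsPerBlock`, its parameters, and the `default_threads` fallback are hypothetical names for illustration; only `thread_limit`, `GENERIC`, and the warp-size addition come from the diff, and the surrounding logic in rtl.cpp is assumed rather than quoted.

```cpp
#include <cassert>

// Execution modes named after the diff; SPMD is assumed as the non-generic mode.
enum ExecutionModeType { SPMD, GENERIC };

// Hypothetical helper, not the plugin's actual code.
// thread_limit    : user-requested limit (<= 0 means "not specified")
// default_threads : assumed default chosen by the plugin when no limit is given
// warp_size       : threads per warp on the device
int threadsPerBlock(int thread_limit, int default_threads, int warp_size,
                    ExecutionModeType mode) {
  assert(default_threads > 0 && warp_size > 0);
  if (thread_limit > 0) {
    int threads = thread_limit;
    // Add master warp if necessary, but only on the user-limit path.
    if (mode == GENERIC)
      threads += warp_size;
    return threads;
  }
  // No user limit: the default is used unchanged in this sketch.
  return default_threads;
}
```

With these assumptions, `threadsPerBlock(128, 1024, 32, GENERIC)` yields 160, while the default path returns `default_threads` without the extra warp.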

================
Comment at: libomptarget/plugins/cuda/src/rtl.cpp:622-624
+      } else {
+        cudaBlocksPerGrid = loop_tripcount;
+      }
----------------
So each block executes a single loop iteration? What work is left for the threads within each block?
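
To make the concern concrete, here is a minimal sketch comparing the two ways of sizing the grid. Both helper names are hypothetical and this is not the plugin's code; the second variant assumes one loop iteration per thread, which is only one of several possible mappings.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical: one block per loop iteration, as the `else` branch in the diff does.
uint64_t blocksOnePerIteration(uint64_t loop_tripcount) {
  assert(loop_tripcount > 0);
  return loop_tripcount;
}

// Hypothetical alternative: one iteration per thread, so the number of
// blocks is the trip count divided by the block size, rounded up.
uint64_t blocksOneIterationPerThread(uint64_t loop_tripcount,
                                     uint64_t threads_per_block) {
  assert(loop_tripcount > 0 && threads_per_block > 0);
  // Ceiling division written to avoid overflow for large trip counts.
  return (loop_tripcount - 1) / threads_per_block + 1;
}
```

For a trip count of 10000 and 128 threads per block, the first variant launches 10000 blocks and the second launches 79, so in the first case most threads of a block have no iteration of their own unless the loop body contains nested parallelism.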

Repository:
  rL LLVM

https://reviews.llvm.org/D32321