[Openmp-commits] [PATCH] D32321: [OpenMP] Optimized default kernel launch parameters in CUDA plugin

Jonas Hahnfeld via Phabricator via Openmp-commits openmp-commits at lists.llvm.org
Thu Apr 20 23:15:54 PDT 2017

Hahnfeld added a comment.

Does this change actually result in a lower runtime? The last time I tested clang-ykt on Pascal GPUs, 1024 threads was really the best choice...

Comment at: libomptarget/plugins/cuda/src/rtl.cpp:594-598
   // Add master warp if necessary
   if (KernelInfo->ExecutionMode == GENERIC) {
     cudaThreadsPerBlock += DeviceInfo.WarpSize[device_id];
     DP("Adding master warp: +%d threads\n", DeviceInfo.WarpSize[device_id]);

Just move this code under `if (thread_limit > 0)`?

Comment at: libomptarget/plugins/cuda/src/rtl.cpp:622-624
+      } else {
+        cudaBlocksPerGrid = loop_tripcount;
+      }

So each block executes one iteration? What is left for the threads in each block?
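
To make the question concrete, here is a minimal sketch comparing two ways of deriving the grid size from the loop trip count. The struct and function names are hypothetical and are not taken from rtl.cpp or from the patch; the second variant is only one possible alternative, shown for contrast. Whether the threads inside a block end up idle depends on how the generated kernel distributes iterations, which the sketch does not model.

```cpp
#include <cstdint>

// Hypothetical container for launch dimensions (not a type from rtl.cpp).
struct LaunchDims {
  std::uint64_t blocks;
  std::uint64_t threadsPerBlock;
};

// Variant questioned above: one block per loop iteration.
LaunchDims oneBlockPerIteration(std::uint64_t tripCount,
                                std::uint64_t threadsPerBlock) {
  return {tripCount, threadsPerBlock};
}

// Assumed alternative: spread iterations over the threads of each block,
// rounding the block count up so every iteration is covered.
LaunchDims iterationsAcrossThreads(std::uint64_t tripCount,
                                   std::uint64_t threadsPerBlock) {
  return {(tripCount + threadsPerBlock - 1) / threadsPerBlock,
          threadsPerBlock};
}

// Total number of threads launched for a given configuration.
std::uint64_t totalThreads(const LaunchDims &d) {
  return d.blocks * d.threadsPerBlock;
}
```

For a trip count of 1000 and 128 threads per block, the first variant launches 1000 blocks (128000 threads in total), while the second launches 8 blocks (1024 threads in total).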


