[Openmp-commits] [PATCH] D32321: [OpenMP] Optimized default kernel launch parameters in CUDA plugin
George Rokos via Phabricator via Openmp-commits
openmp-commits at lists.llvm.org
Thu Apr 20 16:15:18 PDT 2017
grokos created this revision.
grokos added a project: OpenMP.
Herald added a subscriber: rengolin.
This patch modifies the default target kernel launch parameters (num_teams and thread_limit). The default thread_limit is now 128 threads per team, down from 1024. In SPMD mode the kernel is launched with 128 threads; in non-SPMD (generic) mode it is launched with 96 worker threads plus the 32 threads of the master warp.
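As a rough illustration of the thread arithmetic described above, the following is a minimal sketch and not the plugin's actual code. The names `kDefaultNumThreads`, `kWarpSize`, `LaunchThreads`, and `defaultThreads` are hypothetical, and a warp size of 32 is assumed.

```cpp
// Hypothetical stand-ins for the plugin's two execution modes.
enum ExecutionMode { SPMD, GENERIC };

// Assumed values: the new default thread_limit and a 32-thread warp.
constexpr int kDefaultNumThreads = 128;
constexpr int kWarpSize = 32;

// Worker threads per team and total threads launched per CUDA block.
struct LaunchThreads {
  int workers;
  int total;
};

// Computes the default launch size when the user requests no thread_limit.
LaunchThreads defaultThreads(ExecutionMode mode) {
  int workers = kDefaultNumThreads;
  int total = kDefaultNumThreads;
  if (mode == GENERIC) {
    // Leave room for the master warp, which is added back at launch time,
    // so the block still holds kDefaultNumThreads threads in total.
    workers -= kWarpSize;
    total = workers + kWarpSize;
  }
  return {workers, total};
}
```

Under these assumptions both modes launch 128 threads per block; generic mode simply reserves one warp of them for the master.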
The default number of teams has been optimized as follows. For the constructs below:
`#target teams distribute`
`#teams distribute`
`#target teams distribute simd`
`#teams distribute simd`
if the associated loop trip count is N, then the kernel is launched with N teams.
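The team-count rule can be sketched as follows, assuming (as the diff suggests) that the choice is keyed on the kernel's execution mode and that the listed constructs run in generic mode. The function name `defaultNumTeams` and its parameters are hypothetical, and any clamping to a hard team limit is deliberately left out.

```cpp
#include <cstdint>

// Hypothetical stand-ins for the plugin's two execution modes.
enum ExecutionMode { SPMD, GENERIC };

// Default number of teams when neither num_teams nor an environment
// override is given. Precondition: tripCount > 0 and threadsPerBlock > 0.
int64_t defaultNumTeams(ExecutionMode mode, int64_t tripCount,
                        int threadsPerBlock) {
  if (mode == SPMD) {
    // Every thread takes one iteration: ceil(tripCount / threadsPerBlock).
    return (tripCount - 1) / threadsPerBlock + 1;
  }
  // Generic mode: one team per iteration of the distributed loop.
  return tripCount;
}
```

For example, with a trip count of 1000, this sketch yields 8 teams in SPMD mode with 128 threads per block, and 1000 teams in generic mode.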
Repository:
rL LLVM
https://reviews.llvm.org/D32321
Files:
libomptarget/plugins/cuda/src/rtl.cpp
Index: libomptarget/plugins/cuda/src/rtl.cpp
===================================================================
--- libomptarget/plugins/cuda/src/rtl.cpp
+++ libomptarget/plugins/cuda/src/rtl.cpp
@@ -99,7 +99,7 @@
static const int HardTeamLimit = 1<<16; // 64k
static const int HardThreadLimit = 1024;
static const int DefaultNumTeams = 128;
- static const int DefaultNumThreads = 1024;
+ static const int DefaultNumThreads = 128;
// Record entry point associated with device
void addOffloadEntry(int32_t device_id, __tgt_offload_entry entry) {
@@ -583,6 +583,10 @@
DP("Setting CUDA threads per block to requested %d\n", thread_limit);
} else {
cudaThreadsPerBlock = DeviceInfo.NumThreads[device_id];
+ if (KernelInfo->ExecutionMode == GENERIC) {
+ // Leave room for the master warp which will be added below.
+ cudaThreadsPerBlock -= DeviceInfo.WarpSize[device_id];
+ }
DP("Setting CUDA threads per block to default %d\n",
DeviceInfo.NumThreads[device_id]);
}
@@ -612,8 +616,12 @@
int cudaBlocksPerGrid;
if (team_num <= 0) {
if (loop_tripcount > 0 && DeviceInfo.EnvNumTeams < 0) {
- // round up to the nearest integer
- cudaBlocksPerGrid = ((loop_tripcount - 1) / cudaThreadsPerBlock) + 1;
+ if (KernelInfo->ExecutionMode == SPMD) {
+ // round up to the nearest integer
+ cudaBlocksPerGrid = ((loop_tripcount - 1) / cudaThreadsPerBlock) + 1;
+ } else {
+ cudaBlocksPerGrid = loop_tripcount;
+ }
DP("Using %d teams due to loop trip count %" PRIu64 " and number of "
"threads per block %d\n", cudaBlocksPerGrid, loop_tripcount,
cudaThreadsPerBlock);