[Openmp-commits] [PATCH] D136641: [OPENMP] Fast cross-team reduction (xteamr) helper functions.
Greg Rodgers via Phabricator via Openmp-commits
openmp-commits at lists.llvm.org
Mon Oct 24 14:59:13 PDT 2022
gregrodgers created this revision.
Herald added subscribers: guansong, yaxunl.
Herald added a project: All.
gregrodgers requested review of this revision.
Herald added a reviewer: jdoerfert.
Herald added subscribers: openmp-commits, sstefan1.
Herald added a project: OpenMP.
This review creates new DeviceRTL helper functions to support reductions in
OpenMP that are 50 to over 100 times faster than current openmp reductions.
The clang codegen to call these functions is not complete.
However, this review contains an extensive test for all the functions by
simulating the reduction with OpenMP without the reduction clause.
The test does the equivalent reduction with current OpenMP to show correctness
and performance.
EXAMPLE: Given this OpenMP reduction code, which is a classic dot product
with double precision vectors.
double sum = 0.0;
#pragma omp target teams distribute parallel for map(tofrom: sum) reduction(+:sum)
for (int64_t i = 0; i < array_size; i++)
sum += a[i] * b[i];
A reduction is defined by two pair reduction functions and the reduction null
value (rnv). In the above example, the reduction is defined by the pair
reduction function summing two double values, a pair reduction function
summing two LDS doubles, and rnv = (double) 0. The pair reduction functions
for built-in sum reduction are __kmpc_rfun_sum_d and __kmpc_rfun_sum_lds_d.
See Xteamr.cpp file for definitions of all xteamr functions.
Currently, the xteamr helper functions support 8 data types and 6 thread
configurations, 3 thread confis for warpsize 32 and 3 for warpsize 64.
Clang will generate code equivalent to the following simulation of
the reduction in OpenMP with a target offload pragma NOT containing a
reduction clause.
#define _NUM_THREADS 512 // or 1024 or 256, number of waves must be power of 2
#define _NUM_TEAMS 80 // or get this value from ompx_get_device_num_units(devid)
devid = 0 ; // default device or device from target construct
uint32_t zero = 0;
struct loop_ctl_t {
uint32_t *td_ptr; // Atomic counter accessed on device
uint32_t reserved; // reserved
const int64_t stride = 1; // stride to process input vectors
const int64_t offset = 0; // Offset to index of input vectors
const int64_t size = _ARRAY_SIZE; // Size of input vectors
const T rnv = T(0); // reduction null value
T *team_vals; // array of global team values
};
loop_ctl_t lc0; // Create and initialize a loop control structure.
lc0.team_vals = (T *)omp_target_alloc(sizeof(T) * _NUM_TEAMS, devid);
lc0.td_ptr = (uint32_t *)omp_target_alloc(sizeof(uint32_t), devid);
omp_target_memcpy(lc0.td_ptr, &zero, sizeof(uint32_t), 0, 0, devid,
omp_get_initial_device());
#pragma omp target teams distribute parallel for num_teams(_NUM_TEAMS) \
num_threads(_NUM_THREADS) map(tofrom:sum) map(to:lc0)
for (unsigned int k=0; k<(_NUM_TEAMS*_NUM_THREADS) ; k++) {
T val0 = lc0.rnv;
// This is the "BIGJUMP" loop code-gened by clang. A more complex form
// is actually generated using lc0 when non-zero offset and/or
// non-one stride loop is required.
for (int64_t i = 0; i<VSIZE; i += (TEAM_PROCS * _NUM_THREADS ))
val0 += a[i] * b[i]; // This is the outlined function.
// Each k iteration calls the helper function whose name is based on,
// data type, and wave config. Args include the function pointers
// for the pair reduction function.
__kmpc_xteamr_d_16x32(val0, &sum, lc0.team_vals, lc0.td_ptr,
__kmpc_rfun_sum_d, __kmpc_rfun_sum_lds_d, lc0.rnv);
}
In openmp/libomptarget/test/xteamr there is a comprehensive test of the
xteamr helper functions defined in this review.
The test_xteamr.cpp code in this review will generate the following
output on Nvidia volta with 80 teams and 512 threads. This shows
significant performance improvement over current OpenMP reductions.
For example, finding the minmum float value in a vector of floats was
over 140 times faster than the current openmp and summing doubles
was over 50 times faster.
TEST DOUBLE 512 THREADS
Running kernels 12 times
Ignoring timing of first 2 runs
Precision: double
Warp size:32
Array elements: 41943040
Array size: 320 MB
Function Best-MB/sec Min (sec) Max Average Avg-MB/sec
ompdot 15517 0.043249 0.043364 0.043276 15507
simdot 847885 0.000791 0.000820 0.000803 835647
ompmax 7756 0.043260 0.043370 0.043289 7751
simmax 775913 0.000432 0.000451 0.000438 765812
ompmin 7755 0.043266 0.043312 0.043285 7752
simmin 773780 0.000434 0.000447 0.000439 765161
TEST FLOAT 512 THREADS
Running kernels 12 times
Ignoring timing of first 2 runs
Precision: float
Warp size:32
Array elements: 41943040
Array size: 160 MB
Function Best-MB/sec Min (sec) Max Average Avg-MB/sec
ompdot 7816 0.042930 0.043042 0.042965 7810
simdot 744750 0.000451 0.000468 0.000459 730648
ompmax 3901 0.043010 0.043147 0.043042 3898
simmax 557542 0.000301 0.000309 0.000307 546905
ompmin 3898 0.043041 0.043245 0.043094 3893
simmin 562826 0.000298 0.000337 0.000307 546866
TEST INT 512 THREADS
Running kernels 12 times
Ignoring timing of first 2 runs
Integer Size: 4
Warp size:32
Array elements: 41943040
Array size: 160 MB
Function Best-MB/sec Min (sec) Max Average Avg-MB/sec
ompdot 7811 0.042957 0.042976 0.042966 7810
simdot 755215 0.000444 0.000460 0.000451 743335
ompmax 3900 0.043020 0.043047 0.043032 3899
simmax 561618 0.000299 0.000309 0.000304 551575
ompmin 3897 0.043057 0.043101 0.043071 3895
simmin 552518 0.000304 0.000314 0.000309 543749
TEST UNSIGNED INT 512 THREADS
Running kernels 12 times
Ignoring timing of first 2 runs
Integer Size: 4
Warp size:32
Array elements: 41943040
Array size: 160 MB
Function Best-MB/sec Min (sec) Max Average Avg-MB/sec
ompdot 7815 0.042936 0.042965 0.042949 7813
simdot 750056 0.000447 0.000460 0.000453 741305
ompmax 3900 0.043023 0.043048 0.043032 3899
simmax 559142 0.000300 0.000309 0.000305 550425
ompmin 3897 0.043049 0.043074 0.043064 3896
simmin 555710 0.000302 0.000311 0.000307 547178
TEST LONG 512 THREADS
Running kernels 12 times
Ignoring timing of first 2 runs
Integer Size: 8
Warp size:32
Array elements: 41943040
Array size: 320 MB
Function Best-MB/sec Min (sec) Max Average Avg-MB/sec
ompdot 15501 0.043294 0.043331 0.043309 15495
simdot 811168 0.000827 0.000850 0.000838 800531
ompmax 7763 0.043222 0.043256 0.043241 7760
simmax 721225 0.000465 0.000482 0.000472 710672
ompmin 7764 0.043217 0.043251 0.043234 7761
simmin 728138 0.000461 0.000471 0.000467 718363
TEST UNSIGNED LONG 512 THREADS
Running kernels 12 times
Ignoring timing of first 2 runs
Integer Size: 8
Warp size:32
Array elements: 41943040
Array size: 320 MB
Function Best-MB/sec Min (sec) Max Average Avg-MB/sec
ompdot 15485 0.043339 0.043378 0.043359 15478
simdot 817903 0.000820 0.000835 0.000829 809555
ompmax 8004 0.041921 0.042024 0.041938 8001
simmax 722079 0.000465 0.000479 0.000473 709582
ompmin 7754 0.043272 0.043298 0.043285 7752
simmin 721427 0.000465 0.000483 0.000474 707325
TEST DOUBLE COMPLEX 512 THREADS
Running kernels 12 times
Ignoring timing of first 2 runs
Precision: double _Complex
Warp size:32
Array elements: 41943040
Array size: 640 MB
Function Best-MB/sec Min (sec) Max Average Avg-MB/sec
ompdot 30727 0.043680 0.043714 0.043700 30714
simdot 861368 0.001558 0.001579 0.001566 857175
TEST FLOAT COMPLEX 512 THREADS
Running kernels 12 times
Ignoring timing of first 2 runs
Precision: float _Complex
Warp size:32
Array elements: 41943040
Array size: 320 MB
Function Best-MB/sec Min (sec) Max Average Avg-MB/sec
ompdot 15453 0.043428 0.043452 0.043440 15449
simdot 797626 0.000841 0.000861 0.000849 790188
Repository:
rG LLVM Github Monorepo
https://reviews.llvm.org/D136641
Files:
openmp/libomptarget/DeviceRTL/CMakeLists.txt
openmp/libomptarget/DeviceRTL/include/Interface.h
openmp/libomptarget/DeviceRTL/include/Xteamr.h
openmp/libomptarget/DeviceRTL/src/Xteamr.cpp
openmp/libomptarget/test/xteamr/test_xteamr.cpp
openmp/libomptarget/test/xteamr/test_xteamr.h
openmp/libomptarget/test/xteamr/test_xteamr.sh
-------------- next part --------------
A non-text attachment was scrubbed...
Name: D136641.470299.patch
Type: text/x-patch
Size: 135646 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/openmp-commits/attachments/20221024/59b900ec/attachment-0001.bin>
More information about the Openmp-commits
mailing list