[llvm-bugs] [Bug 43998] New: Poor performance of OpenMP distribute construct
via llvm-bugs
llvm-bugs at lists.llvm.org
Wed Nov 13 17:31:14 PST 2019
https://bugs.llvm.org/show_bug.cgi?id=43998
Bug ID: 43998
Summary: Poor performance of OpenMP distribute construct
Product: OpenMP
Version: unspecified
Hardware: PC
OS: Linux
Status: NEW
Severity: normal
Priority: P
Component: Clang Compiler Support
Assignee: unassignedclangbugs at nondot.org
Reporter: csdaley at lbl.gov
CC: llvm-bugs at lists.llvm.org
The OpenMP distribute construct performs significantly worse than manually
dividing loop iterations between thread teams. The test program below shows
the performance of both methods on a system with Intel Skylake CPUs and NVIDIA
V100 GPUs. The performance difference is ~700x (2.087 seconds versus 0.003
seconds for 100 iterations). I am using LLVM/Clang from Nov 11 2019; the same
issue occurs with LLVM/Clang from Aug 28 2019.
$ make
clang++ -std=c++11 -Ofast -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -o test.exe test.cpp
$ srun -n 1 ./test.exe
Number of sites = 1048576
Executing 100 iterations
Time w/distribute = 2.087 seconds
Time workaround = 0.003 seconds
$ cat test.cpp
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include <chrono>

typedef std::chrono::system_clock Clock;

#define ITERATIONS 100
#define TOTAL_SITES 1048576

int main(int argc, char *argv[])
{
  int total_sites = TOTAL_SITES;
  printf("Number of sites = %d\n", total_sites);
  printf("Executing %d iterations\n", ITERATIONS);

  auto tstart = Clock::now();
  for (int iters=0; iters<ITERATIONS; ++iters) {
#pragma omp target teams distribute
    for(int i=0; i<total_sites; ++i) {
      ;
    }
  }
  double sec =
    std::chrono::duration_cast<std::chrono::microseconds>(Clock::now()-tstart).count()
    / 1.0E6;
  printf("Time w/distribute = %.3f seconds\n", sec);

  tstart = Clock::now();
  for (int iters=0; iters<ITERATIONS; ++iters) {
#pragma omp target teams
    {
      int total_teams = omp_get_num_teams();
      int team_id = omp_get_team_num();
      int sites_per_team = (total_sites + total_teams - 1) / total_teams;
      int istart = team_id * sites_per_team;
      if (istart > total_sites) istart = total_sites;
      int iend = istart + sites_per_team;
      if (iend > total_sites) iend = total_sites;

      /* This is the total_sites loop manually chopped up */
      for (int i = istart; i < iend; ++i) {
        ;
      }
    }
  }
  sec =
    std::chrono::duration_cast<std::chrono::microseconds>(Clock::now()-tstart).count()
    / 1.0E6;
  printf("Time workaround = %.3f seconds\n", sec);
}
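
For reference, the chunking arithmetic used in the workaround can be checked on
the host without any offloading. The sketch below is illustrative only:
team_range and covers_exactly are hypothetical helper names that do not appear
in the test program, and the team counts passed to them are assumptions rather
than values reported by the runtime.

```cpp
#include <cassert>
#include <utility>

// Hypothetical helper: the half-open range [istart, iend) that team `team_id`
// would own, using the same ceiling-division split as the workaround above.
std::pair<int, int> team_range(int total_sites, int total_teams, int team_id)
{
  assert(total_teams > 0 && team_id >= 0 && team_id < total_teams);
  int sites_per_team = (total_sites + total_teams - 1) / total_teams;
  int istart = team_id * sites_per_team;
  if (istart > total_sites) istart = total_sites;
  int iend = istart + sites_per_team;
  if (iend > total_sites) iend = total_sites;
  return std::make_pair(istart, iend);
}

// Hypothetical helper: true when the per-team ranges tile [0, total_sites)
// with no gaps and no overlaps.
bool covers_exactly(int total_sites, int total_teams)
{
  int expected_start = 0;
  for (int t = 0; t < total_teams; ++t) {
    std::pair<int, int> r = team_range(total_sites, total_teams, t);
    if (r.first != expected_start) return false;
    expected_start = r.second;
  }
  return expected_start == total_sites;
}
```

Teams beyond the end of the iteration space receive an empty range, so the
split also works when there are more teams than sites.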
The performance of the distribute construct can be improved by reducing the
number of teams using the num_teams clause. However, it is never competitive
with manually dividing loop iterations between thread teams.
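
A minimal sketch of the num_teams variant, for anyone who wants to reproduce
the comparison. The function name time_distribute and its parameters are
hypothetical, and no particular team count is implied to be the best choice.
Without -fopenmp the pragma is ignored and the loop simply runs on the host.

```cpp
#include <chrono>

// Hypothetical helper: times `iterations` launches of the empty distribute
// loop with the number of teams capped at `teams`. Returns elapsed seconds.
double time_distribute(int total_sites, int iterations, int teams)
{
  typedef std::chrono::steady_clock Clock;
  auto tstart = Clock::now();
  for (int iters = 0; iters < iterations; ++iters) {
#pragma omp target teams distribute num_teams(teams)
    for (int i = 0; i < total_sites; ++i) {
      ;
    }
  }
  return std::chrono::duration<double>(Clock::now() - tstart).count();
}
```

Calling, for example, time_distribute(1048576, 100, 128) and then repeating
with other team counts gives one timing per setting to compare against the
manual workaround.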
Thanks,
Chris