[llvm-bugs] [Bug 43998] New: Poor performance of OpenMP distribute construct

Wed Nov 13 17:31:14 PST 2019

https://bugs.llvm.org/show_bug.cgi?id=43998

            Bug ID: 43998
           Summary: Poor performance of OpenMP distribute construct
           Product: OpenMP
           Version: unspecified
          Hardware: PC
                OS: Linux
            Status: NEW
          Severity: normal
          Priority: P
         Component: Clang Compiler Support
          Assignee: unassignedclangbugs at nondot.org
          Reporter: csdaley at lbl.gov
                CC: llvm-bugs at lists.llvm.org

The OpenMP distribute construct performs significantly worse than manually
dividing loop iterations between thread teams. The test program below shows
the performance of both methods on a system with Intel Skylake CPUs and
NVIDIA V100 GPUs. The performance difference is ~700x. I am using LLVM/Clang
from Nov 11 2019; the same issue occurs with LLVM/Clang from Aug 28 2019.

$ make
clang++ -std=c++11 -Ofast -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -o test.exe test.cpp

$ srun -n 1 ./test.exe 
Number of sites = 1048576
Executing 100 iterations
Time w/distribute = 2.087 seconds
Time workaround   = 0.003 seconds

$ cat test.cpp
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#include <chrono>
typedef std::chrono::system_clock Clock;

#define ITERATIONS 100
#define TOTAL_SITES 1048576

int main(int argc, char *argv[])
{
  int total_sites = TOTAL_SITES;
  printf("Number of sites = %d\n", total_sites);
  printf("Executing %d iterations\n", ITERATIONS);

  auto tstart = Clock::now();
  for (int iters=0; iters<ITERATIONS; ++iters) {
    #pragma omp target teams distribute
    for(int i=0; i<total_sites; ++i) {
      ;
    }
  }
  double sec =
    std::chrono::duration_cast<std::chrono::microseconds>(Clock::now()-tstart).count()
    / 1.0E6;
  printf("Time w/distribute = %.3f seconds\n", sec);

  tstart = Clock::now();
  for (int iters=0; iters<ITERATIONS; ++iters) {
    #pragma omp target teams
    {
      int total_teams = omp_get_num_teams();
      int team_id = omp_get_team_num();
      int sites_per_team = (total_sites + total_teams - 1) / total_teams;
      int istart = team_id * sites_per_team;
      if (istart > total_sites) istart = total_sites;
      int iend = istart + sites_per_team;
      if (iend > total_sites) iend = total_sites;

      /* This is the total_sites loop manually chopped up */
      for (int i = istart; i < iend; ++i) {
        ;
      }
    }
  }
  sec =
    std::chrono::duration_cast<std::chrono::microseconds>(Clock::now()-tstart).count()
    / 1.0E6;
  printf("Time workaround   = %.3f seconds\n", sec);
}
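
The manual split in the second loop is a ceiling-division block partition
of the iteration space. As a minimal host-only sketch of that arithmetic
(team_range and covered_sites are hypothetical helpers, not part of
test.cpp), the per-team bounds can be checked to tile [0, total_sites)
without gaps or overlap:

```cpp
#include <algorithm>
#include <cassert>
#include <utility>

// Hypothetical helper (not in test.cpp): returns the half-open range
// [istart, iend) handled by team `team_id`, using the same
// ceiling-division split as the workaround loop.
std::pair<int, int> team_range(int team_id, int total_teams, int total_sites)
{
  int sites_per_team = (total_sites + total_teams - 1) / total_teams;
  int istart = std::min(team_id * sites_per_team, total_sites);
  int iend = std::min(istart + sites_per_team, total_sites);
  return {istart, iend};
}

// Hypothetical helper: sums the range lengths over all teams. The result
// equals total_sites when the ranges tile the iteration space.
int covered_sites(int total_teams, int total_sites)
{
  int covered = 0;
  for (int t = 0; t < total_teams; ++t) {
    std::pair<int, int> r = team_range(t, total_teams, total_sites);
    assert(r.first == covered);  // each range starts where the previous one ended
    covered += r.second - r.first;
  }
  return covered;
}
```

When total_teams exceeds total_sites, the trailing teams get empty ranges
(istart == iend == total_sites), which matches the two clamping `if`
statements in the workaround.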

The performance of the distribute construct can be improved by reducing the
number of teams using the num_teams clause. However, it is never competitive
with manually dividing loop iterations between thread teams.
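
For anyone who wants to reproduce the num_teams observation, here is a
minimal sketch of how the distribute loop could be timed for a given team
count. The helper names time_distribute and sweep_num_teams and the team
counts in the sweep are assumptions chosen for illustration, not values
from my runs. It needs the same build flags as test.cpp to offload.

```cpp
#include <chrono>
#include <cstdio>

// Hypothetical helper: times `iterations` launches of the empty distribute
// loop while requesting `teams` teams via the num_teams clause.
double time_distribute(int teams, int total_sites, int iterations)
{
  auto tstart = std::chrono::steady_clock::now();
  for (int iters = 0; iters < iterations; ++iters) {
    #pragma omp target teams distribute num_teams(teams)
    for (int i = 0; i < total_sites; ++i) {
      ;
    }
  }
  std::chrono::duration<double> elapsed =
    std::chrono::steady_clock::now() - tstart;
  return elapsed.count();
}

// Hypothetical sweep: the team counts below are illustrative only.
void sweep_num_teams(int total_sites, int iterations)
{
  const int team_counts[] = {32, 128, 512, 2048};
  for (int teams : team_counts) {
    double sec = time_distribute(teams, total_sites, iterations);
    printf("num_teams = %4d: %.3f seconds\n", teams, sec);
  }
}
```

Calling sweep_num_teams(1048576, 100) uses the same problem size and
iteration count as the test program above. The num_teams clause is a
request, so the runtime may launch fewer teams than asked for.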

Thanks,
Chris
