[llvm-bugs] [Bug 49588] New: Clang/OpenMP schedule(dynamic): severe scaling issue, up to 2.5x slower than GCC/OpenMP

via llvm-bugs llvm-bugs at lists.llvm.org
Sun Mar 14 10:34:14 PDT 2021


https://bugs.llvm.org/show_bug.cgi?id=49588

            Bug ID: 49588
           Summary: Clang/OpenMP schedule(dynamic): severe scaling issue,
                    up to 2.5x slower than GCC/OpenMP
           Product: OpenMP
           Version: unspecified
          Hardware: PC
                OS: Linux
            Status: NEW
          Severity: enhancement
          Priority: P
         Component: Runtime Library
          Assignee: unassignedbugs at nondot.org
          Reporter: kim.walisch at gmail.com
                CC: llvm-bugs at lists.llvm.org

Created attachment 24647
  --> https://bugs.llvm.org/attachment.cgi?id=24647&action=edit
Benchmark script Clang/OpenMP vs GCC/OpenMP

Hi,

Recently, while running benchmarks on servers with a large number of CPU cores
(e.g. 48 cores/96 threads or more), I found a severe Clang/OpenMP scaling issue
(tested with Clang 11) when using a parallel for loop with schedule(dynamic).
My primecount program (https://github.com/kimwalisch/primecount) uses dynamic
thread scheduling for many of its algorithms. My code usually looks like this:

#pragma omp parallel for schedule(dynamic) num_threads(threads) \
                         reduction(+:sum)
for (int64_t i = start; i < iters; i++)
{
   sum += get_sum(i);
}

My primecount program has an algorithm named S2_easy
(https://github.com/kimwalisch/primecount/blob/v6.3/src/deleglise-rivat/S2_easy_libdivide.cpp#L173)
which uses the OpenMP parallel for loop shown above. Below are benchmark
timings of my primecount program compiled with Clang/OpenMP and with
GCC/OpenMP, run on my dual-socket AMD EPYC Rome server with 96 cores/192
threads (Ubuntu 20.04 x64). Both binaries computed identical results (shown
in the Result column).

S2_easy(x) | Result                | Clang (s) | GCC (s)
-----------+-----------------------+-----------+---------
1e17       | 605912179291437       |     0.272 |    0.300
1e18       | 5901781179977516      |     1.011 |    1.137
1e19       | 57056063072961387     |     5.380 |    6.821
1e20       | 549803428290505054    |    26.916 |   33.519
1e21       | 5278045123014553812   |   162.901 |  154.708
1e22       | 50561116428768745222  |  1198.296 |  701.485
1e23       | 483646507922918296127 |  8129.958 | 3181.663

Up to about 1e20 Clang is measurably faster, but then the scaling issue kicks
in and Clang's performance deteriorates more and more: at 1e23 Clang is 2.5x
slower than GCC. I have attached a bash script to this bug report that clones
my primecount program, builds it with both Clang and GCC, and then runs the
same benchmark as above (using all CPU cores). Note that in order to reproduce
this scaling issue it is best to run the benchmark on a server with a large
number of CPU cores, otherwise it might run for days...
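
For anyone who wants to experiment without building primecount, a minimal
standalone sketch of the same pattern might look as follows (hypothetical
code, not taken from primecount; work() merely simulates the uneven
per-iteration cost of get_sum()):

// repro.cpp - hypothetical minimal reproducer sketch.
// Build: clang++ -O2 -fopenmp repro.cpp -o repro
#include <cstdint>
#include <cstdio>
#include <omp.h>

// Simulates get_sum(): the per-iteration cost is deliberately uneven,
// which is the reason for using schedule(dynamic) in the first place.
static int64_t work(int64_t i)
{
    int64_t sum = 0;
    for (int64_t j = 0; j < (i % 200) * 1000; j++)
        sum += j & 7;
    return sum;
}

int main()
{
    int64_t iters = 100000;
    int64_t sum = 0;
    int threads = omp_get_max_threads();
    double t = omp_get_wtime();

    #pragma omp parallel for schedule(dynamic) num_threads(threads) \
                             reduction(+:sum)
    for (int64_t i = 0; i < iters; i++)
        sum += work(i);

    std::printf("sum = %lld, seconds = %.3f\n", (long long) sum,
                omp_get_wtime() - t);
    return 0;
}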

I know the scaling issue is caused by schedule(dynamic) because replacing it
with a workaround fixes the problem. E.g. the code below, which uses a
critical section (effectively a mutex) instead of schedule(dynamic), makes
Clang run as fast as or faster than GCC:

int64_t i = 0;

#pragma omp parallel for num_threads(threads) reduction(+:sum)
for (int64_t t = 0; t < threads; t++)
{
   while (true)
   {
      int64_t j;

      // Serialize access to the shared iteration counter.
      #pragma omp critical
      j = i++;

      if (j < iters)
          sum += get_sum(j);
      else
          break;
   }
}
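
Just as a side note: a lock-free variant of this workaround, which replaces
the critical section with a std::atomic counter, should behave equivalently
(hypothetical sketch, I have not benchmarked this variant in primecount):

#include <atomic>

// Same workaround as above, but the shared counter is a std::atomic,
// so no critical section is needed. threads, iters, sum and get_sum()
// are the same as in the snippet above.
std::atomic<int64_t> i(0);

#pragma omp parallel for num_threads(threads) reduction(+:sum)
for (int64_t t = 0; t < threads; t++)
{
   while (true)
   {
      // Atomically grab the next iteration index.
      int64_t j = i.fetch_add(1, std::memory_order_relaxed);

      if (j < iters)
          sum += get_sum(j);
      else
          break;
   }
}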

I have monitored my server while running the primecount binary built with
Clang/OpenMP and found that all CPU cores were busy till the very end, so
this is not a load balancing issue. There was also no memory leak. I don't
know exactly what causes the scaling issue; my best bet is that it is caused
by your thread pool and how it schedules threads. (Ideally, the thread pool
should simply hand a thread that has finished an iteration a new iteration
on the same core it was running on before; I can imagine your code tries to
be smarter and something goes horribly wrong.) My assumption is also based
on the fact that I can significantly improve Clang's performance by changing
the OpenMP environment variables that affect thread affinity and waiting
behavior: e.g. both "export KMP_AFFINITY=granularity=fine,compact" and
"export OMP_WAIT_POLICY=PASSIVE" measurably improve Clang's performance but
don't fix the scaling issue.
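
Another mitigation that might be worth testing (hypothetical, I have not
benchmarked it in primecount) is increasing the dynamic chunk size, so that
threads return to the runtime's shared work queue less often:

// Hand out iterations in chunks of e.g. 4 instead of the default 1,
// which reduces contention on the runtime's shared iteration counter
// at the cost of slightly coarser load balancing.
#pragma omp parallel for schedule(dynamic, 4) num_threads(threads) \
                         reduction(+:sum)
for (int64_t i = start; i < iters; i++)
{
   sum += get_sum(i);
}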

This scaling issue is very annoying for me, much more annoying than the
OpenMP bug I submitted earlier this morning. In the meantime, until this
scaling issue is fixed, I have replaced schedule(dynamic) in the important
code sections of my primecount program with the workaround above. As I'm not
intimately familiar with Clang's code base, I cannot fix this bug myself,
but if you need any other help related to this bug, please let me know.

Regards,
Kim Walisch
