[llvm-bugs] [Bug 52037] New: Performance regression in CUDA code compiled by Clang 13 versus Clang 12 (Clang 12 ~50% faster than Clang 13)

Sat Oct 2 01:46:26 PDT 2021

https://bugs.llvm.org/show_bug.cgi?id=52037

            Bug ID: 52037
           Summary: Performance regression in CUDA code compiled by Clang
                    13 versus Clang 12 (Clang 12 ~50% faster than Clang
                    13)
           Product: clang
           Version: trunk
          Hardware: PC
                OS: Linux
            Status: NEW
          Severity: enhancement
          Priority: P
         Component: CUDA
          Assignee: unassignedclangbugs at nondot.org
          Reporter: giuseppe.bilotta at gmail.com
                CC: llvm-bugs at lists.llvm.org

We have recently added support for building all of GPUSPH
(https://www.gpusph.org, code on https://github.com/GPUSPH/gpusph) using Clang
for both the host and device (CUDA) code, with excellent results when using
Clang 12 (performance even higher than nvcc). However, it seems that Clang 13
has introduced a regression, leading to worse performance: Clang 12 is around
50% faster than Clang 13.

Assuming Clang 12 and 13 are accessible in the path as clang++-12 and
clang++-13, this can be verified in the following way:

* clone GPUSPH (git clone git@github.com:GPUSPH/gpusph && cd gpusph)
* switch to the `next` branch (git checkout -b next origin/next)
* run: `make clean && make chrono=0 clang=12 ./DamBreak3D && ./DamBreak3D --maxiter 1000 --nosave`
* run: `make clean && make chrono=0 clang=13 ./DamBreak3D && ./DamBreak3D --maxiter 1000 --nosave`

The performance of the two builds can be compared by looking at the MIPPS
(millions of iterations times particles per second) reported on the last
'Simulation time' line. On my machine (GTX 1650, CC 7.5), building with
Clang 12 gives me 3.1 MIPPS, versus 2.1 MIPPS with Clang 13.
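
As a quick sanity check on the "~50% faster" figure, the MIPPS numbers above can be turned into relative speedups. This is only a sketch: the `mipps` helper is a hypothetical illustration of the metric as described in this report (iterations times particles per second, in millions), not code taken from GPUSPH, and the two throughput values are the ones measured above.

```python
def mipps(iterations: int, particles: int, seconds: float) -> float:
    """Millions of (iterations x particles) processed per second.

    Hypothetical helper: it follows the definition given in the report,
    not GPUSPH's own implementation.
    """
    return iterations * particles / seconds / 1e6


def percent_faster(fast: float, slow: float) -> float:
    """How much higher `fast` is than `slow`, in percent of `slow`."""
    return (fast / slow - 1.0) * 100.0


# Throughput reported above on a GTX 1650.
clang12_mipps = 3.1
clang13_mipps = 2.1

gain = percent_faster(clang12_mipps, clang13_mipps)
loss = (1.0 - clang13_mipps / clang12_mipps) * 100.0

print(f"Clang 12 is {gain:.0f}% faster than Clang 13")  # prints 48%
print(f"Clang 13 is {loss:.0f}% slower than Clang 12")  # prints 32%
```

Note the asymmetry: the same gap reads as roughly 48% when Clang 13 is the baseline, but only about 32% when Clang 12 is.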

The main difference in performance between the two versions is in the
forcesDevice kernels, which are between 3x and 4x faster when compiled with
Clang 12 compared to Clang 13. (FWIW, nvcc achieves performance comparable to
Clang 13.) From a quick analysis, it seems that the Clang 12 version uses more
registers per thread, while the Clang 13 version has higher memory usage with
lower cache throughput (more spilling?).

Some of the kernels involved have a __launch_bounds__ specification, but
removing it does not affect the result.
