<html>

    <head>

      <base href="https://bugs.llvm.org/">

    </head>

    <body><table border="1" cellspacing="0" cellpadding="8">

        <tr>

          <th>Bug ID</th>

          <td><a class="bz_bug_link 

          bz_status_NEW "

   title="NEW - Performance regression in CUDA code compiled by Clang 13 versus Clang 12 (Clang 12 ~50% faster than Clang 13)"

   href="https://bugs.llvm.org/show_bug.cgi?id=52037">52037</a>

          </td>

        </tr>


        <tr>

          <th>Summary</th>

          <td>Performance regression in CUDA code compiled by Clang 13 versus Clang 12 (Clang 12 ~50% faster than Clang 13)

          </td>

        </tr>


        <tr>

          <th>Product</th>

          <td>clang

          </td>

        </tr>


        <tr>

          <th>Version</th>

          <td>trunk

          </td>

        </tr>


        <tr>

          <th>Hardware</th>

          <td>PC

          </td>

        </tr>


        <tr>

          <th>OS</th>

          <td>Linux

          </td>

        </tr>


        <tr>

          <th>Status</th>

          <td>NEW

          </td>

        </tr>


        <tr>

          <th>Severity</th>

          <td>enhancement

          </td>

        </tr>


        <tr>

          <th>Priority</th>

          <td>P

          </td>

        </tr>


        <tr>

          <th>Component</th>

          <td>CUDA

          </td>

        </tr>


        <tr>

          <th>Assignee</th>

          <td>unassignedclangbugs@nondot.org

          </td>

        </tr>


        <tr>

          <th>Reporter</th>

          <td>giuseppe.bilotta@gmail.com

          </td>

        </tr>


        <tr>

          <th>CC</th>

          <td>llvm-bugs@lists.llvm.org

          </td>

        </tr></table>

      <p>

        <div>

        <pre>We have recently added support for building all of GPUSPH

(<a href="https://www.gpusph.org">https://www.gpusph.org</a>, code on <a href="https://github.com/GPUSPH/gpusph">https://github.com/GPUSPH/gpusph</a>) using Clang

for both the host and device (CUDA) code, with excellent results when using

Clang 12 (performance even higher than nvcc). However, it seems that Clang 13

has introduced a regression, leading to worse performance: Clang 12 is around

50% faster than Clang 13.


Assuming Clang 12 and 13 are accessible in the path as clang++-12 and

clang++-13, this can be verified in the following way:


* clone GPUSPH (git clone git@github.com:GPUSPH/gpusph && cd gpusph)

* switch to the `next` branch (git checkout -b next origin/next)

* run: `make clean && make chrono=0 clang=12 ./DamBreak3D && ./DamBreak3D

--maxiter 1000 --nosave`

* run: `make clean && make chrono=0 clang=13 ./DamBreak3D && ./DamBreak3D

--maxiter 1000 --nosave`
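
The two build-and-run steps above can also be scripted. A minimal sketch in
Python, assuming an existing checkout of the `next` branch in a directory named
gpusph (the directory name and the helper names are illustrative only, they are
not part of GPUSPH):

```python
import subprocess


def benchmark_commands(clang_version):
    """Return the commands from the steps above for one Clang version."""
    return [
        ["make", "clean"],
        ["make", "chrono=0", f"clang={clang_version}", "./DamBreak3D"],
        ["./DamBreak3D", "--maxiter", "1000", "--nosave"],
    ]


def run_benchmark(clang_version, repo_dir="gpusph"):
    """Build and run the test case; repo_dir is an assumed checkout path."""
    for cmd in benchmark_commands(clang_version):
        # check=True stops at the first failing step, like a shell AND chain.
        subprocess.run(cmd, cwd=repo_dir, check=True)
```

Calling run_benchmark(12) and then run_benchmark(13) reproduces the two runs;
the MIPPS value still has to be read from the last 'Simulation time' line.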


The performance of the code can be compared by looking at the MIPPS (millions of

iterations times particles per second) on the last 'Simulation time' line. On

my machine (GTX 1650, CC 7.5), building with Clang 12 gives me 3.1 MIPPS,

versus 2.1 MIPPS with Clang 13.
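
For reference, the arithmetic behind the "around 50%" figure can be sketched as
follows, using only the two MIPPS values above (the variable names are
illustrative):

```python
# MIPPS figures reported above for the GTX 1650.
mipps_clang12 = 3.1
mipps_clang13 = 2.1

# Throughput gain of Clang 12 measured against Clang 13.
speedup = mipps_clang12 / mipps_clang13 - 1.0
# Throughput loss of Clang 13 measured against Clang 12.
slowdown = 1.0 - mipps_clang13 / mipps_clang12

print(f"Clang 12 is {speedup:.0%} faster than Clang 13")   # prints 48%
print(f"Clang 13 is {slowdown:.0%} slower than Clang 12")  # prints 32%
```

The same gap reads as roughly 48% or 32% depending on which compiler is taken
as the baseline.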


The main difference in performance between the two versions is in the

forcesDevice kernels, which are between 3x and 4x faster when compiled with

Clang 12 compared to Clang 13. (FWIW, nvcc achieves performance comparable to

Clang 13.) From a quick analysis it seems that the Clang 12 version uses more

registers per thread, and the Clang 13 version has higher memory usage with

lower cache throughput (more spilling?).


Some of the kernels involved have a __launch_bounds__ specification, but

removing it does not affect the result.</pre>

        </div>

      </p>



    </body>

</html>