[PATCH] D142232: Increase inline threshold multiplier to 11 in nvptx backend.

Jack Kirk via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Fri Feb 3 03:53:02 PST 2023


JackAKirk added a comment.

In D142232#4097346 <https://reviews.llvm.org/D142232#4097346>, @tra wrote:

> In D142232#4097030 <https://reviews.llvm.org/D142232#4097030>, @JackAKirk wrote:
>
>> Summary: we plan to raise the inlining threshold multiplier in intel/llvm to 11, and you may wish to do the same upstream:
>
> Are you saying that you do not plan to land this patch upstream?

No, I think it would be ideal to land it upstream. It is just that we will make the change in intel/llvm regardless, because it is especially important for dpc++.
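
For reference, the change itself is small: it bumps the per-target inlining multiplier that NVPTX reports to the generic inliner. A minimal sketch of what that amounts to, assuming the hook is getInliningThresholdMultiplier() in llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.h (the exact signature and surrounding class in the tree may differ):

  // Sketch only, not the actual diff. The real definition lives in
  // llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.h and may differ in detail.
  class NVPTXTTIImpl /* : public BasicTTIImplBase<NVPTXTTIImpl> */ {
  public:
    // The generic inliner scales its computed threshold by this per-target
    // multiplier, so NVPTX gets a much larger inlining budget than the host.
    unsigned getInliningThresholdMultiplier() const { return 11; } // was 5
  };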

>> However again dpcpp cuda was much more performant when the value was raised to 11 (https://gitlab.com/gromacs/gromacs/-/issues/3928#note_1259844220).
>
> How does performance compare between clang and dpcpp?

It depends, but dpcpp cuda is usually slower to compile than clang cuda, for a few different reasons that I don't understand in detail. One straightforward reason is that dpcpp always compiles for at least two targets (host and device). Another is that dpc++ deliberately relies on C++ and a fair amount of templating, which can slow down compilation relative to CUDA. There are quite a few potential improvements to dpcpp compilation time that I believe people are actively working on.
For the HeCBench sample most affected by the inline threshold bump (for both clang_cuda and dpcpp):
https://github.com/zjin-lcf/HeCBench/tree/master/bonds-cuda (adjusted to compile with clang_cuda) builds almost exactly four times faster than the corresponding dpcpp code, https://github.com/zjin-lcf/HeCBench/tree/master/bonds-sycl, at both the 5 and 11 values; the execution times, however, are almost exactly the same for dpcpp cuda and clang_cuda.
For clang_cuda, the total build time of bonds-cuda (both real and usr + sys) was 35% slower at 11 than at 5, but the execution time was more than twice as fast at 11. I obtained these figures from four separate compilations/executions for each value, and the standard error of the mean was tiny compared to the differences in times, so I am confident of the 35% slowdown. I also reran the benchmarks on a completely separate occasion as a quick check for any long-timescale correlations (note that I had exclusive access to the cluster, so I did not expect any interference from other users' activity; that is not what I mean by long-timescale correlations) and observed the same times. For the corresponding dpcpp cuda code there was also a slowdown in compilation time from 5 to 11, but it was smaller, around 10%, which probably makes sense given the larger compilation overhead of dpcpp.
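To put the 5 -> 11 bump in perspective, my understanding is that the inliner multiplies whatever threshold it computes for a call site by this per-target value, so the effective budget roughly doubles. A rough sketch, assuming a base threshold of 225 purely for illustration (actual thresholds vary with optimization level and call-site bonuses):

  // Illustration only: how a per-target multiplier scales the inline budget.
  // The base value of 225 and the exact point where the multiplier is applied
  // are assumptions; the real logic lives in LLVM's InlineCost analysis.
  unsigned effectiveThreshold(unsigned BaseThreshold, unsigned Multiplier) {
    return BaseThreshold * Multiplier;
  }
  // effectiveThreshold(225, 5)  == 1125
  // effectiveThreshold(225, 11) == 2475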
For gromacs, from what I understand, a direct compile-time comparison between dpcpp cuda and clang_cuda is not exactly fair, since gromacs uses GPU offload in at least slightly different ways for each. Using `make -j 128`, I did two compilation runs with clang_cuda at a multiplier value of 5; the real times were 3m20.335s and 3m31.028s respectively. I imagine the real time with that many threads could be quite noisy and the mean may differ appreciably from these values. The (usr + sys) timings for these runs were, to the nearest minute, 116 minutes and 114 minutes respectively. The corresponding times for two compilation runs of dpcpp cuda at a value of 5 were 2m1.756s and 1m58.822s (real), and 55 and 54 minutes (usr + sys) respectively.
At the 11 value, the corresponding compilation times for dpcpp cuda were basically the same. For clang_cuda, however, I appeared to observe a slowdown in real compilation time: the real time was 4m31.509s, but (usr + sys) was unchanged at 114 minutes. Note that with `make -j 2` there was no change in compilation time for clang_cuda at 11 versus 5 (in fact 11 was slightly faster, but I only made one measurement and that difference is probably attributable to noise). I imagine the distribution of compilation times with 128 threads could be quite complicated, so I'm not sure any real conclusions can be drawn from these gromacs measurements; I've just included them for completeness.

> I would suggest landing the patch in upstream LLVM.
>
> That said, as with any heuristic change, there's a chance that it may have unforeseen consequences. If/when that happens, we can easily roll it back and users should be able to override the threshold manually.
>
> I'll keep an eye on whether it causes any issues on our builds.

I'd agree that, from the available information, a multiplier value of 11 appears a bit better for clang_cuda too.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D142232/new/

https://reviews.llvm.org/D142232
