[PATCH] D142232: Increase inline threshold multiplier to 11 in nvptx backend.

Jack Kirk via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Wed Feb 1 09:04:00 PST 2023


JackAKirk added a comment.

In D142232#4074418 <https://reviews.llvm.org/D142232#4074418>, @tra wrote:

> I would expect thrust to be close to the worst-case scenario, and it also has a pretty extensive set of tests to compile. If there's no major compile-time regression on thrust tests, I'll be fine with the patch.
>
>> I am currently investigating tensorflow. Will post updates soon.
>
> Keep in mind that GPU-related part of TF compilation is relatively small, compared to the rest of the build, so results there may be noisy. 
> The other projects on the list have more GPU code to compile. They may also be much easier to compile, compared to tensorflow.
>
> You may also add https://github.com/NVIDIA/nccl to the list (It's also part of TF build).

Do you know of any clear instructions for building any branch of cutlass, eigen, or thrust with clang cuda tip? Thrust is currently broken on clang cuda (https://github.com/NVIDIA/thrust/issues/1853#issuecomment-1401952879), and I haven't had any more luck trying to build the others.
I have, however, apparently been able to build tensorflow with clang cuda.

In D142232#4077906 <https://reviews.llvm.org/D142232#4077906>, @tra wrote:

> In D142232#4076580 <https://reviews.llvm.org/D142232#4076580>, @JackAKirk wrote:
>
>> OK, I've looked into building thrust with clang cuda. Ran into some issues: https://github.com/NVIDIA/thrust/issues/1853
>
> Newer versions of thrust may have issues w/ clang. In the past it regularly needed portability fixes. An old thrust revision f5ea60fd3aa3828c0eb8991a54acdfbed6707bd7 should be buildable w/ clang, though the CMakeFiles there may be too old to support clang as the cuda compiler. If you run into too much trouble, just skip it.
>
> cutlass and nccl may be in better shape. Sorry about being vague -- we do compile cutlass/nccl/thrust with clang, but not always a recent version and we're not relying on cmake to do it, so I can't say what's the state of the official build of those projects when it comes to using clang as a CUDA compiler.
> For the purposes of this experiment a quick-and-dirty solution of configuring the build to use nvcc, capturing the commands run by the build, editing them to replace NVCC and NVCC-specific options with clang equivalents, and running those commands as a script may do the trick.




Cheers.

So here is the situation:

Summary: we plan to raise the inlining threshold multiplier in intel/llvm to 11, and you may wish to do the same upstream.

I looked into building cutlass etc. with clang_cuda, but currently their cmake doesn't easily support it.
You can now build thrust using their cmake with clang_cuda using: https://github.com/NVIDIA/thrust/issues/1853#issuecomment-1402287672
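As a rough illustration (the exact invocation is in the linked GitHub comment; the values below, such as the GPU architecture, are assumptions for the sketch), pointing thrust's CMake build at clang as the CUDA compiler looks something like:

```shell
# Sketch only: use CMake's standard CUDA toolchain variables to select clang
# instead of nvcc. The architecture value (80, i.e. sm_80) is just an example
# and should match your device.
cmake -B build -S thrust \
  -DCMAKE_CUDA_COMPILER=clang++ \
  -DCMAKE_CUDA_ARCHITECTURES=80
cmake --build build -j"$(nproc)"
```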
When building thrust with clang_cuda, and in all other applications I have built with clang_cuda or the dpcpp cuda backend, I find zero dependence of compile time on the threshold multiplier. I even tried building thrust with the multiplier set to 1000 (instead of 5), and thrust's compile time didn't change.
I also cannot find any effect of the multiplier on the execution time of the thrust tests (note that I did not examine every test well enough to be sure the effect is zero with statistical significance, but if there was an effect either way it must have been very small). I don't think this is very surprising, considering that thrust marks things inline very frequently and that clang_cuda appears to be generally less sensitive to the inlining threshold than dpc++ cuda. The one sample where clang_cuda had a very large performance improvement from raising the multiplier to 11 (there were some others in HeCBench with smaller execution-time improvements) was bonds-cuda (https://github.com/zjin-lcf/HeCBench/tree/master/bonds-cuda). You can see that there are many levels of function calls in this code that are not marked inline.
Someone else on my team investigated NWChemEx using dpcpp cuda and found a big improvement from raising the multiplier to 11. I've also investigated GROMACS compiled with both clang_cuda and dpcpp cuda. Again, clang_cuda showed no change in performance between multiplier values of 5 and 11 (one kernel looked like it could be a couple of percent faster with a multiplier of 11, but I didn't run it enough times for this to be statistically significant to many sigma). However, dpcpp cuda was again much more performant when the value was raised to 11 (https://gitlab.com/gromacs/gromacs/-/issues/3928#note_1259844220).
Generally, then, a multiplier value of 5 appears to be good enough for clang_cuda most of the time, but from my testing a value of 11 is still better.
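For a sense of scale, the multiplier simply scales LLVM's base inline threshold (225 by default, via -inline-threshold); the real cost model also applies bonuses and penalties, so this back-of-the-envelope sketch only shows the headline budget:

```python
# Illustrative only: the target's inlining threshold multiplier scales the
# default inline threshold. The actual inline cost analysis adds hint/cold
# bonuses and penalties on top of this.
DEFAULT_INLINE_THRESHOLD = 225

def effective_threshold(multiplier: int) -> int:
    """Headline inlining budget after applying the target's multiplier."""
    return DEFAULT_INLINE_THRESHOLD * multiplier

print(effective_threshold(5))   # current NVPTX multiplier -> 1125
print(effective_threshold(11))  # proposed multiplier -> 2475
```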


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D142232/new/

https://reviews.llvm.org/D142232


