[PATCH] D142232: Increase inline threshold multiplier to 11 in nvptx backend.

Artem Belevich via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Wed Feb 1 10:48:28 PST 2023


tra accepted this revision.
tra added a comment.
This revision is now accepted and ready to land.

In D142232#4097030 <https://reviews.llvm.org/D142232#4097030>, @JackAKirk wrote:

> Do you know of any clear instructions for building any branch of cutlass, eigen, or thrust with tip-of-tree clang cuda? Thrust is currently broken on clang cuda (https://github.com/NVIDIA/thrust/issues/1853#issuecomment-1401952879) and I haven't had any more luck trying to build the others.
> That said, I have apparently been able to build tensorflow with clang cuda.

Tensorflow is the only open-source build using clang to compile CUDA that I'm aware of. That build also includes NCCL, so you can sort of count that in, too.

> Summary: we plan to raise the inlining threshold multiplier in intel/llvm to 11, and you may wish to do the same upstream:

Are you saying that you do not plan to land this patch upstream?
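
For reference, the change under review boils down to bumping a single TTI hook in the NVPTX backend. A minimal sketch, assuming the current shape of `llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.h` (not the verbatim patch; the exact signature may differ between LLVM versions):

```cpp
// Sketch of llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.h.
// Calls are particularly expensive in NVPTX, so the backend scales the
// generic inlining threshold up. With LLVM's default threshold of 225,
// a multiplier of 11 gives an effective threshold of 225 * 11 = 2475,
// up from 225 * 5 = 1125 with the previous multiplier of 5.
class NVPTXTTIImpl : public BasicTTIImplBase<NVPTXTTIImpl> {
public:
  unsigned getInliningThresholdMultiplier() { return 11; }
  // ... rest of the TTI implementation unchanged ...
};
```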

> I looked into building cutlass etc. with clang_cuda, but currently their CMake doesn't easily support it.

I'll try to get their CMake files improved to work with clang. The recent versions of cutlass *are* buildable with clang, with a few patches that should've been upstreamed by now.

> You can now build thrust with clang_cuda via their CMake: https://github.com/NVIDIA/thrust/issues/1853#issuecomment-1402287672

Good to know.

> When building thrust with clang_cuda, and in all other applications I have built with clang_cuda or the dpcpp cuda backend, I find zero dependence of compile time on the threshold multiplier. I even tried building thrust with the threshold multiplier set to 1000 (instead of 5), and thrust's compile time didn't change.

For what it's worth, the issues related to inlining/unrolling thresholds that I've run into in the past fall into these broad categories:

- Thresholds are too small. GPU code relies heavily on inlining and loop unrolling for performance. This is the most common scenario; the standard LLVM thresholds don't always work well on a GPU.
- Thresholds are too high. The kernels become too large and start spilling registers into memory, which is very expensive. This mostly affects loop unrolling; I think I only had to add `noinline` once (a sketch of these escape hatches follows the list).
- Compilation slowdown due to IR size explosion. This needs a lot of inlining candidates just under the threshold and tends to happen with template-heavy code. However, the compiler also tends to spend a lot of time in the C++ front-end in such cases, so the increase in optimization time is not always prominent.
- In all, or at least most, of the really problematic cases, I think the root cause of the slowdown was some other optimization pass with superlinear complexity.
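
To make the "thresholds are too high" case concrete, here is a hedged CUDA-flavored sketch of the usual escape hatches when the heuristics overshoot. The kernel and helper are made up for illustration, not taken from any of the workloads discussed here:

```cpp
// Hypothetical example of reining in the heuristics. __noinline__ and
// #pragma unroll are standard CUDA annotations that clang's CUDA
// support honors.
__device__ __noinline__ float expensive_helper(float x) {
  // Keeping this out of line avoids duplicating its body into every
  // caller, which can reduce register pressure (and thus spills) in
  // large kernels.
  return x * x + 1.0f;
}

__global__ void kernel(const float *in, float *out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n)
    return;
  float acc = 0.0f;
  // Cap unrolling explicitly instead of relying on the cost thresholds.
#pragma unroll 4
  for (int k = 0; k < 16; ++k)
    acc += expensive_helper(in[i] + k);
  out[i] = acc;
}
```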

> I also cannot find any effect of the multiplier on the execution time of thrust tests.

I'd say it's unsurprising. As I mentioned before, there are relatively few cases where it's really needed in practice.

> However, dpcpp cuda was again much more performant when the value was raised to 11 (https://gitlab.com/gromacs/gromacs/-/issues/3928#note_1259844220).

How does performance compare between clang and dpcpp?

> Generally, then, the multiplier value of 5 appears to be good enough for clang_cuda most of the time, but in my testing a value of 11 is still better.

I think you've done sufficient due-diligence checks to show that `11` does not produce obvious negative side effects. The incremental IR growth and compile-time increase are a reasonable trade-off for improving a known weakness we have in NVPTX. I think it's a net gain.

I would suggest landing the patch in upstream LLVM.

That said, as with any heuristic change, there's a chance that it may have unforeseen consequences. If/when that happens, we can easily roll it back and users should be able to override the threshold manually.
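
For completeness: if I remember the plumbing correctly, the scaled threshold can be forced from the clang driver with `-mllvm -inline-threshold=<N>`, which sidesteps the multiplier-scaled default for one-off experiments; the exact flag spelling and its interaction with the pass pipeline are worth double-checking against your LLVM version.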

I'll keep an eye on whether it causes any issues on our builds.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D142232/new/

https://reviews.llvm.org/D142232


