[PATCH] D142232: Increase inline threshold multiplier to 11 in nvptx backend.

Fri Jan 20 11:26:01 PST 2023

tra added a comment.

Bumping the inlining threshold is generally beneficial for GPU performance. Due to a quirk in the PTX spec, we often have to create local copies of byval function arguments, which is very expensive performance-wise, and the only way to mitigate the issue is to force-inline the functions. This becomes particularly painful in modern lambda-heavy C++ code.

That said, in my experience, there are relatively few cases where such a bump is really needed, and in a lot of them explicitly force-inlining the culprit function was sufficient. While I see how the change is beneficial for performance, I'm on the fence about whether that benefit is worth everyone paying the cost of increased compile time.
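
To illustrate what "explicitly force-inlining the culprit function" can look like, here is a minimal sketch. It is not taken from the patch: `Coeffs`, `evalPoly`, `applyTwice`, `demo`, and the `HD_FORCEINLINE` macro are hypothetical names, and the non-CUDA branch assumes a GCC- or Clang-compatible compiler. The idea is that a callee taking a large aggregate by value (directly, or via a lambda capture) is marked always-inline, so after inlining there is no call boundary left at which a local copy of the argument would have to be made.

```cpp
#include <cstddef>

// Hypothetical helper macro: under a CUDA compile use the CUDA qualifiers,
// otherwise fall back to the GCC/Clang always_inline attribute so the same
// source also builds as plain host C++.
#if defined(__CUDACC__)
#define HD_FORCEINLINE __host__ __device__ __forceinline__
#else
#define HD_FORCEINLINE inline __attribute__((always_inline))
#endif

// Hypothetical "large" aggregate that is passed by value.
struct Coeffs {
  float c[16];
};

// The callee takes Coeffs by value. Marking it force-inline asks the
// compiler to inline it regardless of the inline threshold.
HD_FORCEINLINE float evalPoly(Coeffs k, float x) {
  float acc = 0.0f;
  // Horner's rule, highest-order coefficient first.
  for (std::size_t i = 16; i-- > 0;)
    acc = acc * x + k.c[i];
  return acc;
}

// Lambda-heavy style: the functor (with its by-value capture) is itself
// passed by value, so this wrapper is force-inlined as well.
template <typename F>
HD_FORCEINLINE float applyTwice(F f, float x) {
  return f(f(x));
}

// Small driver: builds the polynomial 1 + 2x and applies it twice.
inline float demo(float x) {
  Coeffs k{};  // value-initialized: all coefficients are zero
  k.c[0] = 1.0f;
  k.c[1] = 2.0f;
  auto poly = [k](float v) { return evalPoly(k, v); };
  return applyTwice(poly, x);
}
```

With this approach only the functions known to be problematic are inlined unconditionally, and the global threshold, along with everyone else's compile time, stays as it is.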

> This value of 11 is optimal for clang++ cuda for all cases I've investigated.

It's very likely that the subset is not representative. I'm sure it will help the benchmarks, but that's not what most people usually compile.
Measuring impact on a somewhat realistic workload would be very helpful for establishing that the trade-off is worth it.

I would suggest checking how the change affects compilation of these projects:

- https://www.tensorflow.org/
- https://github.com/NVIDIA/cutlass
- https://github.com/NVIDIA/thrust
- https://gitlab.com/libeigen/eigen

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D142232/new/

https://reviews.llvm.org/D142232