[PATCH] D142232: Increase inline threshold multiplier to 11 in nvptx backend.

Jack Kirk via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Mon Jan 23 08:51:42 PST 2023


JackAKirk added a comment.

In D142232#4069780 <https://reviews.llvm.org/D142232#4069780>, @tra wrote:

> Bumping the inlining threshold is generally beneficial for GPU performance. Due to a quirk in the PTX spec, we often have to create local copies of byval function arguments, and that is very expensive performance-wise, with the only way to mitigate the issue being to force-inline the functions. It becomes particularly painful in modern lambda-heavy C++ code.
>
> That said, in my experience there are relatively few cases where such a bump is really needed, and in a lot of them explicitly force-inlining the culprit function was sufficient. While I see how the change is beneficial for performance, I'm on the fence about whether that benefit is worth everyone paying the incremental compile-time cost.
>
>> This value of 11 is optimal for clang++ CUDA for all cases I've investigated.
>
> It's very likely that the subset is not representative. I'm sure it will help the benchmarks, but that's not what most people usually compile.
> Measuring the impact on a somewhat realistic workload would be very helpful for establishing that the trade-off is worth it.
>
> I would suggest checking how the change affects compilation of these projects:
>
> - https://www.tensorflow.org/
> - https://github.com/NVIDIA/cutlass
> - https://github.com/NVIDIA/thrust
> - https://gitlab.com/libeigen/eigen
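
To make the byval point concrete, here is a minimal CUDA sketch of the pattern (the struct, function, and kernel names are hypothetical, and `__noinline__` is used only to pin down the expensive path for illustration):

  struct Params {            // hypothetical aggregate passed by value
    float scale;
    float bias;
    float table[8];          // large enough that a byval copy is costly
  };

  // Passed byval and not inlined: NVPTX typically has to materialise a
  // local-memory copy of `p` for the call, which is expensive.
  __device__ __noinline__ float accumulate(Params p, float x) {
    return p.scale * x + p.bias + p.table[0];
  }

  // The usual workaround: force-inline the callee so the call (and with
  // it the byval copy) disappears.
  __device__ __forceinline__ float accumulate_inl(Params p, float x) {
    return p.scale * x + p.bias + p.table[0];
  }

  __global__ void kernel(Params p, const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
      out[i] = accumulate(p, in[i]) + accumulate_inl(p, in[i]);
  }

In real code there is no `__noinline__`; the cost model simply declines to inline larger callees, which is exactly the case a higher threshold multiplier is meant to catch without per-function annotations.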

Thanks for all the comments. We agree that we should do more compile-time/execution-time testing on samples from larger projects, and checking the ones listed above sounds like a good idea. I have had an initial look at building them and am currently investigating TensorFlow; I will post updates soon.
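
For context on the mechanics: the backend scales the generic inline threshold by a target-specific multiplier via the `getInliningThresholdMultiplier` TTI hook, and this patch raises that factor for NVPTX to 11. A standalone sketch of the arithmetic (the base threshold of 225 and the prior multiplier of 5 are my reading of the tree; treat both as approximate):

  #include <cstdio>

  // Mirrors the shape of the LLVM TTI hook this patch changes; NVPTX
  // previously returned a smaller multiplier (5, if I read the tree
  // correctly), and the patch raises it to 11.
  static unsigned getInliningThresholdMultiplier() { return 11; }

  int main() {
    const unsigned BaseThreshold = 225;  // illustrative -O2-style default
    std::printf("effective inline threshold: %u\n",
                BaseThreshold * getInliningThresholdMultiplier());
    return 0;
  }

The point of the sketch is only that the multiplier is a blunt global knob: it scales the threshold for every call site in NVPTX code, which is why the compile-time impact on large projects matters.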


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D142232/new/

https://reviews.llvm.org/D142232


