[all-commits] [llvm/llvm-project] 22d982: [NVPTX] Increase inline threshold multiplier to 11 in nvptx backend.

JackAKirk via All-commits all-commits at lists.llvm.org
Wed Feb 8 04:43:27 PST 2023


  Branch: refs/heads/main
  Home:   https://github.com/llvm/llvm-project
  Commit: 22d98280dd8ee70064899eefb973a1c020605874
      https://github.com/llvm/llvm-project/commit/22d98280dd8ee70064899eefb973a1c020605874
  Author: JackAKirk <jack.kirk at codeplay.com>
  Date:   2023-02-08 (Wed, 08 Feb 2023)

  Changed paths:
    M llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.h

  Log Message:
  -----------
  [NVPTX] Increase inline threshold multiplier to 11 in nvptx backend.

I used https://github.com/zjin-lcf/HeCBench (with nvcc usage swapped to
clang++), an adaptation of the classic Rodinia benchmarks aimed at the
CUDA and SYCL programming models, to compare different values of the
multiplier using both the clang++ CUDA and clang++ SYCL NVPTX backends.
I find that the current value is too low in both cases. Qualitatively
(and in most cases with a very close quantitative agreement), the change
in code execution time for multiplier values ranging from 5 to 1000
matches across both variants of the HeCBench samples (CUDA clang++
versus SYCL with the CUDA backend, using the intel/llvm clang++
compiler).

A value of 11 is optimal for clang++ CUDA in every case I have
investigated: I have not found a single case where performance is
degraded by raising the value from 5 to 11. For one sample the SYCL CUDA
backend preferred a higher value, but we are happy to prioritize clang++
CUDA, and we find that 11 is close to ideal for both cases anyway.

It would be good to investigate further using clang++ OpenMP CUDA
offload. However, since I do not know of an appropriate benchmark suite
for that case, and since we are now receiving weekly complaints about
register spills caused by insufficient inlining, we have decided to
propose this change now and potentially seek more input from someone
with more expertise in the OpenMP case.

Incidentally, this value coincides with the multiplier used by the AMD
GCN backend. We have also used the AMD backend of the intel/llvm "dpc++"
compiler to compare the inlining behavior of identical code when
targeting AMD rather than NVPTX. Unsurprisingly, the AMD backend with a
multiplier of 11 inlined more effectively than the NVPTX backend with a
value of 5; when the two backends use the same multiplier, their
inlining behaviors align closely.
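
For reference, the change itself is a one-line bump of a
TargetTransformInfo hook in the changed header. A minimal sketch of the
updated override (the comment wording is an assumption; the hook and the
5 -> 11 change are what this commit makes):

    // llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.h
    // Calls are comparatively expensive on NVPTX, so scale up the
    // generic inline cost threshold for this target.
    unsigned getInliningThresholdMultiplier() { return 11; } // was 5

The inline cost analysis scales its base threshold by this multiplier,
so with LLVM's default inline threshold of 225 the effective budget
grows from roughly 5 * 225 = 1125 to 11 * 225 = 2475.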

This change also considerably improves the performance of at least one
widely used HPC application: NWChemEx.

Signed-off-by: JackAKirk <jack.kirk at codeplay.com>

Reviewed By: tra
Differential Revision: https://reviews.llvm.org/D142232



