[PATCH] D142232: Increase inline threshold multiplier to 11 in nvptx backend. I used https://github.com/zjin-lcf/HeCBench (with nvcc usage swapped to clang++), which is an adaptation of the classic Rodinia benchmarks aimed at CUDA and SYCL programming models, to...

Jack Kirk via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Fri Jan 20 09:14:58 PST 2023


JackAKirk created this revision.
JackAKirk added a reviewer: jlebar.
Herald added subscribers: mattd, gchakrabarti, asavonic, Naghasan, Anastasia, ebevhan, hiraditya, yaxunl.
Herald added a project: All.
JackAKirk requested review of this revision.
Herald added subscribers: llvm-commits, sstefan1, jholewinski.
Herald added a reviewer: jdoerfert.
Herald added a project: LLVM.

...compare different values of the multiplier using both clang++ cuda and clang++ sycl nvptx backends.
I find that the value is currently too low in both cases. Qualitatively (and in most cases with a very close quantitative agreement between the two), the change in code execution time across the range of multiplier values from 5 to 1000 matches in both variations of the HeCBench samples (CUDA clang++ vs. SYCL with the CUDA backend of the intel/llvm clang++ compiler).
This value of 11 is optimal for clang++ CUDA in all cases I have investigated. I have not found a single case where performance is degraded by changing the value from 5 to 11. For one sample the SYCL CUDA backend preferred a higher value, but we are happy to prioritize clang++ CUDA, and this value is close to ideal for both cases anyway.
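For context, the multiplier returned by this TTI hook scales LLVM's base inlining cost threshold in the inline cost analysis (llvm/lib/Analysis/InlineCost.cpp). A minimal C++ sketch of the effect, assuming the stock default threshold of 225 and ignoring the various bonuses and penalties the cost model also applies:

  // Rough illustration only: the real computation lives in InlineCost.cpp and
  // layers further bonuses/penalties (single-BB, hot-callsite, ...) on top.
  #include <cstdio>

  int main() {
    const unsigned DefaultThreshold = 225;  // assumed InlineConstants::DefaultThreshold
    const unsigned Multipliers[] = {5, 11}; // current vs. proposed NVPTX multiplier
    for (unsigned M : Multipliers)
      std::printf("multiplier %u -> effective threshold ~%u\n", M,
                  DefaultThreshold * M);    // roughly 1125 vs. 2475
  }

So the proposed change roughly doubles the cost budget for inlining a callee, which is consistent with the reduction in call overhead and register spills we observe.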
It would be good to do some further investigation using clang++ OpenMP CUDA offload. However, since I do not know of an appropriate benchmark suite for that case, and since we are now getting complaints about register spills related to insufficient inlining on a weekly basis, we have decided to propose this change now and to seek further input from someone with more expertise in the OpenMP case.
Incidentally, this value coincides with the multiplier used by the AMD GCN backend. We have also been able to use the AMD backend of the intel/llvm "dpc++" compiler to compare the inlining behaviour of identical code when targeting AMD versus NVPTX. Unsurprisingly, the AMD backend with a multiplier of 11 inlined better than the NVPTX backend using the value of 5. When the two backends use the same multiplier value their inlining behaviour appears to align closely.
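For reference, the AMDGPU equivalent of the hook changed below, as I recall it from llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.h (quoted from memory rather than verbatim, so please check the tree):

  // GCN TTI hook (paraphrased from memory): the same 11x scaling that this
  // patch adopts for NVPTX.
  unsigned getInliningThresholdMultiplier() { return 11; }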

This also considerably improves the performance of at least one of the most popular HPC applications: NWCHEMX.

Signed-off-by: JackAKirk <jack.kirk at codeplay.com>


Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D142232

Files:
  llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.h


Index: llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.h
===================================================================
--- llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.h
+++ llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.h
@@ -92,7 +92,7 @@
 
-  // Increase the inlining cost threshold by a factor of 5, reflecting that
+  // Increase the inlining cost threshold by a factor of 11, reflecting that
    // calls are particularly expensive in NVPTX.
 -  unsigned getInliningThresholdMultiplier() { return 5; }
 +  unsigned getInliningThresholdMultiplier() { return 11; }
 
   InstructionCost getArithmeticInstrCost(
       unsigned Opcode, Type *Ty, TTI::TargetCostKind CostKind,

