[all-commits] [llvm/llvm-project] a704e6: [flang] Added alternative inlining code for hlfir....

Mon Mar 3 09:58:42 PST 2025

  Branch: refs/heads/main
  Home:   https://github.com/llvm/llvm-project
  Commit: a704e6587bfd974af053712c6da01fa04d74c31b
      https://github.com/llvm/llvm-project/commit/a704e6587bfd974af053712c6da01fa04d74c31b
  Author: Slava Zakharin <szakharin at nvidia.com>
  Date:   2025-03-03 (Mon, 03 Mar 2025)

  Changed paths:
    M flang/include/flang/Optimizer/Builder/HLFIRTools.h
    M flang/lib/Optimizer/Builder/HLFIRTools.cpp
    M flang/lib/Optimizer/HLFIR/Transforms/SimplifyHLFIRIntrinsics.cpp
    M flang/test/HLFIR/simplify-hlfir-intrinsics-cshift.fir

  Log Message:
  -----------
  [flang] Added alternative inlining code for hlfir.cshift. (#129176)

Flang generates slower code for `CSHIFT(CSHIFT(PTR(:,:,I),sh1,1),sh2,2)`
pattern in facerec than other compilers. The first CSHIFT can be done
as two memcpy's wrapped in a loop for the second dimension.
This does require creating a temporary array, but it seems to be faster,
than the current hlfir.elemental inlining.

I started with modifying the new index computation in
hlfir.elemental inlining: the new arith.select approach does enable
some vectorization in LLVM, but on x86 it is using gathers/scatters
and does not give much speed-up.

I also experimented with LoopBoundSplitPass
and InductiveRangeCheckElimination for a simple (not chained) CSHIFT
case, but I could not adjust them to split the loop with a condition
on the value of the IV into two loops with disjoint iteration spaces.
I thought if I could do it, I would be able to keep the hlfir.elemental
inlining mostly untouched, and then adjust the hlfir.elemental inlining
heuristics for the facerec case.

Since I was not able to make these pass work for me, I added a special
case inlining for CSHIFT(ARRAY,SH,DIM=1) via hlfir.eval_in_mem.
If ARRAY is not statically known to have the contiguous leading
dimension, there is a dynamic check for contiguity, which allows
exposing it to LLVM and enabling the rewrite of the copy loops
into memcpys. This approach is stepping on the toes of LoopVersioning,
but it is helpful in facerec case.

I measured ~6% speed-up on grace, and ~4% on zen4.

To unsubscribe from these emails, change your notification settings at https://github.com/llvm/llvm-project/settings/notifications