[all-commits] [llvm/llvm-project] 644a96: [LV] Vectorize cases with larger number of RT chec...

Mon Jul 4 07:12:10 PDT 2022

  Branch: refs/heads/main
  Home:   https://github.com/llvm/llvm-project
  Commit: 644a965c1efef68f22d9495e4cefbb599c214788
      https://github.com/llvm/llvm-project/commit/644a965c1efef68f22d9495e4cefbb599c214788
  Author: Florian Hahn <flo at fhahn.com>
  Date:   2022-07-04 (Mon, 04 Jul 2022)

  Changed paths:
    M llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h
    M llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
    M llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
    M llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
    M llvm/test/Transforms/LoopVectorize/AArch64/runtime-check-size-based-threshold.ll
    M llvm/test/Transforms/LoopVectorize/AArch64/sve-tail-folding-forced.ll
    M llvm/test/Transforms/LoopVectorize/AArch64/sve-tail-folding-unroll.ll
    M llvm/test/Transforms/LoopVectorize/AArch64/sve-tail-folding.ll
    M llvm/test/Transforms/LoopVectorize/X86/gather_scatter.ll
    M llvm/test/Transforms/LoopVectorize/X86/pointer-runtime-checks-unprofitable.ll
    M llvm/test/Transforms/LoopVectorize/X86/pr23997.ll
    M llvm/test/Transforms/LoopVectorize/X86/pr35432.ll
    M llvm/test/Transforms/LoopVectorize/X86/pr54634.ll
    M llvm/test/Transforms/LoopVectorize/X86/runtime-limit.ll

  Log Message:
  -----------
  [LV] Vectorize cases with larger number of RT checks, execute only if profitable.

This patch replaces the tight hard cut-off for the number of runtime
checks with a more accurate cost-driven approach.

The new approach allows vectorization with a larger number of runtime
checks in general, but only executes the vector loop (and runtime checks) if
considered profitable at runtime. Profitable here means that the cost-model
indicates that the runtime check cost + vector loop cost < scalar loop cost.

To do that, LV computes the minimum trip count for which runtime check cost
+ vector-loop-cost < scalar loop cost.

Note that there is still a hard cut-off to avoid excessive compile-time/code-size
increases, but it is much larger than the original limit.

The performance impact on standard test-suites like SPEC2006/SPEC2006/MultiSource
is mostly neutral, but the new approach can give substantial gains in cases where
we failed to vectorize before due to the over-aggressive cut-offs.

On AArch64 with -O3, I didn't observe any regressions outside the noise level (<0.4%)
and there are the following execution time improvements. Both `IRSmk` and `srad` are relatively short running, but the changes are far above the noise level for them on my benchmark system.

```
CFP2006/447.dealII/447.dealII    -1.9%
CINT2017rate/525.x264_r/525.x264_r    -2.2%
ASC_Sequoia/IRSmk/IRSmk       -9.2%
Rodinia/srad/srad     -36.1%
```

`size` regressions on AArch64 with -O3 are

```
MultiSource/Applications/hbd/hbd                 90256.00   106768.00 18.3%
MultiSourc...ks/ASCI_Purple/SMG2000/smg2000     240676.00   257268.00  6.9%
MultiSourc...enchmarks/mafft/pairlocalalign     472603.00   489131.00  3.5%
External/S...2017rate/525.x264_r/525.x264_r     613831.00   630343.00  2.7%
External/S...NT2006/464.h264ref/464.h264ref     818920.00   835448.00  2.0%
External/S...te/538.imagick_r/538.imagick_r    1994730.00  2027754.00  1.7%
MultiSourc...nchmarks/tramp3d-v4/tramp3d-v4    1236471.00  1253015.00  1.3%
MultiSource/Applications/oggenc/oggenc         2108147.00  2124675.00  0.8%
External/S.../CFP2006/447.dealII/447.dealII    4742999.00  4759559.00  0.3%
External/S...rate/510.parest_r/510.parest_r   14206377.00 14239433.00  0.2%
```

Reviewed By: lebedev.ri, ebrevnov, dmgreen

Differential Revision: https://reviews.llvm.org/D109368