[PATCH] D109368: [LV] Don't vectorize if we can prove RT + vector cost >= scalar cost.

Tue Sep 28 13:21:44 PDT 2021

fhahn updated this revision to Diff 375681.
fhahn marked 17 inline comments as done.
fhahn added a comment.

Updated to address the inline comments. Thanks everyone! I hope I didn't miss any (apologies if I did).

In D109368#3022743 <https://reviews.llvm.org/D109368#3022743>, @lebedev.ri wrote:

> Also, this patch should probably steal it's subject from D109296 <https://reviews.llvm.org/D109296>, the current one does not do it's justice.

Yep I need to update the title & description! I'll do that once I collected a bit more supporting data.

In D109368#3023635 <https://reviews.llvm.org/D109368#3023635>, @dmgreen wrote:

> This sounds like a sensible general approach to take to me, from a cost point of view. It seems more like how I once heard the GCC vectorizer cost calculate described.
>
> But there may be more work required on the costing calculations. It might be both under and over estimating the costs in places.
> The first obvious problem I ran into was that the runtime checks it emits are not simplified prior to the costs being added for each instruction. The overflow checks seem to create "umul_with_overflow(|step|, TC). But the step is commonly 1 so given half a chance the instruction will be simplified away.
> It looks like there were other inefficiencies with the values created too, with `select false, x, y` or `or x, false` being added to the cost. But the umul_with_overflow(1, TC) was getting a pretty high cost.

Thanks for raising this point! As you mentioned, we should try to make sure the code we emit for the runtime checks is as close to the final version as possible. The `or x, false` case should be easy to fix. For the umul_with_overflow I'd need to take a closer look to see where the best place to improve that would be.

In D109368#3023702 <https://reviews.llvm.org/D109368#3023702>, @nikic wrote:

> Can you please provide an analysis of the code size impact this has? LLVM is already quite bad at unrolling/vectorizing too aggressively for non-benchmark code (which does not have hot loops) and I'm somewhat concerned this will make the situation even worse. Please do keep in mind that code size has important performance effects on icache and itlb misses, and just jumping over the code doesn't make that cost disappear.

I am still collecting the relevant data and I'll share it here once I have more details. On a high level the impact on the number of vectorized loops is relatively small on a large set of benchmarks (SPEC2006/SPEC2017/MultiSource) with a ~1% increase in vectorized loops on ARM64 with -O3.

So far the top size increase seem to be in smaller benchmarks. I need to take a closer look at IRSmk and smg2000 in particular. First inspection of IRSmk shows that is mostly consists of 2-3 large loops for which we did not generate runtime checks for so far, but do with that patch. Still need to check why the increase is so big.

  test-suite...s/ASC_Sequoia/IRSmk/IRSmk.test   3240.00    26668.00   723.1%
  test-suite...CI_Purple/SMG2000/smg2000.test   154724.00  403628.00  160.9%
  test-suite...chmarks/Rodinia/srad/srad.test   4332.00    5736.00    32.4%
  test-suite...Source/Benchmarks/sim/sim.test   13628.00   16660.00   22.2%
  test-suite...pplications/oggenc/oggenc.test   202936.00  232608.00  14.6%
  test-suite...oxyApps-C/miniGMG/miniGMG.test   51588.00   57792.00   12.0%
  test-suite...arks/mafft/pairlocalalign.test   356532.00  397024.00  11.4%
  test-suite...pps-C/SimpleMOC/SimpleMOC.test   30488.00   33436.00    9.7%
  test-suite...oxyApps-C/miniAMR/miniAMR.test   51744.00   55580.00    7.4%
  test-suite...rks/FreeBench/pifft/pifft.test   57368.00   59032.00    2.9%
  test-suite...8.imagick_r/538.imagick_r.test   1484356.00 1518020.00  2.3%
  test-suite...yApps-C++/PENNANT/PENNANT.test   105064.00  107004.00   1.8%
  test-suite...rks/tramp3d-v4/tramp3d-v4.test   776340.00  788504.00   1.6%
  test-suite...6/464.h264ref/464.h264ref.test   614848.00  623200.00   1.4%

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D109368/new/

https://reviews.llvm.org/D109368

Files:
  llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
  llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
  llvm/test/Transforms/LoopVectorize/AArch64/runtime-check-size-based-threshold.ll
  llvm/test/Transforms/LoopVectorize/ARM/mve-qabs.ll
  llvm/test/Transforms/LoopVectorize/X86/gather_scatter.ll
  llvm/test/Transforms/LoopVectorize/X86/pointer-runtime-checks-unprofitable.ll
  llvm/test/Transforms/LoopVectorize/X86/pr23997.ll
  llvm/test/Transforms/LoopVectorize/X86/pr35432.ll
  llvm/test/Transforms/LoopVectorize/X86/runtime-limit.ll

-------------- next part --------------
A non-text attachment was scrubbed...
Name: D109368.375681.patch
Type: text/x-patch
Size: 21909 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20210928/d1c51530/attachment.bin>