[PATCH] D109296: [LV] Improve inclusivity of vectorization

Sun Sep 19 12:11:12 PDT 2021

lebedev.ri added a comment.

In D109296#3008361 <https://reviews.llvm.org/D109296#3008361>, @fhahn wrote:

> In D109296#2987011 <https://reviews.llvm.org/D109296#2987011>, @lebedev.ri wrote:
>
>> In D109296#2986995 <https://reviews.llvm.org/D109296#2986995>, @fhahn wrote:
>>
>>>> My original proposal. Specify the budget in terms of the number of scalar loop iterations. I think then it should be ~12.
>>>
>>> Do you mean the cost of 12 scalar loop iterations?
>>
>> Yep.
>
> In the description you mentioned that ~6 is needed to match the current threshold, so 12 would be very roughly double the threshold IIUC?
> With 12, do all expected loops get vectorized for RawSpeed and that's a motivation for 12?

See the spreadsheet linked in the description, `RTCost/ScalarCostPerIter` is the new threshold, `NumChecks` is the old "threshold".
With ~6, almost all loops in RawSpeed+darktable get vectorized (well, ignoring those LV doesn't get to/know how to vectorize).
But obviously this threshold is going to vary somewhat per codebase,
as you can see in the spreadsheet, a somewhat higher limit is beneficial for other codebases i looked at.
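
To make the difference between the two thresholds concrete, here is a minimal sketch, not the patch's actual code. The struct, its field names and both helper functions are hypothetical. The `NumChecks` and `RTCost` values in the note below come from the two cases quoted later in this thread, while `ScalarCostPerIter = 10` is an assumed value chosen purely for illustration.

```cpp
#include <cstdint>

// Hypothetical per-loop summary; these are not LLVM's real data structures.
struct RuntimeCheckInfo {
  unsigned NumChecks;          // number of runtime checks (the old "threshold" input)
  unsigned RTCost;             // estimated total cost of all emitted runtime checks
  unsigned ScalarCostPerIter;  // estimated cost of one scalar loop iteration
};

// Old-style decision: only the number of checks matters.
bool allowedByCheckCount(const RuntimeCheckInfo &I, unsigned MaxChecks) {
  return I.NumChecks <= MaxChecks;
}

// Proposed-style decision: RTCost / ScalarCostPerIter must fit within a budget
// expressed in scalar loop iterations. Cross-multiplying avoids losing
// precision to integer division.
bool allowedByIterationBudget(const RuntimeCheckInfo &I, unsigned MaxScalarIters) {
  return static_cast<std::uint64_t>(I.RTCost) <=
         static_cast<std::uint64_t>(MaxScalarIters) * I.ScalarCostPerIter;
}
```

With a check-count limit of 8, `{8, 146, 10}` is accepted and `{9, 36, 10}` is rejected. With a budget of 12 scalar iterations (and the assumed per-iteration cost of 10), the outcome flips: the cheap 9-check case is accepted and the expensive 8-check case is rejected.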

> I assume all loops in RawSpeed have sufficiently large trip counts (like >100 or > 1000)?

It will obviously vary on a case-by-case basis, but yes; in the particular unvectorized loops that originally motivated this patch,
it's ~10M (see `VC5Decompressor::Wavelet::reconstructPass()::process()` and `VC5Decompressor::Wavelet::combineLowHighPass()::process()`):
 F19123308: VC5Decompressor.cpp.gcov.html <https://reviews.llvm.org/F19123308>
... for a single input (https://raw.pixls.us/data-unique/GoPro/HERO6%20Black/GOPR9172.GPR)
For darktable loops i'd say the trip count is not smaller than 1M, averaging maybe around ~25M+, or higher depending on lack of unrolling.

> I've been wondering if it might be worth to give the users an easier way to tell the compiler it should assume high trip counts. That might make our life easier for projects/files where this applies and could also help with other loop optimizations which only become profitable for larger trip counts (IIRC we had several reports by users where this could be helpful for not only vectorization)

FWIW i'm personally doing this because i'd like these things to just work without any pragmas.
Perhaps PGO counters could be useful for that.

> In D109296#2985531 <https://reviews.llvm.org/D109296#2985531>, @lebedev.ri wrote:
>
>> Collected some more numbers (sheet updated) (+ rawtherapee, babl/geg/, vanilla llvm test-suite).
>> I think it showcases the problem quite well. We currently are okay with vectorizing
>> if that means emitting a check with checks=8,members=45,RTCost=146,
>> but not with checks=9,members=18,RTCost=36.
>
> That's an interesting finding! Originally replacing the old threshold with a cost-of-all-checks one did not seem very appealing to me. But if it would be possible to come up with a reasonable translation of the old one (not based on the cases where the cost currently is very much overestimated but some middle-ground), it might be a viable first step. But then it probably would be easier to just transition once and deal with the fallout.

The obvious problem is that, whatever new limit we choose, as long as it stops vectorizing *some* cases,
some of the cases we no longer vectorize were actually profitable to vectorize (i.e. the run-time check being true at runtime).

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D109296/new/

https://reviews.llvm.org/D109296