[PATCH] D109296: [LV] Improve inclusivity of vectorization

Sat Sep 25 12:03:13 PDT 2021

fhahn added a comment.

In D109296#3008367 <https://reviews.llvm.org/D109296#3008367>, @lebedev.ri wrote:

> In D109296#3008361 <https://reviews.llvm.org/D109296#3008361>, @fhahn wrote:
>
>> In D109296#2987011 <https://reviews.llvm.org/D109296#2987011>, @lebedev.ri wrote:
>>
>>> In D109296#2986995 <https://reviews.llvm.org/D109296#2986995>, @fhahn wrote:
>>>
>>>>> My original proposal. Specify the budget in terms of the number of scalar loop iterations. I think then it should be ~12.
>>>>
>>>> Do you mean the cost of 12 scalar loop iterations?
>>>
>>> Yep.
>>
>> In the description you mentioned that ~6 is needed to match the current threshold, so 12 would be very roughly double the threshold IIUC?
>> With 12, do all expected loops get vectorized for RawSpeed and that's a motivation for 12?
>
> See the spreadsheet linked in the description, `RTCost/ScalarCostPerIter` is the new threshold, `NumChecks` is the old "threshold".
> With ~6 almost all loops in RawSpeed+darktable get vectorized (well, ignoring those LV doesn't get to/know how to vectorize)
> But obviously this threshold is going to vary somewhat per codebase,
> as you can see in the spreadsheet, a somewhat higher limit is beneficial for other codebases i looked at.
>
>> I assume all loops in RawSpeed have sufficiently large trip counts (like >100 or > 1000)?
>
> It will obviously depend on a case-by-case basis, but yes; in the particular unvectorized loops that originally motivated this,
> it's ~10M: (see `VC5Decompressor::Wavelet::reconstructPass()::process()` and `VC5Decompressor::Wavelet::combineLowHighPass()::process()`)
>  F19123308: VC5Decompressor.cpp.gcov.html <https://reviews.llvm.org/F19123308>
> ... for a single input (https://raw.pixls.us/data-unique/GoPro/HERO6%20Black/GOPR9172.GPR)
> For darktable loops i'd say the trip count is not smaller than 1M, averaging maybe around ~25M+, or higher depending on lack of unrolling.
>
>> I've been wondering if it might be worth giving users an easier way to tell the compiler it should assume high trip counts. That might make our life easier for projects/files where this applies and could also help with other loop optimizations which only become profitable for larger trip counts (IIRC we had several reports by users where this could be helpful for not only vectorization).
>
> FWIW i'm personally doing this because i'd like these things to just work without any pragmas.
> Perhaps PGO counters could be useful for that.

I wasn't really thinking of a pragma (although it might be helpful in some cases), but a new compiler option (like `-fhigh-trip-count-assumption`).
PGO should definitely be helpful here.

>> In D109296#2985531 <https://reviews.llvm.org/D109296#2985531>, @lebedev.ri wrote:
>>
>>> Collected some more numbers (sheet updated) (+ rawtherapee, babl/geg/, vanilla llvm test-suite).
>>> I think it showcases the problem quite well. We currently are okay with vectorizing
>>> if that means emitting a check with checks=8,members=45,RTCost=146,
>>> but not with checks=9,members=18,RTCost=36.
>>
>> That's an interesting finding! Originally replacing the old threshold with a cost-of-all-checks one did not seem very appealing to me. But if it would be possible to come up with a reasonable translation of the old one (not based on the cases where the cost currently is very much overestimated but some middle-ground), it might be a viable first step. But then it probably would be easier to just transition once and deal with the fallout.
>
> The obvious problem is, whatever new limit we choose, as long as it no longer vectorizes *some* cases,
> some of the cases we no longer vectorize were actually profitable to vectorize (i.e. run-time check being true at runtime).

There's another option that potentially allows us to side-step the issue of not knowing the trip count. Based on the formula used in the original version of D109368 <https://reviews.llvm.org/D109368>, we can compute the minimum trip count required for the vector loop to be profitable. We can also compute a minimum trip count so that the cost of the runtime checks is only a fraction of the total scalar loop cost. We already emit a minimum iteration check, which can be adjusted to account for these additional computed minimums. I think that would allow us to vectorize a lot more aggressively, while still guarding against runtime checks adding a large overhead if they fail for low trip count loops. I updated D109368 <https://reviews.llvm.org/D109368> accordingly.
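
To make the two minimums concrete, here is a rough standalone sketch. It is not the code in D109368; the struct, the function names and the simplified cost model (no epilogue cost, break-even counted as profitable) are assumptions made purely for illustration. With scalar cost per iteration `ScalarC`, vector cost per vector iteration `VecC`, runtime check cost `RtC` and vectorization factor `VF`, the vector loop breaks even when `ScalarC * TC >= RtC + VecC * TC / VF`, and the checks stay below `1/Divisor` of the scalar loop cost when `RtC * Divisor <= ScalarC * TC`.

```cpp
#include <algorithm>
#include <cstdint>
#include <optional>

// Hypothetical cost inputs; none of these names are LLVM API.
struct LoopCosts {
  uint64_t ScalarCostPerIter; // cost of one scalar iteration
  uint64_t VectorCostPerIter; // cost of one vector iteration (covers VF scalar iterations)
  uint64_t RuntimeCheckCost;  // one-time cost of all runtime checks
  uint64_t VF;                // vectorization factor
};

static uint64_t ceilDiv(uint64_t A, uint64_t B) { return (A + B - 1) / B; }

// Smallest TC with ScalarC * TC >= RtC + VecC * TC / VF, i.e.
// TC >= RtC * VF / (ScalarC * VF - VecC).
// Returns nullopt if the vector body is not cheaper than VF scalar iterations.
std::optional<uint64_t> minProfitableTripCount(const LoopCosts &C) {
  uint64_t ScalarCostPerVecIter = C.ScalarCostPerIter * C.VF;
  if (ScalarCostPerVecIter <= C.VectorCostPerIter)
    return std::nullopt;
  return ceilDiv(C.RuntimeCheckCost * C.VF,
                 ScalarCostPerVecIter - C.VectorCostPerIter);
}

// Smallest TC such that the checks cost at most 1/Divisor of the total
// scalar loop cost. Precondition: ScalarCostPerIter > 0.
uint64_t minTripCountForCheckOverhead(const LoopCosts &C, uint64_t Divisor) {
  return ceilDiv(C.RuntimeCheckCost * Divisor, C.ScalarCostPerIter);
}

// Value the existing minimum iteration check could be raised to: the
// largest of both minimums and VF (at least one full vector iteration).
std::optional<uint64_t> minTripCountGuard(const LoopCosts &C, uint64_t Divisor) {
  std::optional<uint64_t> Profitable = minProfitableTripCount(C);
  if (!Profitable)
    return std::nullopt;
  return std::max({*Profitable, minTripCountForCheckOverhead(C, Divisor), C.VF});
}
```

For example, with made-up costs `ScalarC = 4`, `VecC = 8`, `RtC = 36`, `VF = 4` and a divisor of 10, the break-even trip count is 18 and the overhead bound is 90, so the guard would only enter the runtime checks for trip counts of at least 90. Loops below that would run the scalar version without ever paying for the checks.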

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D109296/new/

https://reviews.llvm.org/D109296