[PATCH] D102748: [LoopUnroll] Don't unroll before vectorisation

Sjoerd Meijer via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Wed May 19 03:38:24 PDT 2021


SjoerdMeijer added a comment.

In D102748#2768019 <https://reviews.llvm.org/D102748#2768019>, @fhahn wrote:

> In D102748#2767983 <https://reviews.llvm.org/D102748#2767983>, @SjoerdMeijer wrote:
>
>> I believe unrolling before vectorisation is fundamentally the wrong approach.
>
> It is indeed suboptimal for a subset of loops which can be vectorized by LV and the SLP.
>
> But in a lot of other cases, early unrolling enables other passes to perform many additional simplifications, as @nikic mentioned, and I think it is very easy to come up with examples to show that, because a lot of simplification passes don't work well on loops. Just one example below. With early unrolling, LLVM will eliminate the memset; without early unrolling it won't. I would expect simplifications due to early unrolling to be quite helpful for a lot of general non-benchmark code with few vectorizable loops.
>
>   #include <string.h>
>   
>   void foo(char *Ptr) {
>       memset(Ptr, 0, 16);
>   
>       for (unsigned i = 0; i < 16; i++)
>         Ptr[i] = i + 1;
>   }

I do see your point, but the funny thing is that this gives exactly the same codegen (just a load and a store).
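To make that concrete, here is a rough source-level sketch (hypothetical, not the actual generated code; foo_equivalent is just an illustrative name) of what both pipelines boil down to: one 16-byte constant that gets loaded and stored.

  #include <string.h>

  /* Hedged sketch, not actual compiler output: memset(0) followed by the
     Ptr[i] = i + 1 loop is equivalent to copying one 16-byte constant,
     which lowers to a single vector load plus a single vector store. */
  void foo_equivalent(char *Ptr) {
    static const char Vals[16] = { 1,  2,  3,  4,  5,  6,  7,  8,
                                   9, 10, 11, 12, 13, 14, 15, 16};
    memcpy(Ptr, Vals, 16);
  }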

With this patch:

- The loop gets vectorised 16 wide (i.e., it has one iteration).
- Then the unroller comes along and completely unrolls this, so we get rid of the loop.
- The memset is expanded during instruction selection, and then things get combined away to a load/store.

Before:

- The loop gets fully unrolled early, so we have the memset and a block with the loop fully unrolled.
- GlobalOptPass comes along, sees the memset is globally dead, and removes it; what remains is just the unrolled block.
- SLP kicks in and vectorises this block (sketched below).
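For reference, a hedged source-level sketch (hypothetical function name, not actual pass output) of the unrolled block once the dead memset is gone:

  /* Hedged sketch: after early full unrolling and removal of the dead memset,
     the body is a straight-line run of consecutive constant stores, which
     SLP can merge into wide vector stores. */
  void foo_after_unroll(char *Ptr) {
    Ptr[0]  = 1;  Ptr[1]  = 2;  Ptr[2]  = 3;  Ptr[3]  = 4;
    Ptr[4]  = 5;  Ptr[5]  = 6;  Ptr[6]  = 7;  Ptr[7]  = 8;
    Ptr[8]  = 9;  Ptr[9]  = 10; Ptr[10] = 11; Ptr[11] = 12;
    Ptr[12] = 13; Ptr[13] = 14; Ptr[14] = 15; Ptr[15] = 16;
  }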

All with the same result. So in a way this is an advertisement for skipping the full unroller early. But like I said, I understand the point, and it was not my intention to skip full unrolling altogether, I just wanted it to run after the loop vectoriser.
Also, if this were a terrible idea, I would have expected it to be flagged up by SPEC, as it contains some fairly diverse code; but fair enough, I have only run SPEC and the embedded benchmarks.

>> This, I think, also relies on the loop vectoriser which seems more powerful than SLP vectorisation currently
>
> One major difference is that LV can use runtime checks to ensure memory accesses do not alias. I suspect that's the main issue blocking SLP in the case in the description.

Yep, you're exactly right. And that is quite important here; it's what makes all the difference.
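The kind of loop where that difference shows up looks roughly like this (a hedged, hypothetical example, not the loop from the patch description):

  /* Hedged example: Dst and Src may alias, so SLP, working on straight-line
     code, cannot prove the loads and stores independent. LV can version the
     loop with a runtime no-overlap check and vectorise the non-aliasing path. */
  void axpy(float *Dst, const float *Src, int N) {
    for (int i = 0; i < N; i++)
      Dst[i] = Dst[i] + 2.0f * Src[i];
  }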


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D102748/new/

https://reviews.llvm.org/D102748


