[PATCH] D102834: [SLPVectorizer] Implement initial memory versioning.

Alexey Bataev via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Thu Jul 29 10:20:51 PDT 2021


ABataev added a comment.

In D102834#2914099 <https://reviews.llvm.org/D102834#2914099>, @fhahn wrote:

> In D102834#2866603 <https://reviews.llvm.org/D102834#2866603>, @SjoerdMeijer wrote:
>
>> Just a bit of a heads up that I took this patch (and its dependencies) and ran some numbers for x264 from SPEC, where I have seen quite a few missed opportunities caused by the inability to emit runtime alias checks. I don't see any change in performance though, which is slightly unexpected (I was hoping for some already). It might be that something else is now blocking SLP vectorisation on AArch64 (which is what I am looking at), possibly cost-model issues. That is the case for the 2 examples I am currently looking at, but I will do some more analysis. And of course we still need this patch as an enabler.
>
> The case from x264 is in the function `@f_alias` in `../AArch64/loadi8.ll`, right?
>
> With versioning, the SLP vectorizer generates the following vector block:
>
>   entry.slpversioned:                               ; preds = %entry.slpmemcheck
>     %scale = getelementptr inbounds %struct.weight_t, %struct.weight_t* %w, i64 0, i32 0
>     %0 = load i32, i32* %scale, align 16
>     %offset = getelementptr inbounds %struct.weight_t, %struct.weight_t* %w, i64 0, i32 1
>     %1 = load i32, i32* %offset, align 4
>     %arrayidx.1 = getelementptr inbounds i8, i8* %src, i64 1
>     %arrayidx2.1 = getelementptr inbounds i8, i8* %dst, i64 1
>     %arrayidx.2 = getelementptr inbounds i8, i8* %src, i64 2
>     %arrayidx2.2 = getelementptr inbounds i8, i8* %dst, i64 2
>     %arrayidx.3 = getelementptr inbounds i8, i8* %src, i64 3
>     %2 = bitcast i8* %src to <4 x i8>*
>     %3 = load <4 x i8>, <4 x i8>* %2, align 1, !alias.scope !0, !noalias !3
>     %4 = zext <4 x i8> %3 to <4 x i32>
>     %5 = insertelement <4 x i32> poison, i32 %0, i32 0
>     %6 = insertelement <4 x i32> %5, i32 %0, i32 1
>     %7 = insertelement <4 x i32> %6, i32 %0, i32 2
>     %8 = insertelement <4 x i32> %7, i32 %0, i32 3
>     %9 = mul nsw <4 x i32> %8, %4
>     %10 = insertelement <4 x i32> poison, i32 %1, i32 0
>     %11 = insertelement <4 x i32> %10, i32 %1, i32 1
>     %12 = insertelement <4 x i32> %11, i32 %1, i32 2
>     %13 = insertelement <4 x i32> %12, i32 %1, i32 3
>     %14 = add nsw <4 x i32> %9, %13
>     %15 = icmp ult <4 x i32> %14, <i32 256, i32 256, i32 256, i32 256>
>     %16 = icmp sgt <4 x i32> %14, zeroinitializer
>     %17 = sext <4 x i1> %16 to <4 x i32>
>     %18 = select <4 x i1> %15, <4 x i32> %14, <4 x i32> %17
>     %19 = trunc <4 x i32> %18 to <4 x i8>
>     %arrayidx2.3 = getelementptr inbounds i8, i8* %dst, i64 3
>     %20 = bitcast i8* %dst to <4 x i8>*
>     store <4 x i8> %19, <4 x i8>* %20, align 1, !alias.scope !3, !noalias !0
>     br label %entry.merge
>
> The problem there is that currently the cost of the vector block is compared to the cost of the original scalar block. In the case at hand the vectorized IR is not optimal and its cost gets over-estimated, causing the versioned block to be dropped as unprofitable.
>
> We can tackle that in different ways: a) have the SLP vectorizer generate more optimal vector IR (e.g. a `shuffle` instead of a chain of `insert`s) or b) post-process the IR a bit before computing the cost. Not sure how much work a) would be. @ABataev any ideas?

We can easily generate shuffles for splats/reused scalars; I will implement this ASAP.
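For reference, a minimal sketch of the idea (the canonical splat idiom, not the actual patch output) applied to the splat of `%0` in the block above: replace the four `insertelement`s with a single insert into lane 0 followed by a zero-mask `shufflevector`:

    ; Chain of insertelements currently emitted for the splat of %0:
    ;   %5 = insertelement <4 x i32> poison, i32 %0, i32 0
    ;   %6 = insertelement <4 x i32> %5, i32 %0, i32 1
    ;   %7 = insertelement <4 x i32> %6, i32 %0, i32 2
    ;   %8 = insertelement <4 x i32> %7, i32 %0, i32 3
    ; Canonical splat form: one insert plus a broadcast shuffle.
    %splat.insert = insertelement <4 x i32> poison, i32 %0, i32 0
    %splat = shufflevector <4 x i32> %splat.insert, <4 x i32> poison, <4 x i32> zeroinitializer

The same applies to the splat of `%1`. Broadcast shuffles are generally modeled as cheap by the cost model, so this should also bring the estimated cost of the versioned block down.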


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D102834/new/

https://reviews.llvm.org/D102834


