[PATCH] D50665: [LV][LAA] Vectorize loop invariant values stored into loop invariant address

Fri Aug 24 12:47:30 PDT 2018

anna added a comment.

In https://reviews.llvm.org/D50665#1212637, @hsaito wrote:

> In https://reviews.llvm.org/D50665#1212597, @anna wrote:
>
> > One more interesting thing I noticed while adding predicated invariant stores to X86 (for -mcpu=skylake-avx512): it supports masked scatter for non-uniform stores.
> >  But we need to add support for uniform stores along with this patch. Today, it just generates incorrect code (no predication whatsoever).
> >  For other architectures that do not have these masked intrinsics, we just generate the predicated store by doing an extract and branch on each lane (correct but inefficient, and it will be avoided unless -force-vector-width=X).
>
>
> In general, self output dependence is fine to vectorize (whether the store address is uniform or random), as long as (masked) scatter (or scatter emulation) happens from lower elements to higher elements.

I don't think the above comment matters for uniform addresses, because a uniform address is invariant. This is what the LangRef states for the scatter intrinsic (https://llvm.org/docs/LangRef.html#id1792):

  The data stored in memory is a vector of any integer, floating-point or pointer data type. Each vector element is stored in an arbitrary memory address. Scatter with overlapping addresses is guaranteed to be ordered from least-significant to most-significant element.
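A minimal C sketch of the ordering guarantee quoted above, for illustration only. The function name `masked_scatter_i32` and its signature are hypothetical and are not part of LLVM or of this patch; the sketch simply emulates a masked scatter lane by lane, from the least-significant lane upward, so that the highest active lane wins when addresses overlap.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical scalar emulation of a masked scatter (not an LLVM API).
 * Active lanes are stored in order from lane 0 (least significant)
 * to lane n-1, so with overlapping addresses the highest active lane
 * determines the final value in memory. */
void masked_scatter_i32(const int *vals, int *const *ptrs,
                        const bool *mask, size_t n) {
    for (size_t lane = 0; lane < n; ++lane) {
        if (mask[lane])
            *ptrs[lane] = vals[lane];
    }
}
```

When every lane holds the same pointer and the same value, as in the uniform-store case discussed here, the final memory contents do not depend on the lane order at all; only whether at least one mask bit is set matters.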

The scatter addresses are not overlapping in the uniform case; they are the exact same address. This is the code that gets generated for uniform stores on Skylake with AVX-512 support once I fixed the bug in this patch (every lane scatters to the same address, the stored value is the same in every lane, and the mask is the vector of booleans):

Pseudo code:

  if (b[i] == k)
    a = ntrunc; <-- uniform store based on the condition above.

IR generated:

  vector.ph:
    %broadcast.splatinsert5 = insertelement <16 x i32> undef, i32 %k, i32 0
    %broadcast.splat6 = shufflevector <16 x i32> %broadcast.splatinsert5, <16 x i32> undef, <16 x i32> zeroinitializer <-- vector splat of k
    %broadcast.splatinsert9 = insertelement <16 x i32*> undef, i32* %a, i32 0
    %broadcast.splat10 = shufflevector <16 x i32*> %broadcast.splatinsert9, <16 x i32*> undef, <16 x i32> zeroinitializer <-- vector splat of i32* a.

  vector.body:
    %2 = getelementptr inbounds i32, i32* %b, i64 %index
    %3 = bitcast i32* %2 to <16 x i32>*
    %wide.load = load <16 x i32>, <16 x i32>* %3, align 8
    %4 = icmp eq <16 x i32> %wide.load, %broadcast.splat6
    call void @llvm.masked.scatter.v16i32.v16p0i32(<16 x i32> %broadcast.splat8, <16 x i32*> %broadcast.splat10, i32 4, <16 x i1> %4) <-- scatter storing the same element into the same address (a), depending on same condition b[i] == k
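
To make the correspondence between the pseudo code and the IR concrete, here is a rough C sketch that runs the scalar loop next to a lane-by-lane emulation of the 16-wide vector body. The function names `scalar_loop` and `vector_loop` are hypothetical, the sketch assumes the trip count is a multiple of 16 (no remainder loop), and it only models the semantics; it is not what the vectorizer emits.

```c
#include <stdbool.h>
#include <stddef.h>

enum { VF = 16 }; /* matches the <16 x i32> vectors in the IR above */

/* Scalar reference: the conditional uniform store from the pseudo code. */
void scalar_loop(const int *b, size_t n, int k, int ntrunc, int *a) {
    for (size_t i = 0; i < n; ++i)
        if (b[i] == k)
            *a = ntrunc;
}

/* Emulated vector loop (hypothetical helper). Assumes n % VF == 0. */
void vector_loop(const int *b, size_t n, int k, int ntrunc, int *a) {
    for (size_t index = 0; index < n; index += VF) {
        bool mask[VF];
        /* Models: icmp eq <16 x i32> %wide.load, splat(k) */
        for (size_t lane = 0; lane < VF; ++lane)
            mask[lane] = (b[index + lane] == k);
        /* Models the masked scatter: same value, same address,
         * lanes visited from lowest to highest. */
        for (size_t lane = 0; lane < VF; ++lane)
            if (mask[lane])
                *a = ntrunc;
    }
}
```

Because the stored value and the address are both loop invariant, the two loops leave the same value in `a`: it is written exactly when some `b[i]` equals `k`, and dropping the mask would turn this into an unconditional store.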

Repository:
  rL LLVM

https://reviews.llvm.org/D50665