[PATCH] D50665: [LV][LAA] Vectorize loop invariant values stored into loop invariant address

Fri Aug 24 13:11:17 PDT 2018

hsaito added a comment.

>>> For other architectures that do not have these masked intrinsics, we just generate the predicated store by doing an extract and branch on each lane (correct but inefficient and will be avoided unless -force-vector-width=X).
>> 
>> In general, self output dependence is fine to vectorize (whether the store address is uniform or random), as long as (masked) scatter (or scatter emulation) happens from lower elements to higher elements.
> 
> I don't think the above comment matters for uniform addresses because a uniform address is invariant.

Only if you are storing a uniform value.
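To make the qualifier concrete, here is a minimal C sketch. The function name is hypothetical and this is not LLVM code. It writes every lane's value to one address, visiting lanes either in ascending order (the scalar loop's iteration order) or in descending order. With varying values the two orders leave different results in memory. With a uniform value they agree. So a uniform address alone does not make lane order irrelevant.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: store vals[0..n) to the single address *dst,
   visiting lanes in ascending order (forward != 0) or descending order.
   Returns the value left in *dst, i.e. the value of the last lane visited. */
static int32_t store_lanes_to_one_address(int32_t *dst, const int32_t *vals,
                                          size_t n, int forward) {
    assert(dst != NULL);
    for (size_t i = 0; i < n; ++i) {
        size_t lane = forward ? i : n - 1 - i;
        *dst = vals[lane];
    }
    return *dst;
}
```

For vals {1, 2, 3, 4}, the forward order leaves 4, which is what the scalar loop produces, and the reversed order leaves 1. For vals {7, 7, 7, 7}, both orders leave 7.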

> This is what the LangRef states for the scatter intrinsic (https://llvm.org/docs/LangRef.html#id1792):
> 
>   The data stored in memory is a vector of any integer, floating-point or pointer data type. Each vector element is stored in an arbitrary memory address. Scatter with overlapping addresses is guaranteed to be ordered from least-significant to most-significant element.

Thanks for reminding me that the intrinsic is defined with the ordering requirement.
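A scalar emulation of a masked scatter that respects this ordering can be sketched in C as follows. The function name is hypothetical, and addressing through an index into one array (rather than the vector of pointers the intrinsic actually takes) is a simplifying assumption. Each lane is handled with an "extract and branch" on its mask bit. Lanes are visited from least- to most-significant, so when several active lanes share an address the highest such lane wins. That matches the scalar loop's order, which is why self output dependences stay correct.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical emulation of a masked scatter: each active lane i writes
   vals[i] to mem[idx[i]], visiting lanes from least- to most-significant.
   On overlapping addresses the highest active lane's value survives. */
static void masked_scatter_emulated(int32_t *mem, size_t mem_len,
                                    const size_t *idx, const int32_t *vals,
                                    const bool *mask, size_t lanes) {
    for (size_t i = 0; i < lanes; ++i) {  /* "extract" lane i */
        if (!mask[i])                      /* "branch" on lane i's mask bit */
            continue;
        assert(idx[i] < mem_len);
        mem[idx[i]] = vals[i];
    }
}
```

The example uses idx {0, 1, 0, 1} and vals {10, 20, 30, 40}, with lane 3 masked off. The result is mem[0] == 30, because lane 2 overwrites lane 0, and mem[1] == 20, because the inactive lane 3 does not overwrite lane 1.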

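We should also consider the following, depending on the relative cost of a branch versus a masked scatter. When both the stored value and the address are uniform, every active lane writes the same value to the same location. It is therefore enough to test whether any mask bit is set and, if so, issue a single scalar store. For targets without masked scatter, this should be better than masked scatter emulation.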

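  %5 = bitcast <16 x i1> %4 to i16    ; pack the 16-lane mask into one integer
  %6 = icmp eq i16 %5, 0              ; true if no lane is active
  br i1 %6, label %skip, label %fall
fall:
  store i32 %ntrunc, i32* %a          ; one scalar store of the uniform value
  br label %skip
skip: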
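The transformation above can be modeled in C to check that it matches the per-lane emulation. This is a sketch under stated assumptions: the 16-lane width is taken from the <16 x i1> mask, the function names are hypothetical, and this is not what the vectorizer emits. The test `mask != 0` plays the role of the bitcast to i16 followed by `icmp eq ... 0`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-lane emulation: one "extract and branch" per mask bit;
   every active lane stores the same uniform value v to the same address. */
static void predicated_store_per_lane(int32_t *a, int32_t v, uint16_t mask) {
    assert(a != NULL);
    for (unsigned lane = 0; lane < 16; ++lane)
        if (mask & (1u << lane))
            *a = v;
}

/* Hypothetical "any lane active" form: since all active lanes would write
   the same v to the same *a, a single guarded scalar store suffices. */
static void predicated_store_uniform(int32_t *a, int32_t v, uint16_t mask) {
    assert(a != NULL);
    if (mask != 0)   /* models: bitcast <16 x i1> to i16; icmp eq i16 ..., 0 */
        *a = v;
}
```

The design trade-off is one scalar compare and branch in place of sixteen lane tests or a masked scatter. The branch can still mispredict when the mask pattern is irregular, which is why the choice should depend on the target's costs.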

Repository:
  rL LLVM

https://reviews.llvm.org/D50665