[llvm-dev] [Proposal][RFC] Strided Memory Access Vectorization

Thu Jun 16 23:05:09 PDT 2016

Thanks Hideki for going through this proposal.

I agree this can be done with Gather/Scatter intrinsic as well, once we 
enable these we need to place right costing. During costing we have to 
estimate the cost of load[s], store[s] and shuffle[s] and in CG prepare
we have to lower them. In the proposed approach we have introduced 
SkipFactor which helps to reduce number of load[s] & store[s]. i.e. For 
Stride 3 & VF 4 we only generate 2 loads(vs 3) to model load operation.

> 4) Unfortunately, 3) most likely imply that the wide load used here needs to be
> based on target HW physical
>      vector width. There are reasons why such type legalization at vectorizer
> should be avoided.

I did not understood this completely, vectorizer already identifies maximum
target HW physical vector width and based on it choose VF then widens the 
instructions.

> If I remember correctly, Elena Demikhovsky (my colleague) attempted to
> introduce stride load/store intrinsic sometime ago and got a push back. I think
> it's best to address the following questions.
> a) What can this enable over gather/scatter intrinsic w/ constant index vector
> b) What can this enable over hypothetical "stride load/store" intrinsic

In this proposal we are not generating any intrinsic rather directly generates 
required instructions based on vector factor. Here instructions width are purely 
based on vector factor. Also we have introduced SkipFactor concept to reduce
number of load[s] & store[s].

But I accept this can be done with hypothetical "stride load/store" intrinsic or 
gather/scatter intrinsic. At vectorizer generate these intrinsic and later during 
CodeGen/CG-Prepare generate mapping instructions.

> c) Any chance of introducing stride load/store as a form of wide load/store?

We already widens the instructions but purely based on vector factor.
i.e. if for a i32 load, VF is 4 then the generated wide load will be <i32 x 4>.
We are not intentionally generating wide loads like interleave vectorizer
because we transform everything at vectorizer and the wide load will not
help us to minimize the required load[s] & store[s].
i.e. for VF 4 and stride 3, interleave vectorizer will generate <i32 x 12> wide 
load (which is actually 3 <i32 x 4>) but this approach only generate 2 <i32 x 4> loads.

Regards,
Ashutosh

> -----Original Message-----
> From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On Behalf Of Saito,
> Hideki via llvm-dev
> Sent: Thursday, June 16, 2016 5:10 AM
> To: llvm-dev at lists.llvm.org
> Subject: Re: [llvm-dev] [Proposal][RFC] Strided Memory Access Vectorization
> 
> Ashutosh,
> 
> First, I'm all for enabling general stride load/store support for all targets ---
> should be just a matter of proper cost modeling. For that matter, we should
> enable general gather/scatter support for all targets.
> 
> About the specific approach taken by this RFC:
> 1) It's best to first state that this is for small constant stride where a single wide
> load/store can cover two or more
>      valid stride load/store elements. If a wide load/store contain just one valid
> element, there aren't any need
>      to do a wide load/store.
> 2) Optimal code generation for strided memory accesses are target dependent.
>      Breaking one gather/scatter w/ constant index vector into several IR
> instructions like below might hurt such optimization,
>      especially when the vectorization factor is greater than the target legal
> vector width ---- this can happen
>      in mixed data type code (int mixed with double, etc.)
> 3) Make sure that the last element of unmasked load is a valid stride load
> element ---- otherwise,
>      out-of-bounds SEGV can happen. Use of masked load intrinsic most likely
> degrades any benefits over
>      gather intrinsic w/ constant index.
> 4) Unfortunately, 3) most likely imply that the wide load used here needs to be
> based on target HW physical
>      vector width. There are reasons why such type legalization at vectorizer
> should be avoided.
> 5) In general, I think leaking this much complexity into IR hurts (rather than
> enable) optimization.
>      It's best to keep vectorizer's output as simple as vector code can be such that
> other optimizers can do their best
>      in optimizing vector code.
> 
> If I remember correctly, Elena Demikhovsky (my colleague) attempted to
> introduce stride load/store intrinsic sometime ago and got a push back. I think
> it's best to address the following questions.
> a) What can this enable over gather/scatter intrinsic w/ constant index vector
> b) What can this enable over hypothetical "stride load/store" intrinsic
> c) Any chance of introducing stride load/store as a form of wide load/store?
>     I understand that there is a big hurdle here, but the reward should also be
> pretty good here.
> 
> One might claim that the cost modeling can be easier, but that is also the
> downside since the generated cost may not match the cost of optimal code.
> 
> All these things considered, my suggestion is to start from enabling optimal
> code generation (or optimal lowering in CG prepare) from gather/scatter
> intrinsic and reflect that in vectorizer's cost model.
> 
> As a vectorizer centric person, I'd like to eventually see gather/scatter/stride-
> load/store to be supported by IR instructions (i.e., not intrinsics), but I'll leave
> that discussion to another time so that I won't pollute Ashutosh's RFC.
> 
> Thanks,
> Hideki Saito
> Technical Lead of Vectorizer Development Intel Compiler and Languages
> 
> -----------------------------------
> Message: 7
> Date: Wed, 15 Jun 2016 06:16:58 +0000
> From: "Nema, Ashutosh via llvm-dev" <llvm-dev at lists.llvm.org>
> To: llvm-dev <llvm-dev at lists.llvm.org>
> Subject: [llvm-dev] [Proposal][RFC] Strided Memory Access
> 	Vectorization
> Message-ID:
> 	<BLUPR12MB0644468A76C335E9B7431853FB550 at BLUPR12MB0644.na
> mprd12.prod.outlook.com>
> 
> Content-Type: text/plain; charset="utf-8"
> 
> Hi,
> 
> This is a proposal about implementing strided memory access vectorization.
> 
> Currently LLVM does not support strided access vectorization completely,
> some support is available via Interleave vectorization.
> 
> There are two main overheads with strided vectorizations:
> *       An overhead of consolidating data into an operable vector.
> *       An overhead of distributing the data elements after the operations.
> 
> Because of these overheads LLVM finds strided memory vectorization is not
> profitable and generating serial code sequence over vectorized code.
> GATHER & SCATTER instruction support is available on few(not all) targets for
> consolidation & distribution operations.
> 
> In this approach we are trying to handle cases like this:
> for (int i = 0; i < len; i++)
>      a[i*3] = b[i*2] + c[i*3];
> 
> We model strided memory load & store using shuffle & load/mask-store
> operations.
> *       Load is modeled as loads followed by shuffle.
> *       Store is modeled as shuffle followed by mask store.
> *       To minimize load and store operation introduced 'SkipFactor'.
> 
> 'SkipFactor':
> *       Multiple load operation required for consolidating data into an operable
> vector.
> *       Between various loads we skip by few offsets to effective consolidate.
> *       SkipFactor is the number of additional offsets required to move from the
> previous vector load memory-end address for the next vector load.
> *       SkipFactor allows all memory loads to be considered as identical (From
> valid element perspective).
> *       SkipFactor = (Stride - (VF % Stride)) % Stride)
> 
> For example:
> 
> 
> How LOAD is modeled:
> Load stride 3 (i.e. load for b [ 3 * i ])
>   %5 = getelementptr inbounds i32, i32* %b, i64 %.lhs
>   %6 = bitcast i32* %5 to <4 x i32>*
>   %stride.load27 = load <4 x i32>, <4 x i32>* %6, align 1, !tbaa !1
>   %7 = getelementptr i32, i32* %5, i64 6
>   %8 = bitcast i32* %7 to <4 x i32>*
>   %stride.load28 = load <4 x i32>, <4 x i32>* %8, align 1, !tbaa !1
>   %strided.vec29 = shufflevector <4 x i32> %stride.load27, <4 x i32>
> %stride.load28, <4 x i32> <i32 0, i32 3, i32 4, i32 7>
> 
> How STORE is modeled:
> Store with stride 3 (i.e. store to c [ 3 * i ])
>  %10 = getelementptr inbounds i32, i32* %c, i64 %.lhs
>   %11 = bitcast i32* %10 to <4 x i32>*
>   %interleaved.vec = shufflevector <4 x i32> %StoreResult, <4 x i32> undef, <4 x
> i32> <i32 0, i32 undef, i32 undef, i32 1>
>   call void @llvm.masked.store.v4i32(<4 x i32> %interleaved.vec, <4 x i32>*
> %11, i32 4, <4 x i1> <i1 true, i1 false, i1 false, i1 true>)
>   %12 = getelementptr i32, i32* %10, i64 6
>   %13 = bitcast i32* %12 to <4 x i32>*
>   %interleaved.vec30 = shufflevector <4 x i32> %StoreResult, <4 x i32> undef,
> <4 x i32> <i32 2, i32 undef, i32 undef, i32 3>
>   call void @llvm.masked.store.v4i32(<4 x i32> %interleaved.vec30, <4 x i32>*
> %13, i32 4, <4 x i1> <i1 true, i1 false, i1 false, i1 true>)
> 
> NOTE: For Stride 3 SkipFactor is 2, that's why the second GEP offset by 6(4+2).
> 
> To enable this feature below LLVM changes required.
> 1)      Identify strided memory access (Its already there i.e. interleave vectorizer
> does this)
> 2)      Costing changes:
> a.      Identify number of Load[s]/Store[s] & Shuffle[s] required to model
> Load/Store operation by considering SkipFactor.
> b.      Return total cost by adding Load[s]/Store[s] and shuffle[s] costs.
> 3)      Transform:
> a.      Generate Shuffle[s] followed by Mask-Store[s] instructions to model a
> Store operation.
> b.      Generate Load[s] followed by Shuffle[s] instructions to model a Load
> operation.
> 
> Implemented a working prototype to demonstrate this feature, its available at
> below review-thread.
> http://reviews.llvm.org/D21363
> 
> Use below option to enable this feature:
> "-mllvm -enable-interleaved-mem-accesses -mllvm -enable-strided-
> vectorization"
> 
> Gains observed with prototype:
> TSVC kernel S111                1.15x
> TSVC kernel S1111               1.42x
> 
> If community is interested in the above work, I have below plan:
> 
> I'm not sure keeping this feature separate or the part of current interleave
> vectorizer ?
> 
> 1)      Will enable interleave vectorizer on X86 (If this feature is going as an
> enhancement on interleave vectorizer)
> 2)      If it's going with interleave vectorizer, will modify its infra to support this.
> 3)      Will enable strided vectorization by submitting changes on costing &
> transform.
> 
> Please share your valuable thoughts.
> 
> Thanks,
> Ashutosh
> 
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev