[llvm-dev] [Proposal][RFC] Strided Memory Access Vectorization

Nema, Ashutosh via llvm-dev llvm-dev at lists.llvm.org
Sat Jun 18 04:43:36 PDT 2016


> Now, looking at Stride-2 Loads and VF=4 case, where target HW can
> accommodate 4 elements in the vector.
> 	A * B * C * D
> We could try breaking up into two loads in
> 	A * B *   + C * D *
> or
> 	A * B *   + * C * D
> The second one is fine if we can assume that any memory between A and D
> can be accessed.
> However, the first one is bad since the "*" after D may cause SEGV.

True, accessing the boundary element may be unsafe.
To handle this, we have introduced generation of a masked load
for the last required load in a loop iteration.
This can be enabled with the "-enable-strided-mask-load" option.

e.g., for arr[2*i] with VF 8 it will generate the below instructions:

  %12 = getelementptr inbounds i32, i32* %arr, i64 %11
  %13 = bitcast i32* %12 to <8 x i32>*
  %stride.load = load <8 x i32>, <8 x i32>* %13, align 4, !tbaa !1
  %14 = getelementptr i32, i32* %12, i64 8
  %15 = bitcast i32* %14 to <8 x i32>*
  %wide.masked.load = call <8 x i32> @llvm.masked.load.v8i32(<8 x i32>* %15, i32 4, <8 x i1> <i1 true, i1 false, i1 true, i1 false, i1 true, i1 false, i1 true, i1 false>, <8 x i32> undef), !tbaa !1
  %strided.vec = shufflevector <8 x i32> %stride.load, <8 x i32> %wide.masked.load, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14>

NOTE: the last load is a masked load.
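
As an illustration (a sketch of the intended semantics, not code from the patch), the mask for the last load can be derived per lane: lane L holds a needed element exactly when L is a multiple of the stride, since every partial load starts on a needed element. All lanes past the last needed one stay masked off and are never dereferenced, which is what avoids the out-of-bounds access discussed above. The helper name below is hypothetical.

  #include <cstdio>
  #include <vector>

  // Sketch: compute the per-lane mask for the last strided load.
  // For Stride = 2, VF = 8 this yields 1 0 1 0 1 0 1 0, matching the
  // <8 x i1> constant in the masked load above.
  static std::vector<bool> lastLoadMask(unsigned Stride, unsigned VF) {
    std::vector<bool> Mask(VF, false);
    for (unsigned Lane = 0; Lane < VF; Lane += Stride)
      Mask[Lane] = true;
    return Mask;
  }

  int main() {
    for (bool B : lastLoadMask(2, 8))
      std::printf("%d ", B ? 1 : 0);
    std::printf("\n");
    return 0;
  }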

> What if we want to use VF=8? Under the proposal in consideration, some
> operations (such as ADD)
> are consuming the result of load in 8-way vector (say, <8 x float>), but the load
> is somehow split up
> into 2x2x4. That's not good for optimizers. Load of 8-way vector should look like
> load of 8 elements ---
> not a convoluted sequence of smaller loads and bunch of shuffles that
> analysis/optimization downstream
> (or even human) will have a hard time understanding that as a load of 8
> elements.

This approach widens instructions purely based on the vector factor.
For VF 8 the intent is to generate 8-wide loads (i.e. <8 x i32>), so that is what we generate.

e.g., for arr[2*i] with VF 8 it will generate the below instructions:

  %12 = getelementptr inbounds i32, i32* %arr, i64 %11
  %13 = bitcast i32* %12 to <8 x i32>*
  %stride.load = load <8 x i32>, <8 x i32>* %13, align 4, !tbaa !1
  %14 = getelementptr i32, i32* %12, i64 8
  %15 = bitcast i32* %14 to <8 x i32>*
  %stride.load27 = load <8 x i32>, <8 x i32>* %15, align 4, !tbaa !1
  %strided.vec = shufflevector <8 x i32> %stride.load, <8 x i32> %stride.load27, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14>

NOTE: It generates only 2 loads of width 8.
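
For completeness, here is a small C++ sketch (my reconstruction from the examples in this thread, not code from the patch) of how the number of partial loads and the shuffle indices can be derived from Stride and VF using the SkipFactor formula from the original proposal. For Stride 2 / VF 8 it reproduces the indices 0,2,...,14 used above, and for Stride 3 / VF 4 it reproduces the <0,3,4,7> shuffle from the original RFC; cases that need more than two partial loads would additionally require a tree of shuffles, which is not shown here.

  #include <cstdio>
  #include <utility>
  #include <vector>

  // Sketch: plan a strided load for a given Stride and VF.
  // SkipFactor = (Stride - (VF % Stride)) % Stride, per the original RFC.
  // Consecutive partial loads are VF + SkipFactor elements apart, so every
  // partial load starts on a needed element.
  struct StridedLoadPlan {
    unsigned NumLoads;                 // partial <VF x iN> loads per iteration
    std::vector<unsigned> ShuffleIdx;  // indices into the concatenated loads
  };

  static StridedLoadPlan planStridedLoad(unsigned Stride, unsigned VF) {
    unsigned SkipFactor = (Stride - (VF % Stride)) % Stride;
    unsigned LoadDistance = VF + SkipFactor;   // elements between load starts
    unsigned ElemsPerLoad = (VF - 1) / Stride + 1;
    StridedLoadPlan P;
    P.NumLoads = (VF + ElemsPerLoad - 1) / ElemsPerLoad;
    for (unsigned J = 0; J < VF; ++J) {
      unsigned Load = J / ElemsPerLoad;                  // load holding element J
      unsigned Lane = J * Stride - Load * LoadDistance;  // lane inside that load
      P.ShuffleIdx.push_back(Load * VF + Lane);
    }
    return P;
  }

  int main() {
    const std::pair<unsigned, unsigned> Cases[] = {{2, 8}, {3, 4}};
    for (const auto &C : Cases) {
      StridedLoadPlan P = planStridedLoad(C.first, C.second);
      std::printf("stride %u VF %u: %u loads, shuffle indices:", C.first,
                  C.second, P.NumLoads);
      for (unsigned I : P.ShuffleIdx)
        std::printf(" %u", I);
      std::printf("\n");
    }
    return 0;
  }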

Thanks,
Ashutosh

> -----Original Message-----
> From: Saito, Hideki [mailto:hideki.saito at intel.com]
> Sent: Saturday, June 18, 2016 5:10 AM
> To: Nema, Ashutosh <Ashutosh.Nema at amd.com>
> Cc: llvm-dev <llvm-dev at lists.llvm.org>
> Subject: RE: [llvm-dev] [Proposal][RFC] Strided Memory Access Vectorization
> 
> 
> >I agree this can be done with gather/scatter intrinsics as well; once we enable
> these we need to put the right costing in place. During costing we have to
> estimate the cost of load[s], store[s] and shuffle[s], and in CG prepare we have
> to lower them. In the proposed approach we have introduced SkipFactor, which
> helps reduce the number of load[s] & store[s], i.e. for Stride 3 & VF 4 we
> generate only 2 loads (vs 3) to model the load operation.
> 
> I'm all for properly accounting the cost. I'm just against leaking unnecessary
> complexity to the IL.
> 
> >I did not completely understand this; the vectorizer already identifies the
> maximum target HW physical vector width, chooses the VF based on it, and then
> widens the instructions.
> 
> Well, the selection of VF should be based on the cost modeling. If you haven't,
> please follow the thread started by Michael Kuperstein.
> 	[llvm-dev] [RFC] Allow loop vectorizer to choose vector widths that
> generate illegal types
> In general, a "logical vector" operation can be one full vector on the target, less
> than one full vector (such as half), or more than one full vector
> (e.g., two or four vectors). LLVM IR instructions can support logical vectors (and
> have vector type legalization support). I don't see why the vectorizer
> should be restricted to using target-legalized vectors only.
> 
> Now, looking at Stride-2 Loads and VF=4 case, where target HW can
> accommodate 4 elements in the vector.
> 	A * B * C * D
> We could try breaking up into two loads in
> 	A * B *   + C * D *
> or
> 	A * B *   + * C * D
> The second one is fine if we can assume that any memory between A and D
> can be accessed.
> However, the first one is bad since the "*" after D may cause SEGV.
> 
> What if we want to use VF=8? Under the proposal in consideration, some
> operations (such as ADD)
> are consuming the result of load in 8-way vector (say, <8 x float>), but the load
> is somehow split up
> into 2x2x4. That's not good for optimizers. Load of 8-way vector should look like
> load of 8 elements ---
> not a convoluted sequence of smaller loads and bunch of shuffles that
> analysis/optimization downstream
> (or even human) will have a hard time understanding that as a load of 8
> elements.
> 
> >But I accept this can be done with a hypothetical "stride load/store" intrinsic or
> the gather/scatter intrinsics: the vectorizer would generate these intrinsics and
> later, during CodeGen/CG prepare, generate the corresponding instructions.
> 
> Thanks. That's what I'd recommend ---- unless someone can clearly show that
> adding this much complexity to the vectorizer's output
> enables some other optimizations (and without hurting optimizations)
> downstream.
> 
> Vectorizer's output should be as clean as vector code can be so that analyses
> and optimizers downstream can
> do a great job optimizing.
> 
> Thanks,
> Hideki
> 
> -----Original Message-----
> From: Nema, Ashutosh [mailto:Ashutosh.Nema at amd.com]
> Sent: Thursday, June 16, 2016 11:05 PM
> To: Saito, Hideki <hideki.saito at intel.com>
> Cc: llvm-dev <llvm-dev at lists.llvm.org>
> Subject: RE: [llvm-dev] [Proposal][RFC] Strided Memory Access Vectorization
> 
> Thanks Hideki for going through this proposal.
> 
> I agree this can be done with gather/scatter intrinsics as well; once we enable
> these we need to put the right costing in place. During costing we have to
> estimate the cost of load[s], store[s] and shuffle[s], and in CG prepare we have
> to lower them. In the proposed approach we have introduced SkipFactor, which
> helps reduce the number of load[s] & store[s], i.e. for Stride 3 & VF 4 we
> generate only 2 loads (vs 3) to model the load operation.
> 
> > 4) Unfortunately, 3) most likely imply that the wide load used here
> > needs to be based on target HW physical
> >      vector width. There are reasons why such type legalization at
> > vectorizer should be avoided.
> 
> I did not completely understand this; the vectorizer already identifies the
> maximum target HW physical vector width, chooses the VF based on it, and then
> widens the instructions.
> 
> > If I remember correctly, Elena Demikhovsky (my colleague) attempted to
> > introduce stride load/store intrinsic sometime ago and got a push
> > back. I think it's best to address the following questions.
> > a) What can this enable over gather/scatter intrinsic w/ constant
> > index vector
> > b) What can this enable over hypothetical "stride load/store"
> > intrinsic
> 
> In this proposal we are not generating any intrinsics; rather, we directly generate
> the required instructions based on the vector factor. Here the instruction width is
> purely based on the vector factor. Also, we have introduced the SkipFactor concept
> to reduce the number of load[s] & store[s].
> 
> But I accept this can be done with a hypothetical "stride load/store" intrinsic or
> the gather/scatter intrinsics: the vectorizer would generate these intrinsics and
> later, during CodeGen/CG prepare, generate the corresponding instructions.
> 
> > c) Any chance of introducing stride load/store as a form of wide load/store?
> 
> We already widen the instructions, but purely based on the vector factor,
> i.e. for an i32 load with VF 4 the generated wide load will be <4 x i32>.
> We intentionally do not generate wide loads like the interleaved vectorizer,
> because we transform everything in the vectorizer and the wide load would not
> help us minimize the required load[s] & store[s].
> i.e. for VF 4 and stride 3 the interleaved vectorizer will generate a <12 x i32>
> wide load (which is actually 3 <4 x i32> loads), but this approach generates
> only 2 <4 x i32> loads.
> 
> Regards,
> Ashutosh
> 
> > -----Original Message-----
> > From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On Behalf Of
> > Saito, Hideki via llvm-dev
> > Sent: Thursday, June 16, 2016 5:10 AM
> > To: llvm-dev at lists.llvm.org
> > Subject: Re: [llvm-dev] [Proposal][RFC] Strided Memory Access
> > Vectorization
> >
> > Ashutosh,
> >
> > First, I'm all for enabling general stride load/store support for all
> > targets --- should be just a matter of proper cost modeling. For that
> > matter, we should enable general gather/scatter support for all targets.
> >
> > About the specific approach taken by this RFC:
> > 1) It's best to first state that this is for small constant stride
> > where a single wide load/store can cover two or more
> >      valid stride load/store elements. If a wide load/store contains
> > just one valid element, there is no need
> >      to do a wide load/store.
> > 2) Optimal code generation for strided memory accesses are target
> dependent.
> >      Breaking one gather/scatter w/ constant index vector into several
> > IR instructions like below might hurt such optimization,
> >      especially when the vectorization factor is greater than the
> > target legal vector width ---- this can happen
> >      in mixed data type code (int mixed with double, etc.)
> > 3) Make sure that the last element of unmasked load is a valid stride
> > load element ---- otherwise,
> >      out-of-bounds SEGV can happen. Use of masked load intrinsic most
> > likely degrades any benefits over
> >      gather intrinsic w/ constant index.
> > 4) Unfortunately, 3) most likely imply that the wide load used here
> > needs to be based on target HW physical
> >      vector width. There are reasons why such type legalization at
> > vectorizer should be avoided.
> > 5) In general, I think leaking this much complexity into IR hurts
> > (rather than
> > enable) optimization.
> >      It's best to keep vectorizer's output as simple as vector code
> > can be such that other optimizers can do their best
> >      in optimizing vector code.
> >
> > If I remember correctly, Elena Demikhovsky (my colleague) attempted to
> > introduce stride load/store intrinsic sometime ago and got a push
> > back. I think it's best to address the following questions.
> > a) What can this enable over gather/scatter intrinsic w/ constant
> > index vector
> > b) What can this enable over hypothetical "stride load/store"
> > intrinsic
> > c) Any chance of introducing stride load/store as a form of wide load/store?
> >     I understand that there is a big hurdle here, but the reward
> > should also be pretty good here.
> >
> > One might claim that the cost modeling can be easier, but that is also
> > the downside since the generated cost may not match the cost of optimal
> code.
> >
> > All these things considered, my suggestion is to start from enabling
> > optimal code generation (or optimal lowering in CG prepare) from
> > gather/scatter intrinsic and reflect that in vectorizer's cost model.
> >
> > As a vectorizer centric person, I'd like to eventually see
> > gather/scatter/stride- load/store to be supported by IR instructions
> > (i.e., not intrinsics), but I'll leave that discussion to another time so that I
> won't pollute Ashutosh's RFC.
> >
> > Thanks,
> > Hideki Saito
> > Technical Lead of Vectorizer Development Intel Compiler and Languages
> >
> > -----------------------------------
> > Message: 7
> > Date: Wed, 15 Jun 2016 06:16:58 +0000
> > From: "Nema, Ashutosh via llvm-dev" <llvm-dev at lists.llvm.org>
> > To: llvm-dev <llvm-dev at lists.llvm.org>
> > Subject: [llvm-dev] [Proposal][RFC] Strided Memory Access
> > 	Vectorization
> > Message-ID:
> > 	<BLUPR12MB0644468A76C335E9B7431853FB550 at BLUPR12MB0644.na
> > mprd12.prod.outlook.com>
> >
> > Content-Type: text/plain; charset="utf-8"
> >
> > Hi,
> >
> > This is a proposal about implementing strided memory access vectorization.
> >
> > Currently LLVM does not fully support strided access vectorization;
> > some support is available via interleaved vectorization.
> >
> > There are two main overheads with strided vectorization:
> > *       An overhead of consolidating data into an operable vector.
> > *       An overhead of distributing the data elements after the operations.
> >
> > Because of these overheads LLVM finds strided memory vectorization
> > unprofitable and generates a serial code sequence instead of vectorized code.
> > GATHER & SCATTER instruction support is available on a few (not all)
> > targets for the consolidation & distribution operations.
> >
> > In this approach we are trying to handle cases like this:
> > for (int i = 0; i < len; i++)
> >      a[i*3] = b[i*2] + c[i*3];
> >
> > We model strided memory load & store using shuffle & load/mask-store
> > operations.
> > *       Load is modeled as loads followed by a shuffle.
> > *       Store is modeled as a shuffle followed by a mask store.
> > *       To minimize load and store operations we introduced a 'SkipFactor'.
> >
> > 'SkipFactor':
> > *       Multiple load operations are required to consolidate data into an operable
> > vector.
> > *       Between the loads we skip a few offsets so the consolidation is effective.
> > *       SkipFactor is the number of additional offsets needed to move from the
> > end address of the previous vector load to the start of the next vector load.
> > *       SkipFactor allows all memory loads to be treated as identical (from a
> > valid-element perspective).
> > *       SkipFactor = (Stride - (VF % Stride)) % Stride
> >
> > For example, for Stride 3 and VF 4: SkipFactor = (3 - (4 % 3)) % 3 = 2.
> >
> >
> > How LOAD is modeled:
> > Load stride 3 (i.e. load for b [ 3 * i ])
> >   %5 = getelementptr inbounds i32, i32* %b, i64 %.lhs
> >   %6 = bitcast i32* %5 to <4 x i32>*
> >   %stride.load27 = load <4 x i32>, <4 x i32>* %6, align 1, !tbaa !1
> >   %7 = getelementptr i32, i32* %5, i64 6
> >   %8 = bitcast i32* %7 to <4 x i32>*
> >   %stride.load28 = load <4 x i32>, <4 x i32>* %8, align 1, !tbaa !1
> >   %strided.vec29 = shufflevector <4 x i32> %stride.load27, <4 x i32>
> > %stride.load28, <4 x i32> <i32 0, i32 3, i32 4, i32 7>
> >
> > How STORE is modeled:
> > Store with stride 3 (i.e. store to c [ 3 * i ])
> >  %10 = getelementptr inbounds i32, i32* %c, i64 %.lhs
> >   %11 = bitcast i32* %10 to <4 x i32>*
> >   %interleaved.vec = shufflevector <4 x i32> %StoreResult, <4 x i32>
> > undef, <4 x
> > i32> <i32 0, i32 undef, i32 undef, i32 1>
> >   call void @llvm.masked.store.v4i32(<4 x i32> %interleaved.vec, <4 x
> > i32>* %11, i32 4, <4 x i1> <i1 true, i1 false, i1 false, i1 true>)
> >   %12 = getelementptr i32, i32* %10, i64 6
> >   %13 = bitcast i32* %12 to <4 x i32>*
> >   %interleaved.vec30 = shufflevector <4 x i32> %StoreResult, <4 x i32>
> > undef,
> > <4 x i32> <i32 2, i32 undef, i32 undef, i32 3>
> >   call void @llvm.masked.store.v4i32(<4 x i32> %interleaved.vec30, <4
> > x i32>* %13, i32 4, <4 x i1> <i1 true, i1 false, i1 false, i1 true>)
> >
> > NOTE: For Stride 3 the SkipFactor is 2; that's why the second GEP is offset by 6 (4 + 2).
> >
> > To enable this feature the below LLVM changes are required.
> > 1)      Identify strided memory accesses (this already exists, i.e. the interleaved
> > vectorizer does this).
> > 2)      Costing changes:
> > a.      Identify the number of Load[s]/Store[s] & Shuffle[s] required to model a
> > Load/Store operation, considering the SkipFactor.
> > b.      Return the total cost by adding the Load[s]/Store[s] and Shuffle[s] costs.
> > 3)      Transform:
> > a.      Generate Shuffle[s] followed by Mask-Store[s] instructions to model a
> > Store operation.
> > b.      Generate Load[s] followed by Shuffle[s] instructions to model a Load
> > operation.
> >
> > We implemented a working prototype to demonstrate this feature; it is
> > available at the review thread below.
> > http://reviews.llvm.org/D21363
> >
> > Use the below options to enable this feature:
> > "-mllvm -enable-interleaved-mem-accesses -mllvm -enable-strided-vectorization"
> >
> > Gains observed with prototype:
> > TSVC kernel S111                1.15x
> > TSVC kernel S1111               1.42x
> >
> > If the community is interested in the above work, I have the below plan:
> >
> > I'm not sure whether to keep this feature separate or make it part of the
> > current interleaved vectorizer.
> >
> > 1)      Enable the interleaved vectorizer on X86 (if this feature goes in as an
> > enhancement to the interleaved vectorizer).
> > 2)      If it goes with the interleaved vectorizer, modify its infra to support this.
> > 3)      Enable strided vectorization by submitting the costing & transform
> > changes.
> >
> > Please share your valuable thoughts.
> >
> > Thanks,
> > Ashutosh
> >
> > _______________________________________________
> > LLVM Developers mailing list
> > llvm-dev at lists.llvm.org
> > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

