[llvm-dev] [Proposal][RFC] Strided Memory Access Vectorization

Nema, Ashutosh via llvm-dev llvm-dev at lists.llvm.org
Fri Jul 1 03:06:43 PDT 2016


> > Probably this can depend on the support provided by the RFC below from Michael:
> >  "Allow loop vectorizer to choose vector widths that generate illegal types"
> > In that case the Loop Vectorizer will consider the cost of the bigger types and
> > generate gather / scatter of the same type. Later, code generation will generate
> > the desired instruction[s] by breaking the bigger types into target-supported ones.
> 
> Hopefully, you can come up with a short-term workaround if the progress of
> Michael's work is behind yours.

We can skip those cases for now: whenever the vector factor for a type goes
beyond what the target supports, we will disable strided memory vectorization.
Once Michael's work is accepted, we will enable it to consider these cases.

Regards,
Ashutosh

> -----Original Message-----
> From: Saito, Hideki [mailto:hideki.saito at intel.com]
> Sent: Thursday, June 30, 2016 11:12 PM
> To: Nema, Ashutosh <Ashutosh.Nema at amd.com>; Demikhovsky, Elena
> <elena.demikhovsky at intel.com>; silviu.baranga at gmail.com; Zaks, Ayal
> <ayal.zaks at intel.com>
> Cc: llvm-dev <llvm-dev at lists.llvm.org>; asbirlea at google.com;
> renato.golin at linaro.org; mssimpso at codeaurora.org; kv.bhat at samsung.com;
> Shahid, Asghar-ahmad <Asghar-ahmad.Shahid at amd.com>;
> sanjoy at playingwithpointers.com; mzolotukhin at apple.com; Michael
> Kuperstein <mkuper at google.com>
> Subject: RE: [llvm-dev] [Proposal][RFC] Strided Memory Access Vectorization
> 
> 
> As a strong advocate of logical vector representation, I'm counting on the
> community liking Michael's RFC so that it can proceed sooner rather than later.
> I plan to pitch in (e.g., with perf experiments).
> 
> > Probably this can depend on the support provided by the RFC below from Michael:
> >  "Allow loop vectorizer to choose vector widths that generate illegal types"
> > In that case the Loop Vectorizer will consider the cost of the bigger types and
> > generate gather / scatter of the same type. Later, code generation will generate
> > the desired instruction[s] by breaking the bigger types into target-supported ones.
> 
> Hopefully, you can come up with a short-term workaround if the progress of
> Michael's work is behind yours.
> 
> Thanks,
> Hideki
> 
> -----Original Message-----
> From: Nema, Ashutosh [mailto:Ashutosh.Nema at amd.com]
> Sent: Wednesday, June 29, 2016 9:50 PM
> To: Saito, Hideki <hideki.saito at intel.com>; Demikhovsky, Elena
> <elena.demikhovsky at intel.com>; silviu.baranga at gmail.com; Zaks, Ayal
> <ayal.zaks at intel.com>
> Cc: llvm-dev <llvm-dev at lists.llvm.org>; asbirlea at google.com;
> renato.golin at linaro.org; mssimpso at codeaurora.org; kv.bhat at samsung.com;
> Shahid, Asghar-ahmad <Asghar-ahmad.Shahid at amd.com>;
> sanjoy at playingwithpointers.com; mzolotukhin at apple.com; Michael
> Kuperstein <mkuper at google.com>
> Subject: RE: [llvm-dev] [Proposal][RFC] Strided Memory Access Vectorization
> 
> One common concern raised for cases where the Loop Vectorizer generates bigger
> types than the target supports:
> 
> Currently, based on the VF, we check the cost and generate the expected set of
> instruction[s] for the bigger type. This has two challenges: for bigger types,
> the cost is not always correct, and code generation may not generate efficient
> instruction[s].
> 
> Probably this can depend on the support provided by the RFC below from Michael:
>  "Allow loop vectorizer to choose vector widths that generate illegal types"
> In that case the Loop Vectorizer will consider the cost of the bigger types and
> generate gather / scatter of the same type. Later, code generation will generate
> the desired instruction[s] by breaking the bigger types into target-supported ones.
> 
> Regards,
> Ashutosh
> 
> > -----Original Message-----
> > From: Saito, Hideki [mailto:hideki.saito at intel.com]
> > Sent: Saturday, June 18, 2016 6:00 AM
> > To: Nema, Ashutosh <Ashutosh.Nema at amd.com>
> > Cc: llvm-dev <llvm-dev at lists.llvm.org>
> > Subject: RE: [llvm-dev] [Proposal][RFC] Strided Memory Access
> > Vectorization
> >
> >
> > > Vectorizer's output should be as clean as vector code can be so that
> > > analyses and optimizers downstream can do a great job optimizing.
> >
> > Guess I should clarify this philosophical position of mine. In terms of
> > vector code optimization that complicates the output of the vectorizer:
> >
> > If the vectorizer is the best place to perform the optimization, it should do so.
> >      This includes cases like:
> >            If the vectorizer doesn't do it, we'll lose the opportunity.
> >            If the vectorizer doesn't do it, the complexity of performing the
> >            optimization goes up significantly.
> >            The optimization changes the way vectorization happens (you may
> >            consider this part of "losing the opportunity").
> >      These should be distinguished from:
> >            If the vectorizer doesn't do it, it'll add complexity to the
> >            vectorizer's cost model.
> > Compiler cost modeling in general is already complex enough, since there are
> > many things happening downstream. Adding one more optimization downstream
> > doesn't change the big picture.
> >
> > There is certainly a tradeoff against "if the vectorizer does this, we'll
> > lose some other optimization downstream", even if the vectorizer is the best
> > place to perform a certain optimization ---- but this isn't a new problem.
> >
> > As Ashutosh wrote, the strided-memref optimization in this RFC can be done
> > later in compilation, and doing so should not be much more complex than
> > doing it in the vectorizer. That's why I recommend doing it outside of the
> > vectorizer (e.g., in CG prepare), and I'm glad to hear that Ashutosh is
> > considering that.
> >
> > Thanks for reading.
> > Hideki
> >
> > -----Original Message-----
> > From: Saito, Hideki
> > Sent: Friday, June 17, 2016 4:40 PM
> > To: 'Nema, Ashutosh' <Ashutosh.Nema at amd.com>
> > Cc: llvm-dev <llvm-dev at lists.llvm.org>
> > Subject: RE: [llvm-dev] [Proposal][RFC] Strided Memory Access
> > Vectorization
> >
> >
> > > I agree this can be done with the Gather/Scatter intrinsics as well; once
> > > we enable these, we need to put the right costing in place. During costing
> > > we have to estimate the cost of the load[s], store[s], and shuffle[s], and
> > > in CG prepare we have to lower them. In the proposed approach we have
> > > introduced a SkipFactor, which helps reduce the number of load[s] &
> > > store[s]; i.e., for Stride 3 & VF 4 we generate only 2 loads (vs. 3) to
> > > model the load operation.
> >
> > I'm all for properly accounting for the cost. I'm just against leaking
> > unnecessary complexity into the IL.
> >
> > > I did not understand this completely; the vectorizer already identifies
> > > the maximum target HW physical vector width, chooses the VF based on it,
> > > and then widens the instructions.
> >
> > Well, the selection of the VF should be based on the cost modeling. If you
> > haven't, please follow the thread started by Michael Kuperstein:
> > 	[llvm-dev] [RFC] Allow loop vectorizer to choose vector widths that
> > generate illegal types
> > In general, a "logical vector" operation can be one full vector on the
> > target, less than one full vector (such as half), or more than one full
> > vector (e.g., two or four vectors). LLVM IR instructions can support logical
> > vectors (and have vector type legalization support). I don't see why the
> > vectorizer should be restricted to using target-legalized vectors only.
> >
> > Now, look at the Stride-2 Load with VF=4, where the target HW can
> > accommodate 4 elements in a vector. The elements we want are A, B, C, D,
> > with "*" marking the skipped elements in between:
> > 	A * B * C * D
> > We could try breaking this up into two loads as
> > 	A * B *   +   C * D *
> > or
> > 	A * B *   +   * C * D
> > The second one is fine if we can assume that any memory between A and D can
> > be accessed.
> > However, the first one is bad, since the "*" after D may cause a SEGV.
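> >
> > A minimal C++ sketch of that bounds reasoning (the helper and its numbers
> > are hypothetical, just to make the two split choices concrete):
> >
> > #include <cassert>
> >
> > // Does a wide load of `vf` contiguous elements starting at element index
> > // `start` stay within the last valid strided element `lastValid`?
> > static bool splitIsSafe(int start, int vf, int lastValid) {
> >   return start + vf - 1 <= lastValid;
> > }
> >
> > int main() {
> >   // Stride 2, VF 4: A, B, C, D live at element indices 0, 2, 4, 6.
> >   assert(splitIsSafe(0, 4, 6));   // A * B *    ([0..3])
> >   assert(splitIsSafe(3, 4, 6));   // * C * D    ([3..6], overlapping but safe)
> >   assert(!splitIsSafe(4, 4, 6));  // C * D *    ([4..7] touches memory past D)
> >   return 0;
> > }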
> >
> > What if we want to use VF=8? Under the proposal in consideration, some
> > operations (such as ADD) consume the result of the load as an 8-way vector
> > (say, <8 x float>), but the load is somehow split up into 2x2x4. That's not
> > good for optimizers. A load of an 8-way vector should look like a load of 8
> > elements --- not a convoluted sequence of smaller loads and a bunch of
> > shuffles that analyses/optimizations downstream (or even a human) will have
> > a hard time understanding as a load of 8 elements.
> >
> > > But I accept this can be done with a hypothetical "stride load/store"
> > > intrinsic or the gather/scatter intrinsics. The vectorizer would generate
> > > these intrinsics, and later, during CodeGen/CG-Prepare, we would generate
> > > the corresponding instructions.
> >
> > Thanks. That's what I'd recommend ---- unless someone can clearly show that
> > adding this much complexity to the vectorizer's output enables some other
> > optimization downstream (without hurting the optimizations already there).
> >
> > Vectorizer's output should be as clean as vector code can be so that
> > analyses and optimizers downstream can do a great job optimizing.
> >
> > Thanks,
> > Hideki
> >
> > -----Original Message-----
> > From: Nema, Ashutosh [mailto:Ashutosh.Nema at amd.com]
> > Sent: Thursday, June 16, 2016 11:05 PM
> > To: Saito, Hideki <hideki.saito at intel.com>
> > Cc: llvm-dev <llvm-dev at lists.llvm.org>
> > Subject: RE: [llvm-dev] [Proposal][RFC] Strided Memory Access
> > Vectorization
> >
> > Thanks, Hideki, for going through this proposal.
> >
> > I agree this can be done with the Gather/Scatter intrinsics as well; once
> > we enable these, we need to put the right costing in place. During costing
> > we have to estimate the cost of the load[s], store[s], and shuffle[s], and
> > in CG prepare we have to lower them. In the proposed approach we have
> > introduced a SkipFactor, which helps reduce the number of load[s] &
> > store[s]; i.e., for Stride 3 & VF 4 we generate only 2 loads (vs. 3) to
> > model the load operation.
> >
> > > 4) Unfortunately, 3) most likely implies that the wide load used here
> > > needs to be based on the target HW physical vector width. There are
> > > reasons why such type legalization in the vectorizer should be avoided.
> >
> > I did not understand this completely; the vectorizer already identifies the
> > maximum target HW physical vector width, chooses the VF based on it, and
> > then widens the instructions.
> >
> > > If I remember correctly, Elena Demikhovsky (my colleague) attempted to
> > > introduce a stride load/store intrinsic some time ago and got pushback. I
> > > think it's best to address the following questions.
> > > a) What can this enable over the gather/scatter intrinsic w/ a constant
> > > index vector?
> > > b) What can this enable over a hypothetical "stride load/store" intrinsic?
> >
> > In this proposal we are not generating any intrinsics; rather, we directly
> > generate the required instructions based on the vector factor. The
> > instruction widths here are purely based on the vector factor. Also, we
> > have introduced the SkipFactor concept to reduce the number of load[s] &
> > store[s].
> >
> > But I accept this can be done with a hypothetical "stride load/store"
> > intrinsic or the gather/scatter intrinsics. The vectorizer would generate
> > these intrinsics, and later, during CodeGen/CG-Prepare, we would generate
> > the corresponding instructions.
> >
> > > c) Any chance of introducing stride load/store as a form of wide load/store?
> >
> > We already widen the instructions, but purely based on the vector factor;
> > i.e., for an i32 load with VF 4, the generated wide load will be <4 x i32>.
> > We intentionally do not generate wide loads the way the interleave
> > vectorizer does, because we transform everything in the vectorizer, and the
> > wide load would not help us minimize the required load[s] & store[s]; i.e.,
> > for VF 4 and stride 3, the interleave vectorizer will generate a <12 x i32>
> > wide load (which is actually 3 <4 x i32> loads), but this approach generates
> > only 2 <4 x i32> loads.
> >
> > Regards,
> > Ashutosh
> >
> > > -----Original Message-----
> > > From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On Behalf Of
> > > Saito, Hideki via llvm-dev
> > > Sent: Thursday, June 16, 2016 5:10 AM
> > > To: llvm-dev at lists.llvm.org
> > > Subject: Re: [llvm-dev] [Proposal][RFC] Strided Memory Access
> > > Vectorization
> > >
> > > Ashutosh,
> > >
> > > First, I'm all for enabling general strided load/store support for all
> > > targets --- it should be just a matter of proper cost modeling. For that
> > > matter, we should enable general gather/scatter support for all targets.
> > >
> > > About the specific approach taken by this RFC:
> > > 1) It's best to first state that this is for small constant strides, where
> > >      a single wide load/store can cover two or more valid strided
> > >      load/store elements. If a wide load/store contains just one valid
> > >      element, there is no need to do a wide load/store.
> > > 2) Optimal code generation for strided memory accesses is target dependent.
> > >      Breaking one gather/scatter w/ a constant index vector into several IR
> > >      instructions like below might hurt such optimization, especially when
> > >      the vectorization factor is greater than the target legal vector width
> > >      ---- this can happen in mixed data type code (int mixed with double,
> > >      etc.).
> > > 3) Make sure that the last element of an unmasked load is a valid strided
> > >      load element ---- otherwise, an out-of-bounds SEGV can happen. Use of
> > >      the masked load intrinsic most likely degrades any benefit over the
> > >      gather intrinsic w/ a constant index.
> > > 4) Unfortunately, 3) most likely implies that the wide load used here needs
> > >      to be based on the target HW physical vector width. There are reasons
> > >      why such type legalization in the vectorizer should be avoided.
> > > 5) In general, I think leaking this much complexity into the IR hurts
> > >      (rather than enables) optimization. It's best to keep the vectorizer's
> > >      output as simple as vector code can be, so that other optimizers can
> > >      do their best in optimizing vector code.
> > >
> > > If I remember correctly, Elena Demikhovsky (my colleague) attempted to
> > > introduce a stride load/store intrinsic some time ago and got pushback. I
> > > think it's best to address the following questions.
> > > a) What can this enable over the gather/scatter intrinsic w/ a constant
> > > index vector?
> > > b) What can this enable over a hypothetical "stride load/store" intrinsic?
> > > c) Any chance of introducing stride load/store as a form of wide load/store?
> > >     I understand that there is a big hurdle here, but the reward should
> > >     also be pretty good.
> > >
> > > One might claim that the cost modeling can be easier, but that is also the
> > > downside, since the generated cost may not match the cost of the optimal
> > > code.
> > >
> > > All these things considered, my suggestion is to start by enabling optimal
> > > code generation (or optimal lowering in CG prepare) from the gather/scatter
> > > intrinsics and to reflect that in the vectorizer's cost model.
> > >
> > > As a vectorizer-centric person, I'd like to eventually see
> > > gather/scatter/stride-load/store supported by IR instructions (i.e., not
> > > intrinsics), but I'll leave that discussion to another time so that I
> > > won't pollute Ashutosh's RFC.
> > >
> > > Thanks,
> > > Hideki Saito
> > > Technical Lead of Vectorizer Development
> > > Intel Compiler and Languages
> > >
> > > -----------------------------------
> > > Message: 7
> > > Date: Wed, 15 Jun 2016 06:16:58 +0000
> > > From: "Nema, Ashutosh via llvm-dev" <llvm-dev at lists.llvm.org>
> > > To: llvm-dev <llvm-dev at lists.llvm.org>
> > > Subject: [llvm-dev] [Proposal][RFC] Strided Memory Access
> > > 	Vectorization
> > >
> > > Hi,
> > >
> > > This is a proposal about implementing strided memory access vectorization.
> > >
> > > Currently LLVM does not support strided access vectorization completely;
> > > some support is available via interleave vectorization.
> > >
> > > There are two main overheads with strided vectorization:
> > > *       The overhead of consolidating data into an operable vector.
> > > *       The overhead of distributing the data elements after the operations.
> > >
> > > Because of these overheads, LLVM finds strided memory vectorization
> > > unprofitable and generates a serial code sequence instead of vectorized
> > > code. GATHER & SCATTER instruction support is available on a few (not all)
> > > targets for the consolidation & distribution operations.
> > >
> > > In this approach we are trying to handle cases like this:
> > > for (int i = 0; i < len; i++)
> > >      c[i*3] = b[i*3] + a[i*2];
> > >
> > > We model strided memory loads & stores using shuffle & load/mask-store
> > > operations:
> > > *       A load is modeled as loads followed by a shuffle.
> > > *       A store is modeled as a shuffle followed by mask-stores.
> > > *       To minimize the number of load and store operations, we introduced
> > > a 'SkipFactor'.
> > >
> > > 'SkipFactor':
> > > *       Multiple load operations are required to consolidate data into an
> > > operable vector.
> > > *       Between consecutive loads we skip a few offsets so that the
> > > consolidation is effective.
> > > *       The SkipFactor is the number of additional element offsets to move
> > > past the end address of the previous vector load to reach the next vector
> > > load.
> > > *       The SkipFactor allows all memory loads to be considered identical
> > > (from a valid-element perspective).
> > > *       SkipFactor = (Stride - (VF % Stride)) % Stride
> > >
> > > For example, with Stride 3 and VF 4:
> > > SkipFactor = (3 - (4 % 3)) % 3 = (3 - 1) % 3 = 2,
> > > so consecutive wide loads start VF + SkipFactor = 6 elements apart, and
> > > each wide load carries exactly two valid elements.
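> > >
> > > A minimal C++ sketch of this arithmetic (a standalone illustration under
> > > the formulas above, not part of the prototype):
> > >
> > > #include <cstdio>
> > >
> > > int main() {
> > >   const int Stride = 3, VF = 4;
> > >   // SkipFactor = (Stride - (VF % Stride)) % Stride
> > >   const int Skip = (Stride - (VF % Stride)) % Stride;      // 2 here
> > >   // VF + Skip is a multiple of Stride, so every wide load of VF contiguous
> > >   // elements holds the same number of valid strided elements.
> > >   const int ValidPerLoad = (VF + Skip) / Stride;           // 2 here
> > >   const int NumLoads = (VF + ValidPerLoad - 1) / ValidPerLoad; // ceil -> 2 (vs. 3)
> > >   for (int L = 0; L < NumLoads; ++L)
> > >     printf("wide load %d starts at element offset %d\n", L, L * (VF + Skip));
> > >   return 0;  // prints offsets 0 and 6; 6 = VF + SkipFactor, as in the IR below.
> > > }
> > >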
> > > How LOAD is modeled:
> > > Load with stride 3 (i.e., the load of b[3 * i]):
> > >   ; First wide load: 4 contiguous elements; lanes 0 and 3 are the valid
> > >   ; strided elements.
> > >   %5 = getelementptr inbounds i32, i32* %b, i64 %.lhs
> > >   %6 = bitcast i32* %5 to <4 x i32>*
> > >   %stride.load27 = load <4 x i32>, <4 x i32>* %6, align 1, !tbaa !1
> > >   ; Second wide load at element offset 6 = VF (4) + SkipFactor (2).
> > >   %7 = getelementptr i32, i32* %5, i64 6
> > >   %8 = bitcast i32* %7 to <4 x i32>*
> > >   %stride.load28 = load <4 x i32>, <4 x i32>* %8, align 1, !tbaa !1
> > >   ; Consolidate the valid lanes (0 and 3 of each load) into one vector.
> > >   %strided.vec29 = shufflevector <4 x i32> %stride.load27, <4 x i32>
> > > %stride.load28, <4 x i32> <i32 0, i32 3, i32 4, i32 7>
> > >
> > > How STORE is modeled:
> > > Store with stride 3 (i.e., the store to c[3 * i]):
> > >   ; Distribute results 0 and 1 into lanes 0 and 3, then mask-store only
> > >   ; those two lanes.
> > >   %10 = getelementptr inbounds i32, i32* %c, i64 %.lhs
> > >   %11 = bitcast i32* %10 to <4 x i32>*
> > >   %interleaved.vec = shufflevector <4 x i32> %StoreResult, <4 x i32>
> > > undef, <4 x i32> <i32 0, i32 undef, i32 undef, i32 1>
> > >   call void @llvm.masked.store.v4i32(<4 x i32> %interleaved.vec, <4 x
> > > i32>* %11, i32 4, <4 x i1> <i1 true, i1 false, i1 false, i1 true>)
> > >   ; Second mask-store at element offset 6 = VF (4) + SkipFactor (2) writes
> > >   ; results 2 and 3.
> > >   %12 = getelementptr i32, i32* %10, i64 6
> > >   %13 = bitcast i32* %12 to <4 x i32>*
> > >   %interleaved.vec30 = shufflevector <4 x i32> %StoreResult, <4 x i32>
> > > undef, <4 x i32> <i32 2, i32 undef, i32 undef, i32 3>
> > >   call void @llvm.masked.store.v4i32(<4 x i32> %interleaved.vec30, <4 x
> > > i32>* %13, i32 4, <4 x i1> <i1 true, i1 false, i1 false, i1 true>)
> > >
> > > NOTE: For Stride 3 the SkipFactor is 2; that's why the second GEP is
> > > offset by 6 (4 + 2).
> > >
> > > To enable this feature, the following LLVM changes are required:
> > > 1)      Identify strided memory accesses (this already exists; i.e., the
> > > interleave vectorizer does this).
> > > 2)      Costing changes:
> > > a.      Identify the number of Load[s]/Store[s] & Shuffle[s] required to
> > > model the Load/Store operation, considering the SkipFactor.
> > > b.      Return the total cost by adding the Load[s]/Store[s] and Shuffle[s]
> > > costs (a sketch follows this list).
> > > 3)      Transform:
> > > a.      Generate Shuffle[s] followed by Mask-Store[s] instructions to model
> > > a Store operation.
> > > b.      Generate Load[s] followed by Shuffle[s] instructions to model a
> > > Load operation.
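> > >
> > > A hedged C++ sketch of what 2a/2b might look like; the struct and helper
> > > names are illustrative stand-ins for TargetTransformInfo cost queries, not
> > > the actual patch:
> > >
> > > // Illustrative stand-ins for TTI cost queries.
> > > struct TTICosts {
> > >   unsigned WideLoad;    // one <VF x Ty> load
> > >   unsigned MaskedStore; // one llvm.masked.store of <VF x Ty>
> > >   unsigned Shuffle;     // one shufflevector
> > > };
> > >
> > > // Number of wide memory ops needed for one strided access under the
> > > // SkipFactor scheme.
> > > static unsigned numWideMemOps(unsigned VF, unsigned Stride) {
> > >   unsigned Skip = (Stride - (VF % Stride)) % Stride;
> > >   unsigned ValidPerOp = (VF + Skip) / Stride;
> > >   return (VF + ValidPerOp - 1) / ValidPerOp; // ceil(VF / ValidPerOp)
> > > }
> > >
> > > // Load: NumLoads wide loads, then shuffles consolidating the valid lanes
> > > // (2 loads + 1 shuffle for Stride 3, VF 4).
> > > unsigned stridedLoadCost(unsigned VF, unsigned Stride, const TTICosts &C) {
> > >   unsigned NumLoads = numWideMemOps(VF, Stride);
> > >   return NumLoads * C.WideLoad + (NumLoads - 1) * C.Shuffle;
> > > }
> > >
> > > // Store: one distributing shuffle feeding each mask-store
> > > // (2 shuffles + 2 mask-stores for Stride 3, VF 4).
> > > unsigned stridedStoreCost(unsigned VF, unsigned Stride, const TTICosts &C) {
> > >   unsigned NumStores = numWideMemOps(VF, Stride);
> > >   return NumStores * (C.Shuffle + C.MaskedStore);
> > > }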
> > >
> > > We implemented a working prototype to demonstrate this feature; it is
> > > available at the review thread below:
> > > http://reviews.llvm.org/D21363
> > >
> > > Use the options below to enable this feature:
> > > "-mllvm -enable-interleaved-mem-accesses -mllvm -enable-strided-vectorization"
> > >
> > > Gains observed with the prototype:
> > > TSVC kernel S111                1.15x
> > > TSVC kernel S1111               1.42x
> > >
> > > If the community is interested in the above work, I have the following plan:
> > >
> > > I'm not sure whether to keep this feature separate or make it part of the
> > > current interleave vectorizer.
> > >
> > > 1)      Enable the interleave vectorizer on X86 (if this feature is going
> > > in as an enhancement to the interleave vectorizer).
> > > 2)      If it's going with the interleave vectorizer, modify its
> > > infrastructure to support this.
> > > 3)      Enable strided vectorization by submitting the costing & transform
> > > changes.
> > >
> > > Please share your valuable thoughts.
> > >
> > > Thanks,
> > > Ashutosh
> > >
> > > _______________________________________________
> > > LLVM Developers mailing list
> > > llvm-dev at lists.llvm.org
> > > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

