[llvm-dev] [Proposal][RFC] Strided Memory Access Vectorization
Saito, Hideki via llvm-dev
llvm-dev at lists.llvm.org
Thu Jun 30 10:41:32 PDT 2016
As a strong advocate of logical vector representation, I'm counting on the community liking Michael's RFC so that it can proceed sooner rather than later.
I plan to pitch in (e.g., with performance experiments).
> Probably we can depend on the support provided by the RFC below from Michael:
> "Allow loop vectorizer to choose vector widths that generate illegal types"
> In that case the Loop Vectorizer will consider the cost of the bigger types and generate gather/scatter of the same type. Later, code generation will produce the desired instruction[s] by breaking the bigger types down into target-supported ones.
Hopefully you can come up with a short-term workaround if the progress of Michael's work is behind yours.
Thanks,
Hideki
-----Original Message-----
From: Nema, Ashutosh [mailto:Ashutosh.Nema at amd.com]
Sent: Wednesday, June 29, 2016 9:50 PM
To: Saito, Hideki <hideki.saito at intel.com>; Demikhovsky, Elena <elena.demikhovsky at intel.com>; silviu.baranga at gmail.com; Zaks, Ayal <ayal.zaks at intel.com>
Cc: llvm-dev <llvm-dev at lists.llvm.org>; asbirlea at google.com; renato.golin at linaro.org; mssimpso at codeaurora.org; kv.bhat at samsung.com; Shahid, Asghar-ahmad <Asghar-ahmad.Shahid at amd.com>; sanjoy at playingwithpointers.com; mzolotukhin at apple.com; Michael Kuperstein <mkuper at google.com>
Subject: RE: [llvm-dev] [Proposal][RFC] Strided Memory Access Vectorization
One common concern raised about cases where the Loop Vectorizer generates types bigger than the target supports:
Currently, based on the VF, we check the cost and generate the expected set of instruction[s] for the bigger type. This has two challenges: for bigger types the cost is not always correct, and code generation may not produce efficient instruction[s].
Probably we can depend on the support provided by the RFC below from Michael:
"Allow loop vectorizer to choose vector widths that generate illegal types"
In that case the Loop Vectorizer will consider the cost of the bigger types and generate gather/scatter of the same type. Later, code generation will produce the desired instruction[s] by breaking the bigger types down into target-supported ones.
Regards,
Ashutosh
> -----Original Message-----
> From: Saito, Hideki [mailto:hideki.saito at intel.com]
> Sent: Saturday, June 18, 2016 6:00 AM
> To: Nema, Ashutosh <Ashutosh.Nema at amd.com>
> Cc: llvm-dev <llvm-dev at lists.llvm.org>
> Subject: RE: [llvm-dev] [Proposal][RFC] Strided Memory Access
> Vectorization
>
>
> > Vectorizer's output should be as clean as vector code can be so that
> > analyses and optimizers downstream can do a great job optimizing.
>
> I guess I should clarify this philosophical position of mine. In terms
> of vector code optimizations that complicate the output of the vectorizer:
>
> If the vectorizer is the best place to perform the optimization, it should do so.
> This includes cases like:
>   * If the vectorizer doesn't do it, we'll lose the opportunity.
>   * If the vectorizer doesn't do it, the complexity of performing the
>     optimization goes up significantly.
>   * The optimization changes the way vectorization happens (you may
>     consider this as part of "losing the opportunity").
> These should be distinguished from:
>   * If the vectorizer doesn't do it, it'll add complexity to the vectorizer's cost model.
> The compiler's cost modeling in general is already complex enough, since
> there are many things happening downstream. Adding one more
> optimization downstream doesn't change the big picture.
>
> There is certainly a tradeoff against "if the vectorizer does this, we'll
> lose some other optimization downstream",
> even if the vectorizer is the best place to perform a certain optimization
> ---- but this isn't a new problem.
>
> As Ashutosh wrote, the strided memref optimization in this RFC can be done
> later in compilation, and doing so should not be much more complex
> than doing it in the vectorizer. That's why I recommend doing it outside
> of the vectorizer (e.g., in CG prepare), and I'm glad to hear that Ashutosh is
> considering that.
>
> Thanks for reading.
> Hideki
>
> -----Original Message-----
> From: Saito, Hideki
> Sent: Friday, June 17, 2016 4:40 PM
> To: 'Nema, Ashutosh' <Ashutosh.Nema at amd.com>
> Cc: llvm-dev <llvm-dev at lists.llvm.org>
> Subject: RE: [llvm-dev] [Proposal][RFC] Strided Memory Access
> Vectorization
>
>
> > I agree this can be done with gather/scatter intrinsics as well; once
> > we enable these, we need to put the right costing in place. During costing
> > we have to estimate the cost of the load[s], store[s], and shuffle[s], and in CG
> > prepare we have to lower them. In the proposed approach we have
> > introduced a SkipFactor, which helps to reduce the number of load[s] and
> > store[s], i.e., for stride 3 & VF 4 we generate only 2 loads (vs. 3) to model the load operation.
>
> I'm all for properly accounting for the cost. I'm just against leaking
> unnecessary complexity into the IL.
>
> > I did not understand this completely; the vectorizer already identifies the
> > maximum target HW physical vector width, chooses the VF based on it, and then
> > widens the instructions.
>
> Well, the selection of the VF should be based on the cost modeling. If you
> haven't, please follow the thread started by Michael Kuperstein:
> [llvm-dev] [RFC] Allow loop vectorizer to choose vector widths that
> generate illegal types
> In general, a "logical vector" operation can be one full vector on the
> target, less than one full vector (such as a half), or more than one full
> vector (e.g., two or four vectors). LLVM IR instructions can support
> logical vectors (and there is vector type legalization support). I don't
> see why the vectorizer should be restricted to using only target-legal vectors.
>
> Now, let's look at the stride-2 load, VF=4 case, where the target HW can
> accommodate 4 elements in a vector. The elements we need are laid out as
>     A * B * C * D
> (where "*" marks an element we skip). We could try breaking this up into two loads as
>     A * B * + C * D *
> or
>     A * B * + * C * D
> The second one is fine if we can assume that any memory between A and
> D can be accessed.
> However, the first one is bad, since the "*" after D may cause a SEGV.
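>
> As a minimal sketch of that constraint (my own arithmetic, not code from the
> patch), the second partial load has to be anchored so that its last lane lands
> on the last valid stride element:
>
>   #include <cstdio>
>
>   int main() {
>     // Stride-2 access, 4-element HW vector: A, B, C, D live at indices 0, 2, 4, 6.
>     const int Stride = 2, VF = 4;
>     const int LastValid = (VF - 1) * Stride;    // index of D = 6
>     int NaiveOffset = VF;                       // covers 4..7, touches index 7 (past D)
>     int SafeOffset  = LastValid - (VF - 1);     // covers 3..6, ends exactly on D
>     std::printf("naive: %d..%d   safe: %d..%d\n",
>                 NaiveOffset, NaiveOffset + VF - 1, SafeOffset, SafeOffset + VF - 1);
>   }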
>
> What if we want to use VF=8? Under the proposal under consideration, some
> operations (such as ADD) consume the result of the load as an 8-way
> vector (say, <8 x float>), but the load is somehow split up into
> 2x2x4. That's not good for optimizers. A load of an 8-way vector should
> look like a load of 8 elements --- not a convoluted sequence of smaller
> loads and a bunch of shuffles that analyses/optimizations downstream (or
> even a human) will have a hard time recognizing as a load of 8
> elements.
>
> > But I accept this can be done with a hypothetical "stride load/store"
> > intrinsic or the gather/scatter intrinsics: the vectorizer generates these
> > intrinsics, and later, during CodeGen/CG-Prepare, we generate the corresponding instructions.
>
> Thanks. That's what I'd recommend ---- unless someone can clearly show
> that adding this much complexity to the vectorizer's output enables
> some other optimizations (and without hurting optimizations)
> downstream.
>
> Vectorizer's output should be as clean as vector code can be so that
> analyses and optimizers downstream can do a great job optimizing.
>
> Thanks,
> Hideki
>
> -----Original Message-----
> From: Nema, Ashutosh [mailto:Ashutosh.Nema at amd.com]
> Sent: Thursday, June 16, 2016 11:05 PM
> To: Saito, Hideki <hideki.saito at intel.com>
> Cc: llvm-dev <llvm-dev at lists.llvm.org>
> Subject: RE: [llvm-dev] [Proposal][RFC] Strided Memory Access
> Vectorization
>
> Thanks Hideki for going through this proposal.
>
> I agree this can be done with gather/scatter intrinsics as well; once
> we enable these, we need to put the right costing in place. During costing
> we have to estimate the cost of the load[s], store[s], and shuffle[s], and in CG
> prepare we have to lower them. In the proposed approach we have
> introduced a SkipFactor, which helps to reduce the number of load[s] and
> store[s], i.e., for stride 3 & VF 4 we generate only 2 loads (vs. 3) to model the load operation.
>
> > 4) Unfortunately, 3) most likely implies that the wide load used here
> >     needs to be based on the target HW physical
> >     vector width. There are reasons why such type legalization at the
> >     vectorizer should be avoided.
>
> I did not understand this completely; the vectorizer already identifies the
> maximum target HW physical vector width, chooses the VF based on it, and then
> widens the instructions.
>
> > If I remember correctly, Elena Demikhovsky (my colleague) attempted
> > to introduce a stride load/store intrinsic some time ago and got
> > pushback. I think it's best to address the following questions.
> > a) What can this enable over a gather/scatter intrinsic w/ a constant
> >    index vector?
> > b) What can this enable over a hypothetical "stride load/store"
> >    intrinsic?
>
> In this proposal we are not generating any intrinsics; rather, we directly
> generate the required instructions based on the vector factor. Here the
> instruction widths are purely based on the vector factor. Also, we have
> introduced the SkipFactor concept to reduce the number of load[s] & store[s].
>
> But I accept this can be done with a hypothetical "stride load/store"
> intrinsic or the gather/scatter intrinsics: the vectorizer generates these
> intrinsics, and later, during CodeGen/CG-Prepare, we generate the corresponding instructions.
>
> > c) Any chance of introducing stride load/store as a form of wide load/store?
>
> We already widen the instructions, but purely based on the vector factor,
> i.e., for an i32 load with VF 4, the generated wide load will be <4 x i32>.
> We are intentionally not generating wide loads like the interleave
> vectorizer does, because we transform everything in the vectorizer, and the
> wide load would not help us minimize the required load[s] & store[s].
> For example, for VF 4 and stride 3, the interleave vectorizer will generate a
> <12 x i32> wide load (which is really 3 <4 x i32> loads), but this approach generates only 2 <4 x i32> loads.
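>
> As a rough sketch of that arithmetic (my own illustration, not code from the
> prototype), the number of VF-wide loads needed per strided access works out as:
>
>   #include <cstdio>
>
>   int main() {
>     const int VF = 4, Stride = 3;
>     // Interleave-vectorizer style: a <VF*Stride>-wide load, i.e. Stride legal loads.
>     int InterleaveLoads = Stride;                                  // 3
>     // SkipFactor style: a VF-wide load that starts on a valid element holds
>     // (VF - 1) / Stride + 1 usable lanes.
>     int ValidPerLoad    = (VF - 1) / Stride + 1;                   // 2
>     int SkipFactorLoads = (VF + ValidPerLoad - 1) / ValidPerLoad;  // ceil(4/2) = 2
>     std::printf("interleave: %d loads, SkipFactor: %d loads\n",
>                 InterleaveLoads, SkipFactorLoads);
>   }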
>
> Regards,
> Ashutosh
>
> > -----Original Message-----
> > From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On Behalf Of
> > Saito, Hideki via llvm-dev
> > Sent: Thursday, June 16, 2016 5:10 AM
> > To: llvm-dev at lists.llvm.org
> > Subject: Re: [llvm-dev] [Proposal][RFC] Strided Memory Access
> > Vectorization
> >
> > Ashutosh,
> >
> > First, I'm all for enabling general stride load/store support for
> > all targets --- should be just a matter of proper cost modeling. For
> > that matter, we should enable general gather/scatter support for all targets.
> >
> > About the specific approach taken by this RFC:
> > 1) It's best to first state that this is for small constant strides,
> >     where a single wide load/store can cover two or more
> >     valid stride load/store elements. If a wide load/store contains
> >     just one valid element, there isn't any need
> >     to do a wide load/store.
> > 2) Optimal code generation for strided memory accesses is target dependent.
> >     Breaking one gather/scatter w/ a constant index vector into
> >     several IR instructions like below might hurt such optimization,
> >     especially when the vectorization factor is greater than the
> >     target legal vector width ---- this can happen
> >     in mixed data type code (int mixed with double, etc.).
> > 3) Make sure that the last element of an unmasked load is a valid
> >     stride load element ---- otherwise,
> >     an out-of-bounds SEGV can happen. Use of the masked load intrinsic
> >     most likely degrades any benefit over
> >     the gather intrinsic w/ a constant index.
> > 4) Unfortunately, 3) most likely implies that the wide load used here
> >     needs to be based on the target HW physical
> >     vector width. There are reasons why such type legalization at the
> >     vectorizer should be avoided.
> > 5) In general, I think leaking this much complexity into the IR hurts
> >     (rather than enables) optimization.
> >     It's best to keep the vectorizer's output as simple as vector code
> >     can be, so that other optimizers can do their best
> >     at optimizing vector code.
> >
> > If I remember correctly, Elena Demikhovsky (my colleague) attempted
> > to introduce a stride load/store intrinsic some time ago and got
> > pushback. I think it's best to address the following questions.
> > a) What can this enable over a gather/scatter intrinsic w/ a constant
> >    index vector?
> > b) What can this enable over a hypothetical "stride load/store"
> >    intrinsic?
> > c) Any chance of introducing stride load/store as a form of wide load/store?
> >    I understand that there is a big hurdle here, but the reward
> >    should also be pretty good here.
> >
> > One might claim that the cost modeling can be easier, but that is
> > also the downside, since the estimated cost may not match the cost of
> > optimal code.
> >
> > All these things considered, my suggestion is to start from enabling
> > optimal code generation (or optimal lowering in CG prepare) from the
> > gather/scatter intrinsics and to reflect that in the vectorizer's cost model.
> >
> > As a vectorizer-centric person, I'd like eventually to see
> > gather/scatter/stride load/store supported by IR instructions
> > (i.e., not intrinsics), but I'll leave that discussion for another
> > time so that I don't pollute Ashutosh's RFC.
> >
> > Thanks,
> > Hideki Saito
> > Technical Lead of Vectorizer Development
> > Intel Compiler and Languages
> >
> > -----------------------------------
> > Date: Wed, 15 Jun 2016 06:16:58 +0000
> > From: "Nema, Ashutosh via llvm-dev" <llvm-dev at lists.llvm.org>
> > To: llvm-dev <llvm-dev at lists.llvm.org>
> > Subject: [llvm-dev] [Proposal][RFC] Strided Memory Access
> > Vectorization
> >
> > Hi,
> >
> > This is a proposal about implementing strided memory access vectorization.
> >
> > Currently LLVM does not support strided access vectorization
> > completely; some support is available via interleave vectorization.
> >
> > There are two main overheads with strided vectorization:
> > * The overhead of consolidating data into an operable vector.
> > * The overhead of distributing the data elements after the operations.
> >
> > Because of these overheads, LLVM often finds strided memory vectorization
> > unprofitable and generates a serial code sequence instead of vectorized code.
> > GATHER & SCATTER instruction support is available on a few (not all)
> > targets for the consolidation & distribution operations.
> >
> > In this approach we are trying to handle cases like this:
> > for (int i = 0; i < len; i++)
> > a[i*3] = b[i*2] + c[i*3];
> >
> > We model strided memory loads & stores using shuffle & load/mask-store
> > operations:
> > * A load is modeled as loads followed by a shuffle.
> > * A store is modeled as a shuffle followed by a mask store.
> > * To minimize the load and store operations, we introduce a 'SkipFactor'.
> >
> > 'SkipFactor':
> > * Multiple load operations are required to consolidate data into an operable
> >   vector.
> > * Between the various loads we skip a few offsets to consolidate effectively.
> > * SkipFactor is the number of additional offsets required to move from the
> >   previous vector load's memory-end address to the next vector load.
> > * SkipFactor allows all memory loads to be considered identical (from a
> >   valid-element perspective).
> > * SkipFactor = (Stride - (VF % Stride)) % Stride
> >
> > For example, for Stride 3 and VF 4:
> >   SkipFactor = (3 - (4 % 3)) % 3 = 2
> >
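> > As a minimal sketch of this arithmetic (my own illustration with an assumed
> > helper name, not code from the prototype), the SkipFactor also gives the
> > point at which the next vector load starts, VF + SkipFactor elements past the
> > previous load's start:
> >
> >   #include <cstdio>
> >
> >   static int skipFactor(int Stride, int VF) {
> >     return (Stride - (VF % Stride)) % Stride;
> >   }
> >
> >   int main() {
> >     const int VF = 4;
> >     for (int Stride = 2; Stride <= 5; ++Stride)
> >       std::printf("Stride %d, VF %d: SkipFactor %d, next load starts %d elements later\n",
> >                   Stride, VF, skipFactor(Stride, VF), VF + skipFactor(Stride, VF));
> >   }
> >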
> > How LOAD is modeled:
> > Load stride 3 (i.e. load for b [ 3 * i ])
> > %5 = getelementptr inbounds i32, i32* %b, i64 %.lhs
> > %6 = bitcast i32* %5 to <4 x i32>*
> > %stride.load27 = load <4 x i32>, <4 x i32>* %6, align 1, !tbaa !1
> > %7 = getelementptr i32, i32* %5, i64 6
> > %8 = bitcast i32* %7 to <4 x i32>*
> > %stride.load28 = load <4 x i32>, <4 x i32>* %8, align 1, !tbaa !1
> > %strided.vec29 = shufflevector <4 x i32> %stride.load27, <4 x i32> %stride.load28,
> >                    <4 x i32> <i32 0, i32 3, i32 4, i32 7>
> >
> > How STORE is modeled:
> > Store with stride 3 (i.e. store to c [ 3 * i ])
> > %10 = getelementptr inbounds i32, i32* %c, i64 %.lhs
> > %11 = bitcast i32* %10 to <4 x i32>*
> > %interleaved.vec = shufflevector <4 x i32> %StoreResult, <4 x i32> undef,
> >                    <4 x i32> <i32 0, i32 undef, i32 undef, i32 1>
> > call void @llvm.masked.store.v4i32(<4 x i32> %interleaved.vec, <4 x i32>* %11,
> >                    i32 4, <4 x i1> <i1 true, i1 false, i1 false, i1 true>)
> > %12 = getelementptr i32, i32* %10, i64 6
> > %13 = bitcast i32* %12 to <4 x i32>*
> > %interleaved.vec30 = shufflevector <4 x i32> %StoreResult, <4 x i32> undef,
> >                    <4 x i32> <i32 2, i32 undef, i32 undef, i32 3>
> > call void @llvm.masked.store.v4i32(<4 x i32> %interleaved.vec30, <4 x i32>* %13,
> >                    i32 4, <4 x i1> <i1 true, i1 false, i1 false, i1 true>)
> >
> > NOTE: For Stride 3 the SkipFactor is 2, which is why the second GEP is offset by 6 (4 + 2).
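> >
> > As a small sketch (my own reconstruction, not code from the prototype) of how
> > the load shuffle mask <i32 0, i32 3, i32 4, i32 7> shown above falls out of
> > this scheme for Stride 3 and VF 4:
> >
> >   #include <cstdio>
> >
> >   int main() {
> >     const int VF = 4, Stride = 3;
> >     const int ValidPerLoad = (VF - 1) / Stride + 1;   // 2 usable lanes per load
> >     const int NumLoads = VF / ValidPerLoad;           // 2 loads feed the shuffle
> >     // Lane J*Stride of load L maps to index L*VF + J*Stride of the
> >     // concatenated shuffle operands.
> >     std::printf("mask: <");
> >     for (int L = 0, Printed = 0; L < NumLoads; ++L)
> >       for (int J = 0; J < ValidPerLoad; ++J)
> >         std::printf("%s%d", Printed++ ? ", " : "", L * VF + J * Stride);
> >     std::printf(">\n");                               // prints: mask: <0, 3, 4, 7>
> >   }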
> >
> > To enable this feature, the LLVM changes below are required:
> > 1) Identify strided memory accesses (this already exists, i.e., the interleave
> >    vectorizer does this).
> > 2) Costing changes:
> >    a. Identify the number of Load[s]/Store[s] & Shuffle[s] required to model a
> >       Load/Store operation, taking the SkipFactor into account.
> >    b. Return the total cost by adding the Load[s]/Store[s] and Shuffle[s] costs,
> >       as sketched below.
> > 3) Transform:
> >    a. Generate Shuffle[s] followed by Mask-Store[s] instructions to model a
> >       Store operation.
> >    b. Generate Load[s] followed by Shuffle[s] instructions to model a Load
> >       operation.
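> >
> > A minimal sketch of the costing in 2) above (hypothetical helper name and
> > signature, not the prototype's actual code; the real change presumably wires
> > this into the vectorizer's cost model via TargetTransformInfo):
> >
> >   // Total cost of one strided access: the VF-wide loads/masked-stores that
> >   // cover the valid elements, plus the shuffles that pack/unpack the lanes.
> >   unsigned getStridedMemOpCost(unsigned NumMemOps, unsigned MemOpCost,
> >                                unsigned NumShuffles, unsigned ShuffleCost) {
> >     return NumMemOps * MemOpCost + NumShuffles * ShuffleCost;
> >   }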
> >
> > We have implemented a working prototype to demonstrate this feature; it is
> > available at the review thread below:
> > http://reviews.llvm.org/D21363
> >
> > Use the options below to enable this feature:
> > "-mllvm -enable-interleaved-mem-accesses -mllvm -enable-strided-vectorization"
> >
> > Gains observed with prototype:
> > TSVC kernel S111 1.15x
> > TSVC kernel S1111 1.42x
> >
> > If the community is interested in the above work, I have the following plan:
> >
> > (I'm not sure whether to keep this feature separate or make it part of the
> > current interleave vectorizer.)
> >
> > 1) Enable the interleave vectorizer on X86 (if this feature goes in as an
> >    enhancement to the interleave vectorizer).
> > 2) If it goes in with the interleave vectorizer, modify its infrastructure to support this.
> > 3) Enable strided vectorization by submitting the costing & transform changes.
> >
> > Please share your valuable thoughts.
> >
> > Thanks,
> > Ashutosh
> >
> > _______________________________________________
> > LLVM Developers mailing list
> > llvm-dev at lists.llvm.org
> > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev