[PATCH] D137913: [X86] Rewrite `getScalarizationOverhead()`

Wed Dec 7 15:10:16 PST 2022

vdmitrie added inline comments.

================
Comment at: llvm/lib/Target/X86/X86TargetTransformInfo.cpp:4470
+            if (!AffectedLanes[I] ||
+                (Lane == 0 && FullyAffectedLegalVectors[LegalVec]))
+              continue;
----------------
lebedev.ri wrote:
> vdmitrie wrote:
> > vdmitrie wrote:
> > > lebedev.ri wrote:
> > > > vdmitrie wrote:
> > > > > lebedev.ri wrote:
> > > > > > RKSimon wrote:
> > > > > > > vdmitrie wrote:
> > > > > > > > lebedev.ri wrote:
> > > > > > > > > vdmitrie wrote:
> > > > > > > > > > I just came across this patch and I'd like to ask some clarification question as I'd like to better understand the intent.
> > > > > > > > > > A quick remark before the question. In vectorizer term "lane" is typically mapped into individual vector element while when referencing memory/cache that is another thing. So in order to avoid confusion I will be referencing vector element explicitly and use "sub-vector" instead of "Lane" here.
> > > > > > > > > > 
> > > > > > > > > > Lets consider an example where we have 16 elements in of 32bits type each, like <16 x f32 >.
> > > > > > > > > > On avx512 architecture that will be just single legal vector which consists of four sub-vectors of < 4 x f32 > types.
> > > > > > > > > > Now consider two cases:
> > > > > > > > > > 1) we are building a full vector inserting each element of the legal vector
> > > > > > > > > > 2) we are inserting some of elements in every sub-vector in the legal vector (let it be each odd element for simplicity)
> > > > > > > > > > 
> > > > > > > > > > in the case 1, as we are building the entire vector, we don't have any upper elements to preserve so we don't really need insert there. So I assume that this is the case that supposed to be targeted (to be precise this should be not the entire vector build but rather all the upper sub-vectors build)
> > > > > > > > > > 
> > > > > > > > > > in case 2 we have to preserve some elements of upper sub-vectors. If I'm not mistaken It means that we still need to use insert for lowest sub-vector as well in order to preserve these elements (the upper part of the legal vector).
> > > > > > > > > > But as it done in this patch: AffectedLanes will be {1,1,1,1} as each sub-vector has a couple of elements to be inserted (preserving the other two). Derived form that FullyAffectedLegalVector[0] will also be '1' in that case and we will not account for  insert of the lowest sub-vector. Is than done intentionally or this is just a mistake?
> > > > > > > > > > Should we instead to check that we have nothing to preserve in any of the upper sub-vectors?
> > > > > > > > > > 
> > > > > > > > > > I just came across this patch and I'd like to ask some clarification question as I'd like to better understand the intent.
> > > > > > > > > 
> > > > > > > > > > A quick remark before the question. In vectorizer term "lane" is typically mapped into individual vector element while when referencing memory/cache that is another thing. So in order to avoid confusion I will be referencing vector element explicitly and use "sub-vector" instead of "Lane" here.
> > > > > > > > > >
> > > > > > > > > > Lets consider an example where we have 16 elements in of 32bits type each, like <16 x f32 >.
> > > > > > > > > On avx512 architecture that will be just single legal vector which consists of four sub-vectors of < 4 x f32 > types.
> > > > > > > > > > Now consider two cases:
> > > > > > > > > >
> > > > > > > > > > 1. we are building a full vector inserting each element of the legal vector
> > > > > > > > > > 2. we are inserting some of elements in every sub-vector in the legal vector (let it be each odd element for simplicity)
> > > > > > > > > 
> > > > > > > > > (How to boil water: pour water into kettle, boil.
> > > > > > > > > how to boil water in half-full kettle: remove water, and see original recipe.)
> > > > > > > > > 
> > > > > > > > > These two cases are equivalent, or i do not understand the question.
> > > > > > > > > Note that all of the logic here operates on XMM vectors.
> > > > > > > > > 
> > > > > > > > > As long as XMM subvector is touched by an element insertion,
> > > > > > > > > and not elements are overridden, we need to first extract said subvector,
> > > > > > > > > insert into it, and then insert the subvector back
> > > > > > > > > into the original legal vector.
> > > > > > > > > 
> > > > > > > > > In case 2., you said that every XMM subvector receives some elements,
> > > > > > > > > so what subvector would we need to preserve, if all of them are affected?
> > > > > > > > 
> > > > > > > > > (How to boil water: pour water into kettle, boil.
> > > > > > > > > how to boil water in half-full kettle: remove water, and see original recipe.)
> > > > > > > > 
> > > > > > > > Let's be serious and put the kettle aside)
> > > > > > > > 
> > > > > > > > > These two cases are equivalent, or i do not understand the question.
> > > > > > > > 
> > > > > > > > Not really. They are not equivalent IMO. See below.
> > > > > > > > 
> > > > > > > > > Note that all of the logic here operates on XMM vectors.
> > > > > > > > > 
> > > > > > > > > As long as XMM subvector is touched by an element insertion,
> > > > > > > > > and not elements are overridden, we need to first extract said subvector,
> > > > > > > > > insert into it, and then insert the subvector back
> > > > > > > > > into the original legal vector.
> > > > > > > > 
> > > > > > > > in case 1 we don't need to extract any of the upper sub-vectors. That's the point.
> > > > > > > > 
> > > > > > > > > 
> > > > > > > > > In case 2., you said that every XMM subvector receives some elements,
> > > > > > > > > so what subvector would we need to preserve, if all of them are affected?
> > > > > > > > 
> > > > > > > > 
> > > > > > > > Look, the question here is whether we need to insert the lowest (0th) sub-vector back into the original vector. How you are going to write back to 0th sub-vector in the 512bits vector while preserving its upper part?  I hope it is clear why we need to preserve the upper part of the vector while inserting sub-vector 0. We need to insert an xmm into lower part of a zmm. Right? That is case 2
> > > > > > > > EVEX vmovapd xmm,xmm seems to preserve the upper part, so I think you are right, but I just wanted this to be confirmed explicitly that is what is assumed in this cost modeling.
> > > > > > > > 
> > > > > > > > Now for case1, when you have all the upper elements inserted (all of 4th through 15th) it means that we can set elements in xmm0 for example and than insert all upper sub-vectors (i.e. 1,2,3)  into zmm0. Right? So we really can omit insertion of the 0th sub-vector.
> > > > > > > > 
> > > > > > > @vdmitrie I'm struggling to follow to your description - maybe it would be easier if you create a godbolt link with code examples please?
> > > > > > +1 to what @RKSimon said.
> > > > > > Do we affect (insert into) every XMM subvector or not?
> > > > > > +1 to what @RKSimon said.
> > > > > > Do we affect (insert into) every XMM subvector or not?
> > > > > 
> > > > > yes.  I'd like to avoid using term XMM for every sub-vector because an xmm register actually represents only sib-vector 0.
> > > > > 
> > > > > Here is what I was trying to describe:
> > > > > {w, p, w, p} {w, p, w, p} {w, p, w, p} {w, p, w, p}
> > > > >  subv3           subv2        subv1          subv0
> > > > > 
> > > > > "w" - we write to a 512bits vector element, "p" - we preserve the original value of an element.
> > > > > 
> > > > > 
> > > > As you noted, in the code here, lane means 128-bit subvector.
> > > > In that example, every 128-bit subvector of the legal vector is being affected,
> > > > which means we need to extract every every 128-bit subvector of the legal vector,
> > > > and afterwards the situation becomes the same as in the example 1.
> > > > Kettle example is relevant here.
> > > > As you noted, in the code here, lane means 128-bit subvector.
> > > > In that example, every 128-bit subvector of the legal vector is being affected,
> > > > which means we need to extract every every 128-bit subvector of the legal vector,
> > > > and afterwards the situation becomes the same as in the example 1.
> > > > Kettle example is relevant here.
> > > 
> > > So, you mean that we first do extracts and keep every sub-vector on it's own xmm register?
> > > That assumes increased register pressure. So "the kettle" is irrelevant in this regard :)
> > > 
> > > @vdmitrie I'm struggling to follow to your description - maybe it would be easier if you create a godbolt link with code examples please?
> > 
> > https://godbolt.org/z/5qWv4TPq8
> > So, you mean that we first do extracts and keep every sub-vector on it's own xmm register?
> 
> That is the cost model, yes.

> That is the cost model, yes.

It does not seem to reflect actual code generation. So it is kind of cheating.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D137913/new/

https://reviews.llvm.org/D137913