[PATCH] D92208: [AArch64][CostModel] Fixed costs for mul <2 x i64>

Fri Nov 27 06:20:54 PST 2020

dmgreen added inline comments.

================
Comment at: llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp:647
+    if (VecTy && IsInt64)
+      return 1 * VecTy->getNumElements() + VecTy->getNumElements();
+    return (Cost + 1) * LT.first;
----------------
SjoerdMeijer wrote:
> SjoerdMeijer wrote:
> > dmgreen wrote:
> > > Hmm. According to this it should have a cost around 8:
> > > https://godbolt.org/z/fjjEc7
> > > LT.first is the cost factor to get it to the MVT::v2i64 type. getScalarizationOverhead could be used to get that overhead.
> > > 
> > > What do you think of something like LT.first * (2 + 2*getScalarizationOverhead(extract) + getScalarizationOverhead(insert)) ? I'm not sure what cost that would give.
> > > Hmm. According to this it should have a cost around 8:
> > > https://godbolt.org/z/fjjEc7
> > 
> > I excluded the movs. In that link/example, the last two movs are for returning the vector, and the first 2 to shuffle arguments in place.
> > Thus, I think the instruction costs are: 1 instruction for the lane extract, and 1 for the scalar mul. So for a <2 x i64> we would get 1 * 2 + 2 = 4; that's what I was trying to model here. What do you think?
> I meant this is what we do for one lane:
> 
> > Thus, I think the instruction costs are: 1 instruction for the lane extract, and 1 for the scalar mul
> 
> so this * 2 for both lanes.
It would still have to move the vector over to integer registers for both inputs and put the result back afterwards. For vectors it makes sense to assume the values will be in vector registers (which in this case means two cross-register-bank copies). Something like this behaves the same way: https://godbolt.org/z/M9h73n.

The cost should probably be high, as far as I understand. It's often worse to vectorize and then scalarize, as opposed to just keeping the original scalar code. And 8 would be OK for the number of instructions. Even if they are MOVs, cross-register-bank copies are often expensive.
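
To make the arithmetic concrete, here is a rough standalone sketch comparing the two counts. It is not the actual TTI code: the LaneCosts struct, the function names, and the unit cost of 1 per operation are all assumptions for illustration, and the real numbers would come from getScalarizationOverhead.

```cpp
// Hypothetical unit costs; none of these values are taken from
// AArch64TargetTransformInfo.cpp.
struct LaneCosts {
  unsigned ExtractPerOperand = 1; // vector lane -> GPR, once per input operand
  unsigned ScalarMul = 1;         // the scalar 64-bit multiply
  unsigned InsertResult = 1;      // GPR -> vector lane for the result
};

// The formula in the patch as posted: one extract plus one mul per lane.
unsigned costAsPosted(unsigned NumElts) {
  return 1 * NumElts + NumElts;
}

// Counting the copies for both inputs and the copy back, scaled by the
// legalization factor (LT.first in the review).
unsigned costWithCopies(unsigned NumElts, unsigned LTFirst,
                        LaneCosts C = LaneCosts()) {
  unsigned PerLane = 2 * C.ExtractPerOperand + C.ScalarMul + C.InsertResult;
  return LTFirst * NumElts * PerLane;
}
```

Under these assumed unit costs, costAsPosted(2) gives 4 and costWithCopies(2, 1) gives 8 for a <2 x i64> mul, which lines up with the instruction count in the godbolt output.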

The extractvalue cost using mul is a little unfortunate, but that should probably be fixed separately if needed. There are also smull and umull, which can handle 2 x i64 muls, but it looks like isWideningInstruction does not handle them properly yet (and is always 0 by the look of it). Again, they can be fixed by detecting the extends if needed.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D92208/new/

https://reviews.llvm.org/D92208