[PATCH] D103629: [AArch64] Cost-model i8 vector loads/stores

Fri Jul 2 09:00:03 PDT 2021

dmgreen added inline comments.

================
Comment at: llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp:1407

-  if (useNeonVector(Ty) &&
+  bool IsTruncOrExt = VT != LT.second;
+  if (IsTruncOrExt && useNeonVector(Ty) &&
----------------
SjoerdMeijer wrote:
> dmgreen wrote:
> > I think you are adding too many types. It's not a truncating store every time that VT != LT.second. It just means that it is type legalized to a different type.
> > A v32i8 is just 2 v16i8 loads, so should give LT={2, MVT::v16i8} and should get a cost of 2 (which is LT.first). Any larger type follows the same pattern of splitting until it is a legal type. So v64i8 would be LT={4, MVT::v16i8}, so cost=4.
> > v16i8 and v8i8 are legal, so get {1, MVT::v16i8} or {1, MVT::v8i8} and cost 1 (which is LT.first again).
> > For v4i8 the LT.second will be v4i16 and LT.first will be 1 still (I think).
> > 
> > I was expecting it to just be <4 x i8> costs, which are now 2 (or 3 if we want to worry about big endian costs, which we don't usually do AFAIU). One for the f32 load and one for the sshll (or xtn+f32 store). The cost of any other truncation would be measured from the trunc instruction.
> Okay, that's fair enough.
> 
> I don't like this `ProfitableNumElements` business here, and would like to see that go.
> It's wrong for the loads, so will have to adjust that. And when I do that, I don't want to add a sort of special case for the loads and keep the `ProfitableNumElements` for the stores and that logic for the `getNumElements() < ProfitableNumElements` business. So that explains that a little bit.
> 
> I think we have 2 cases:
> - A type is legal, or a type is legal and is split up. In both cases, like you explained the cost is just `LT.first`.
> - A type is truncated or extended. 
> 
> For an extend of `<4 x i8>` the cost is 2 if it's extended to i16 (1 load, 1 sshll), and the cost is 3 if it is extended to i32 (1 load, 2 sshll). So the cost is not 2 in all cases. And while we are at I added costs for some smaller and larger vectors.
> 
> The check to see if it's a trunk/extend is indeed wrong. I will need to look but I am guessing that checking if `getScalarSizeInBits()` is the same for `VT` or `LT.second` or something along those line will cover that. This allows us to do a lookup for a trunc/extend (case 2), or return `LT.second`(case 1).
I still think there are more types here than should be needed (or possible to reach).

The truncation tests you point to are not very relevant here. We are just dealing with the cost of a load. As in these tests: https://godbolt.org/z/oK58Gcez5
(If you want to add costs for a combines store(trunc) or ext(load) then that is a different issue, that needs to detect the trunc/ext instruction, like we do in MVE. I don't think it's needed here though).

For i8 vector types that are a different size when legalized, we are either talking about v2i8, v3i8 or v4i8. I'm pretty sure anything larger will be legalized to a v8i8 or larger, so still a i8. v4i8 is the one you added better lowering for recently, It will be legalized to a v4i16. v2i8 will legalize to a v2i32 and is still a bit messy.

So I think loads and stores are roughly the same cost, and most of the cases are not needed. The new code can probably be something like:
```
  if (useNeonVector(Ty) &&
      Ty->getScalarSizeInBits() != LT.second.getScalarSizeInBits()) {
    // v4i8 types are lowered to scalar load/store and sshl/xtn.
    if (VT == MVT::v4i8)
      return 2;
    // Otherwise we need to scalarize
    return cast<FixedVectorType>(Ty)->getNumElements() * 2;
  }
```

I think that may get v2i16 costs more correct too, which is a nice benefit :)

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D103629/new/

https://reviews.llvm.org/D103629