[PATCH] D100099: [X86][CostModel] Try to fix cost computation load/stores of non-power-of-two vectors

Thu Apr 15 09:47:48 PDT 2021

RKSimon added a comment.

In D100099#2691945 <https://reviews.llvm.org/D100099#2691945>, @lebedev.ri wrote:

> This is probably still impresice for small remainder sub-vectors.
> E.g. load cost for `<3 x float>` w/ 8 byte alignment should be 1: https://godbolt.org/z/r3ncvMvaf

That assumes 16-byte alignment though - for v3f32 the more likely (element) 4-byte alignment will be worse - not sure whether we have much difference alignment coverage (or whether its worth it).

The costs are a lot better than what they were, but there are a few cases that seem off - if I had to guess I'd say we're not completely getting the split types matching the legalized types?

================
Comment at: llvm/lib/Target/X86/X86TargetTransformInfo.cpp:3222-3230
+      SmallVector<unsigned, CHAR_BIT * sizeof(NumElem)> Factors;
+      for (unsigned Bit = 0; Bit != CHAR_BIT * sizeof(NumElem); ++Bit) {
+        unsigned Factor = unsigned(1) << Bit;
+        if (NumElem & Factor)
+          Factors.emplace_back(Factor);
+      }
+      assert(std::accumulate(Factors.begin(), Factors.end(), unsigned(0)) ==
----------------
lebedev.ri wrote:
> lebedev.ri wrote:
> > RKSimon wrote:
> > > ABataev wrote:
> > > > Why not just something like this:
> > > > ```
> > > > unsigned Factor = 0;
> > > > for (; NumElem > 0; NumElem -= Factor) {
> > > >   Factor = PowerOf2Floor(NumElem);
> > > >   .....
> > > > }
> > > > ```
> > > +1 Having Factor updated in the condition as well as being used in increment block is difficult to grok
> > Note that i have addressed @ABataev's comment, it was about earlier patch version:
> > https://reviews.llvm.org/D100099?id=336063#change-OQEVJvBxQbDZ
> ... or are you telling to move `Factor` computation from the condition?
If you can, but it looks like making the dependencies between NumElemLeft and Factor simpler won't be easy - I'm happy for you to keep it as.

================
Comment at: llvm/test/Analysis/CostModel/X86/load_store.ll:26
 ; SSE-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: store <3 x i64> undef, <3 x i64>* undef, align 4
-; SSE-NEXT:  Cost Model: Found an estimated cost of 13 for instruction: store <5 x i32> undef, <5 x i32>* undef, align 4
-; SSE-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: store <5 x i64> undef, <5 x i64>* undef, align 4
-; SSE-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: store <5 x i16> undef, <5 x i16>* undef, align 4
-; SSE-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: store <6 x i16> undef, <6 x i16>* undef, align 4
-; SSE-NEXT:  Cost Model: Found an estimated cost of 14 for instruction: store <7 x i16> undef, <7 x i16>* undef, align 4
-; SSE-NEXT:  Cost Model: Found an estimated cost of 22 for instruction: store <11 x i16> undef, <11 x i16>* undef, align 4
-; SSE-NEXT:  Cost Model: Found an estimated cost of 24 for instruction: store <12 x i16> undef, <12 x i16>* undef, align 4
-; SSE-NEXT:  Cost Model: Found an estimated cost of 26 for instruction: store <13 x i16> undef, <13 x i16>* undef, align 4
-; SSE-NEXT:  Cost Model: Found an estimated cost of 46 for instruction: store <23 x i16> undef, <23 x i16>* undef, align 4
-; SSE-NEXT:  Cost Model: Found an estimated cost of 48 for instruction: store <24 x i16> undef, <24 x i16>* undef, align 4
-; SSE-NEXT:  Cost Model: Found an estimated cost of 50 for instruction: store <25 x i16> undef, <25 x i16>* undef, align 4
-; SSE-NEXT:  Cost Model: Found an estimated cost of 94 for instruction: store <47 x i16> undef, <47 x i16>* undef, align 4
-; SSE-NEXT:  Cost Model: Found an estimated cost of 96 for instruction: store <48 x i16> undef, <48 x i16>* undef, align 4
-; SSE-NEXT:  Cost Model: Found an estimated cost of 98 for instruction: store <49 x i16> undef, <49 x i16>* undef, align 4
+; SSE-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: store <5 x i32> undef, <5 x i32>* undef, align 4
+; SSE-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: store <5 x i64> undef, <5 x i64>* undef, align 4
----------------
cost = 2 ? SSE max size is 128-bit vector - so no subvector extraction cost - <3 x double> seems to get it right?

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D100099/new/

https://reviews.llvm.org/D100099