[PATCH] D142359: [TTI][AArch64] Cost model vector INS instructions
Sjoerd Meijer via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Wed Feb 15 07:07:04 PST 2023
SjoerdMeijer added a comment.
In D142359#4083490 <https://reviews.llvm.org/D142359#4083490>, @dmgreen wrote:
> The internal embedded benchmarks I tried had some pretty wild swings in both direction. I think it is worth working towards this, if we can try and minimize the regressions in the process. Running more benchmarks from SPEC and perhaps the llvm-test-suite would be good (maybe try and see what is going on in salsa for example? We might be getting the costs of scalar rotates/funnel shifts incorrect?)
>
> There might be quite a few other cases. I can try and provide some examples if I can extract them.
I have tried SPEC2017 INT and SPEC FP, and the LLVM test-suite:
- As we already knew, there's only one change in SPEC INT, and that is x264, which is an uplift.
- In SPEC FP, there is one change, which is a minor regression in 510.parest. It's small, but it's definitely there.
- Nothing stands out in the llvm test-suite, and I don't see a regression in salsa.
For SPEC INT and FP, overall it is a tiny win, but I wouldn't object to calling it neutral. The uplift is 2.5% and the regression 1.5%.
So the way I look at this at the moment is that this is more of an enabler.
About the regression: I looked at the 2 hottest functions in 510.parest, which together account for 55% of the runtime, and I see this pattern repeated in different places in both functions:
Before:
54c16c: 2f00e400 movi d0, #0x0
54c170: 2f00e401 movi d1, #0x0
54c18c: 6d7f8c02 ldp d2, d3, [x0, #-8]
54c19c: fc637964 ldr d4, [x11, x3, lsl #3]
54c1a0: fc647965 ldr d5, [x11, x4, lsl #3]
54c1a4: 1f420080 fmadd d0, d4, d2, d0
54c1a8: 1f4304a1 fmadd d1, d5, d3, d1
After:
54e3c8: 6f00e400 movi v0.2d, #0x0
54e3e4: 3cc10601 ldr q1, [x16], #16
54e3f0: fc627962 ldr d2, [x11, x2, lsl #3]
54e3f4: 8b030d63 add x3, x11, x3, lsl #3
54e3f8: 4d408462 ld1 {v2.d}[1], [x3]
54e3fc: 4e61cc40 fmla v0.2d, v2.2d, v1.2d
I think this must be responsible for the minor regression. It is not terribly wrong, but I think it's just this high-latency LD1 variant that is not making this SLP-vectorised code any faster. This is funny, because we looked at the cost-modelling for this LD1 recently in D141602 <https://reviews.llvm.org/D141602>, but I had not spotted this LD1 here in 510.parest. I am curious to know why the SLP vectoriser thinks this is beneficial, so I will look into that.
In the meantime, I was curious if you had additional thoughts on this.
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D142359/new/
https://reviews.llvm.org/D142359
More information about the llvm-commits mailing list