[PATCH] D142359: [TTI][AArch64] Cost model vector INS instructions
Dave Green via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Thu Feb 16 05:10:11 PST 2023
dmgreen added a comment.
> I have tried SPEC2017 INT and SPEC FP, and the LLVM test-suite:
>
> - As we already knew, there's only 1 change in SPEC INT, and that is x264, which is an uplift;
> - In SPEC FP, there is 1 change, which is a minor regression in 510.parest. It's small, but it's definitely there;
> - Nothing stands out in the LLVM test-suite, and I don't see a regression in salsa.
>
> For SPEC INT and FP overall it is a tiny win, but I wouldn't object to calling it neutral: the uplift is 2.5% and the regression 1.5%.
> So the way I look at this at the moment is that this patch is more of an enabler.
>
> About the regression: I looked at the 2 hottest functions in 510.parest, which together account for 55% of the runtime, and I see this pattern repeated in different places in both functions:
>
> Before:
>
> 54c16c: 2f00e400 movi d0, #0x0
> 54c170: 2f00e401 movi d1, #0x0
> 54c18c: 6d7f8c02 ldp d2, d3, [x0, #-8]
> 54c19c: fc637964 ldr d4, [x11, x3, lsl #3]
> 54c1a0: fc647965 ldr d5, [x11, x4, lsl #3]
> 54c1a4: 1f420080 fmadd d0, d4, d2, d0
> 54c1a8: 1f4304a1 fmadd d1, d5, d3, d1
>
> After:
>
> 54e3c8: 6f00e400 movi v0.2d, #0x0
> 54e3e4: 3cc10601 ldr q1, [x16], #16
> 54e3f0: fc627962 ldr d2, [x11, x2, lsl #3]
> 54e3f4: 8b030d63 add x3, x11, x3, lsl #3
> 54e3f8: 4d408462 ld1 {v2.d}[1], [x3]
> 54e3fc: 4e61cc40 fmla v0.2d, v2.2d, v1.2d
>
> I think this must be responsible for the minor regression. It is not terribly wrong, but it seems to be this high-latency LD1 lane-insert variant that keeps this SLP-vectorised code from being faster. Funnily enough, we looked at the cost modelling for this LD1 recently in D141602 <https://reviews.llvm.org/D141602>, but I had not spotted this LD1 here in 510.parest. I am curious why the SLP vectoriser thinks this is beneficial, so I will look into that.
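>
> To give a feel for the source shape, this is roughly the kind of code that produces the pattern above: an indexed-load reduction, as in a sparse matrix-vector product. This is a sketch only, not the actual 510.parest source, and the names are illustrative:
>
>   // val[] is loaded contiguously (the ldp d2, d3); x[col[...]] are the
>   // indexed loads that become the ldr + ld1 lane insert once SLP turns
>   // the two fmadds into one fmla.
>   double dot(const double *val, const unsigned *col, const double *x,
>              int n) {
>     double sum0 = 0.0, sum1 = 0.0; // the two movi-zeroed accumulators
>     for (int i = 0; i < n; i += 2) {
>       sum0 += val[i] * x[col[i]];
>       sum1 += val[i + 1] * x[col[i + 1]];
>     }
>     return sum0 + sum1;
>   }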
>
> In the meantime, I was curious if you had additional thoughts on this.
I was expecting that to be similar to the other fmul+fadd vs fma issue we have seen elsewhere, but I'm not sure it is. Does it reduce the value to a single element again?
As for examples: this case is a little worse. I'm not sure whether it is using bad costs, but the SLP vectorisation seems to reach through a phi, and the result, when put through llc, is mostly just more instructions: https://godbolt.org/z/z6hnEcPG1.
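For reference, the shape I mean by "reaching through a phi" is roughly the following. This is a minimal sketch with made-up names; I have not reduced the godbolt case to confirm it is exactly this:

  // x and y are loop-carried phis; the SLP vectoriser can treat them as a
  // single <2 x double> phi and vectorise the two fmas in the body, at the
  // cost of lane inserts before the loop and extracts after it.
  double rec2(const double *a, int n) {
    double x = 1.0, y = 2.0;
    for (int i = 0; i < n; ++i) {
      x = x * a[i] + 1.0;
      y = y * a[i] + 2.0;
    }
    return x + y;
  }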
Another case, which is perhaps simpler, is this "distance" one from cmsisdsp: https://godbolt.org/z/1xz7GP3fM. It looks like something might be getting scalarised, but again I've not looked into the details.
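The cmsisdsp "distance" kernels are essentially sum-of-squared-differences reductions, so the shape should be roughly this (a sketch; assumed to match the godbolt case, not verified against it):

  // Euclidean distance, minus the final sqrtf: a subtract, a square and a
  // reduction per element. Partial scalarisation here means paying for
  // lane inserts/extracts around the vector accumulator.
  float dist2(const float *a, const float *b, int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; ++i) {
      float d = a[i] - b[i];
      acc += d * d;
    }
    return acc;
  }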
There are some nice improvements too, if we can hopefully get the regressions fixed.
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D142359/new/
https://reviews.llvm.org/D142359