[PATCH] D142359: [TTI][AArch64] Cost model vector INS instructions
Sjoerd Meijer via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Wed Feb 15 07:07:04 PST 2023
SjoerdMeijer added a comment.
In D142359#4083490 <https://reviews.llvm.org/D142359#4083490>, @dmgreen wrote:
> The internal embedded benchmarks I tried had some pretty wild swings in both direction. I think it is worth working towards this, if we can try and minimize the regressions in the process. Running more benchmarks from SPEC and perhaps the llvm-test-suite would be good (maybe try and see what is going on in salsa for example? We might be getting the costs of scalar rotates/funnel shifts incorrect?)
>
> There might be quite a few other cases. I can try and provide some examples if I can extract them.
I have tried SPEC2017 INT and SPEC FP, and the LLVM test-suite:
- As we already knew, there's only one change in SPEC INT, and that is x264, which is an uplift.
- In SPEC FP, there is one change, which is a minor regression in 510.parest. It's small, but it's definitely there.
- Nothing stands out in the llvm test-suite, and I don't see a regression in salsa.
For SPEC INT and FP, overall it is a tiny win, but I wouldn't object to calling it neutral. The uplift is 2.5% and the regression 1.5%.
So the way I look at this at the moment is that this is more of an enabler.
About the regression: I looked at the 2 hottest functions in 510.parest, which together account for 55% of the runtime, and I see this pattern repeated in different places in both functions:
Before:
54c16c: 2f00e400 movi d0, #0x0
54c170: 2f00e401 movi d1, #0x0
54c18c: 6d7f8c02 ldp d2, d3, [x0, #-8]
54c19c: fc637964 ldr d4, [x11, x3, lsl #3]
54c1a0: fc647965 ldr d5, [x11, x4, lsl #3]
54c1a4: 1f420080 fmadd d0, d4, d2, d0
54c1a8: 1f4304a1 fmadd d1, d5, d3, d1
After:
54e3c8: 6f00e400 movi v0.2d, #0x0
54e3e4: 3cc10601 ldr q1, [x16], #16
54e3f0: fc627962 ldr d2, [x11, x2, lsl #3]
54e3f4: 8b030d63 add x3, x11, x3, lsl #3
54e3f8: 4d408462 ld1 {v2.d}[1], [x3]
54e3fc: 4e61cc40 fmla v0.2d, v2.2d, v1.2d
I think this must be responsible for the minor regression. It is not terribly wrong, but I think it's just this high-latency LD1 variant that is not making this SLP-vectorised code any faster. This is funny, because we looked at the cost-modelling for this LD1 recently in D141602 <https://reviews.llvm.org/D141602>, but I had not spotted this LD1 here in 510.parest. I am curious to know why the SLP vectoriser thinks this is beneficial, so I will look into that.
In the meantime, I was curious if you had additional thoughts on this.
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D142359/new/
https://reviews.llvm.org/D142359
More information about the llvm-commits mailing list