[PATCH] D145578: [AArch64] Cost-model vector splat LD1Rs to avoid unprofitable SLP vectorisation

Dave Green via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Fri Mar 10 02:41:35 PST 2023


dmgreen added a reviewer: vporpo.
dmgreen added a comment.

Hello. I had to remind myself where this came from. It looks like it was introduced in D123638 <https://reviews.llvm.org/D123638>, and there were already some comments about the performance not always being ideal. It apparently helped for some <2 x double> vectorization. I'm not sure there is a perfect answer, but an effective cost of 2 for the throughput of the ld1r would seem to match the hardware better. This doesn't alter isLegalBroadcastLoad, and the tests added in D123638 <https://reviews.llvm.org/D123638> don't seem to change.

The cost of a broadcast is already 1, so the code here doesn't seem like it would do much any more. It could check the cost kind and return 0 for CodeSize and 1 for Throughput. Otherwise this bit of code could probably be removed.



================
Comment at: llvm/test/Transforms/SLPVectorizer/AArch64/slp-fma-loss.ll:86
 define void @slp_profitable_missing_fmf_on_fadd_fsub(ptr %A, ptr %B) {
 ; CHECK-LABEL: @slp_profitable_missing_fmf_on_fadd_fsub(
 ; CHECK-NEXT:    [[GEP_B_1:%.*]] = getelementptr inbounds float, ptr [[B:%.*]], i64 1
----------------
SjoerdMeijer wrote:
> ABataev wrote:
> > SjoerdMeijer wrote:
> > > SjoerdMeijer wrote:
> > > > Ah, only after uploading this diff did I notice that the function names indicate that this should be profitable... I had missed that.
> > > > 
> > > > Hmmm.... I guess that then needs looking into.
> > > Eyeballing this, my first reaction is that I slightly doubt that SLP will be profitable, but I guess that's what I need to find out.
> > Working on fma vectorization support in SLP, hope to spend more time on this later this month.
> It's difficult to see how the SLP variant is ever going to be faster for this example, with just a handful of scalar instructions (ignoring the loads/stores) that mostly overlap in dispatch and execution:
> 
>   [0,3]     D======eeeER   ..   fmul      s4, s1, s0
>   [0,4]     D======eeeER   ..   fmul      s1, s2, s1
>   [0,5]     D=======eeeER  ..   fmul      s5, s3, s2
>   [0,6]     D=======eeeER  ..   fmul      s0, s3, s0
>   [0,7]     D==========eeER..   fsub      s2, s4, s5
>   [0,8]     D==========eeER..   fadd      s0, s0, s1
> 
> especially if we have to do things like shuffles:
> 
>   [0,2]     D==eeeeeeeeER  .    .    .    ..   ld1r.2s    { v1 }, [x8], #4
>   [0,3]     D==========eeeeeeER .    .    ..   ldr        d2, [x8]
>   [0,4]     D================eeeER   .    ..   fmul.2s    v1, v1, v2
>   [0,5]     D================eeeER   .    ..   fmul.2s    v0, v2, v0[0]
>   [0,6]     D===================eeER .    ..   rev64.2s   v1, v1
>   [0,7]     D=====================eeER    ..   fsub.2s    v3, v0, v1
>   [0,8]     .D====================eeER    ..   fadd.2s    v0, v0, v1
>   [0,9]     .D======================eeER  ..   mov.s      v3[1], v0[1]
>   [0,10]    .D========================eeER..   str        d3, [x0]
>   [0,11]    .D========================eeeeER   st1.s      { v2 }[1], [x1]
> 
> Here I am showing some loads/stores, but that's just to show they are no longer simple loads/stores but higher-latency instructions, and perhaps more importantly we have got the REV and the extract; with FMAs things might look a bit better, but it's difficult to beat the scalar variant. The SLP timeline is a little bit skewed by the post-inc and by the result being available a lot earlier, but the bottom line is that there is very little parallelism here as we are working on just 2 floats, and there's the overhead of the vectorisation.
> 
> I have run some micro-benchmarks, and I've measured that the SLP variant is indeed slower.
> 
> @fhahn , @dmgreen : I think it makes sense to also not SLP vectorise this function (and the other 2 below). Do you agree?
I think we will fold the dup into the fmul as opposed to the load now, which seems a little cheaper. https://godbolt.org/z/4aeab8Pas
I'm not sure if that makes it cheaper overall though. I agree that the rev and the mov make the timing tight. From the look of the test, "profitable" here just means that the non-fast version would need slightly more instructions, not that vectorization was known to be profitable. i.e. the test is for testing fma combining; this was just a negative test for that issue.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D145578/new/

https://reviews.llvm.org/D145578
