[PATCH] D145578: [AArch64] Cost-model vector splat LD1Rs to avoid unprofitable SLP vectorisation

Sjoerd Meijer via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Wed Mar 8 07:16:40 PST 2023


SjoerdMeijer added inline comments.


================
Comment at: llvm/test/Transforms/SLPVectorizer/AArch64/slp-fma-loss.ll:86
 define void @slp_profitable_missing_fmf_on_fadd_fsub(ptr %A, ptr %B) {
 ; CHECK-LABEL: @slp_profitable_missing_fmf_on_fadd_fsub(
 ; CHECK-NEXT:    [[GEP_B_1:%.*]] = getelementptr inbounds float, ptr [[B:%.*]], i64 1
----------------
ABataev wrote:
> SjoerdMeijer wrote:
> > SjoerdMeijer wrote:
> > > Ah, only after uploading this diff did I notice that the function names indicate that this should be profitable... I had missed that.
> > > 
> > > Hmmm.... I guess that then needs looking into.
> > Eyeballing this, my first reaction is that I slightly doubt that SLP will be profitable, but I guess that's what I need to find out.
> Working on fma vectorization support in SLP, hope to spend more time on this later this month.
It's difficult to see how the SLP variant is ever going to be faster for this example: there are just a handful of scalar instructions (ignoring the loads/stores), and they mostly overlap in dispatch and execution:

  [0,3]     D======eeeER   ..   fmul      s4, s1, s0
  [0,4]     D======eeeER   ..   fmul      s1, s2, s1
  [0,5]     D=======eeeER  ..   fmul      s5, s3, s2
  [0,6]     D=======eeeER  ..   fmul      s0, s3, s0
  [0,7]     D==========eeER..   fsub      s2, s4, s5
  [0,8]     D==========eeER..   fadd      s0, s0, s1

especially if we have to do things like shuffles:

  [0,2]     D==eeeeeeeeER  .    .    .    ..   ld1r.2s    { v1 }, [x8], #4
  [0,3]     D==========eeeeeeER .    .    ..   ldr        d2, [x8]
  [0,4]     D================eeeER   .    ..   fmul.2s    v1, v1, v2
  [0,5]     D================eeeER   .    ..   fmul.2s    v0, v2, v0[0]
  [0,6]     D===================eeER .    ..   rev64.2s   v1, v1
  [0,7]     D=====================eeER    ..   fsub.2s    v3, v0, v1
  [0,8]     .D====================eeER    ..   fadd.2s    v0, v0, v1
  [0,9]     .D======================eeER  ..   mov.s      v3[1], v0[1]
  [0,10]    .D========================eeER..   str        d3, [x0]
  [0,11]    .D========================eeeeER   st1.s      { v2 }[1], [x1]

I am showing some of the loads/stores here, but only to illustrate that they are no longer simple loads/stores but higher-latency instructions, and, perhaps more importantly, that we have gained the REV and the lane extract. With FMAs things might look a bit better, but it is still difficult to beat the scalar variant. The SLP timeline is also a little skewed because of the post-increment and because the result becomes available a lot earlier, but the bottom line is that there is very little parallelism here: we are working on just 2 floats, and the vectorisation has overhead.

I have run some micro-benchmarks, and I've measured that the SLP variant is indeed slower.

@fhahn , @dmgreen : I think it makes sense to also not SLP-vectorise this function (and the other 2 below). Do you agree?


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D145578/new/

https://reviews.llvm.org/D145578
