[PATCH] D142656: [SVE][codegen] Add pattern for SVE multiply-add accumulate

Sjoerd Meijer via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Fri Feb 3 15:06:04 PST 2023


SjoerdMeijer added a comment.

There is no precedent for:

- matching opcodes like that (the latest revision adds the splat_vector opcode check),
- checking the uses of SDNodes like that.

And the complexity check and its interaction isn't a good sign either. I don't think we are converging to a better implementation yet, so I believe this is not the right place for this. I still think this is a job for the machine combiner; that is where all the precedent for doing these kinds of things lives. Previously I linked to some scalar combines, but here are all the vector opcodes:

https://github.com/llvm/llvm-project/blob/5bc4e1cd3be8844c1e7947647020710016925e9e/llvm/lib/Target/AArch64/AArch64InstrInfo.cpp#L5032
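To make the suggested direction concrete, here is a toy Python model of what a MUL+ADD -> MLA rewrite does. This is not LLVM code and the helper name is hypothetical; the real implementation would go through the machine-combiner hooks in AArch64InstrInfo (linked above), which also check latency and register pressure before committing a rewrite.

```python
# Toy peephole model of fusing a multiply feeding an add into one
# multiply-accumulate. Instructions are tuples:
#   ("MUL", dst, a, b), ("ADD", dst, a, b), ("MLA", dst, acc, a, b)
# where MLA computes dst = acc + a * b.
def combine_mul_add(insts):
    """Fuse a MUL whose result is consumed by the following ADD into an MLA."""
    out = []
    i = 0
    while i < len(insts):
        cur = insts[i]
        nxt = insts[i + 1] if i + 1 < len(insts) else None
        if (cur[0] == "MUL" and nxt is not None and nxt[0] == "ADD"
                and cur[1] in nxt[2:]):  # the ADD consumes the MUL result
            # The other ADD operand becomes the accumulator.
            acc = nxt[2] if nxt[3] == cur[1] else nxt[3]
            out.append(("MLA", nxt[1], acc, cur[2], cur[3]))
            i += 2
        else:
            out.append(cur)
            i += 1
    return out
```

Note that even this toy version has to pick out which ADD operand is the accumulator; the real combiner additionally has to verify the MUL result has no other uses, which is exactly the use-checking this patch currently does at the SDNode level.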

For test-case `muladd_i8_negativeAddend`, this is the relevant MIR snippet:

  %3:zpr = MUL_ZPZZ_UNDEF_B killed %2:ppr_3b, %0:zpr, %1:zpr
  %4:zpr = ADD_ZI_B %3:zpr(tied-def 0), 241, 0

And this looks like a fairly straightforward combine to me, and again, all the precedent for doing this is in the machine combiner. However, the benefit isn't clear to me for this example, because this is the assembly before this patch:

  mul     z0.b, p0/m, z0.b, z1.b
  add     z0.b, z0.b, #241                // =0xf1

And when we want to generate the mad, we need to splat the immediate into a register first, so we end up with something like this:

  mov 
  mad

So there's no real gain here, if I am not mistaken? That makes me wonder: for which cases do we expect, and actually see, a gain? Do we have performance data for some benchmarks?
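The two sequences are at least semantically interchangeable, which is easy to sanity-check. Below is a small Python sketch (hypothetical helper names, not LLVM code) modelling the i8 lanes: `mul` + `add #241` versus splatting the immediate and doing `mad`. Note that 241 is 0xf1, i.e. -15 under i8 wraparound, so both compute the same lane values; the point is that each sequence is two instructions, hence no obvious win.

```python
def mul_add_imm(a, b, imm):
    # Models: mul z0, p/m, z0, z1 ; add z0, z0, #imm  (i8 lanes wrap mod 256)
    return [(x * y + imm) & 0xFF for x, y in zip(a, b)]

def mad_with_splat(a, b, imm):
    # Models: mov z2.b, #imm (splat) ; mad z0, p/m, z1, z2  ->  z0 = z0*z1 + z2
    splat = [imm] * len(a)  # the extra mov that the mad form needs
    return [(x * y + s) & 0xFF for x, y, s in zip(a, b, splat)]
```

Both paths cost two instructions for this example, which is what makes the benefit of preferring mad unclear here.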
And for most other cases in `sve-multiply-add-accumulate.ll`, an mla was generated which now turns into a mad, but similarly the gain is unclear. By the way, it would be great if you could locally rebase this patch on the "precommit" test cases in D142998 <https://reviews.llvm.org/D142998>. Then the diff would only show the changes in the tests, similar to what I showed in my previous comment, e.g.:

    @@ -45,10 +45,11 @@ define <vscale x 8 x i16> @muladd_i16_test1(<vscale x 8 x i16> %a, <vscale x 8 x
   define <vscale x 8 x i16> @muladd_i16_test2(<vscale x 8 x i16> %a, <vscale x 8 x i16> %b)
   ; CHECK-LABEL: muladd_i16_test2:
   ; CHECK:       // %bb.0:
  +; CHECK-NEXT:    mov w8, #200
  +; CHECK-NEXT:    mov z2.d, z0.d
   ; CHECK-NEXT:    ptrue p0.h
  -; CHECK-NEXT:    movprfx z2, z0
  -; CHECK-NEXT:    mul z2.h, p0/m, z2.h, z1.h
  -; CHECK-NEXT:    add z2.h, z2.h, #200 // =0xc8
  +; CHECK-NEXT:    mov z3.h, w8
  +; CHECK-NEXT:    mad z2.h, p0/m, z1.h, z3.h
   ; CHECK-NEXT:    mul z0.h, p0/m, z0.h, z2.h
   ; CHECK-NEXT:    sub z0.h, z0.h, z1.h
   ; CHECK-NEXT:    ret

But this is a bit of an aside; at this point it would be great if I could get a second opinion on the direction, maybe @dmgreen, @david-arm?


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D142656/new/

https://reviews.llvm.org/D142656
