[PATCH] D148068: [AArch64] Lower fused complex multiply-add intrinsic to AArch64::FCMA

Sun Apr 16 21:03:37 PDT 2023

nujaa added a comment.

> I'd be interested to see how the performance of this differs from what the ComplexDeinterleavingPass emits, or if the patterns aren't recognised by the pass, why that might be.

Hi, I realised that your patch was not yet upstreamed when I created these changes, which explains the schism. Also, the patterns would not be recognised anyway, because MLIR does not support vectors of complex and our improvised lowering, which works for our use case, does not generate them as shuffles + computation ops, although that form might be preferable.
So, the next steps are: I'll try generating complex operations as shuffles + computation ops, as your implementation suggests, and let you know how they perform.
To validate your implementation for my solution, I will also need to implement conjugate fusing and commutativity (as rotation only affects one operand, and I need to be able to conjugate both operands to reach my own performance targets). Eventually we could have something like:

  define <4 x float> @complex_mul_v4f32(<4 x float> %a, <4 x float> %b) {
  ; CHECK-LABEL: complex_mul_v4f32:
  ; CHECK:       // %bb.0: // %entry
  ; CHECK-NEXT:    movi v2.2d, #0000000000000000
  ; CHECK-NEXT:    fcmla v2.4s, v0.4s, v1.4s, #0
  ; CHECK-NEXT:    fcmla v2.4s, v0.4s, v1.4s, #270
  ; CHECK-NEXT:    mov v0.16b, v2.16b
  ; CHECK-NEXT:    ret
  entry:
    %a.real = shufflevector <4 x float> %a, <4 x float> poison, <2 x i32> <i32 0, i32 2>
    %a.imag = shufflevector <4 x float> %a, <4 x float> poison, <2 x i32> <i32 1, i32 3>
    %b.real = shufflevector <4 x float> %b, <4 x float> poison, <2 x i32> <i32 0, i32 2>
    %b.imag = shufflevector <4 x float> %b, <4 x float> poison, <2 x i32> <i32 1, i32 3>
    %a.conj = fneg <2 x float> %a.imag
    %0 = fmul fast <2 x float> %b.imag, %a.real
    %1 = fmul fast <2 x float> %b.real, %a.conj
    %2 = fadd fast <2 x float> %1, %0
    %3 = fmul fast <2 x float> %b.real, %a.real
    %4 = fmul fast <2 x float> %a.conj, %b.imag
    %5 = fsub fast <2 x float> %3, %4
    %interleaved.vec = shufflevector <2 x float> %5, <2 x float> %2, <4 x i32> <i32 0, i32 2, i32 1, i32 3>
    ret <4 x float> %interleaved.vec
  }
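
As a sanity check on the CHECK lines above, here is a minimal scalar sketch of what one FCMLA lane pair computes for each rotation, assuming the rotation semantics described in the Arm Architecture Reference Manual. The helper names `fcmla` and `fcmla_pair` are hypothetical and are not part of LLVM or of this patch; the sketch only illustrates why rotations #0 and #270 together yield conj(a) x b.

```python
# Scalar reference model of one FCMLA lane pair (real, imag).
# Assumption: rotation semantics follow the Arm ARM description of
# `fcmla vd, vn, vm, #rot`. Function names are hypothetical.

def fcmla(acc, n, m, rot):
    """Accumulate one rotated partial product of n and m into acc."""
    if rot == 0:
        return complex(acc.real + n.real * m.real, acc.imag + n.real * m.imag)
    if rot == 90:
        return complex(acc.real - n.imag * m.imag, acc.imag + n.imag * m.real)
    if rot == 180:
        return complex(acc.real - n.real * m.real, acc.imag - n.real * m.imag)
    if rot == 270:
        return complex(acc.real + n.imag * m.imag, acc.imag - n.imag * m.real)
    raise ValueError(f"unsupported rotation: {rot}")


def fcmla_pair(n, m, rot_a, rot_b):
    """Model a zeroed accumulator (movi) followed by two FCMLAs."""
    acc = complex(0.0, 0.0)
    acc = fcmla(acc, n, m, rot_a)
    return fcmla(acc, n, m, rot_b)


a = complex(1.5, -2.0)
b = complex(0.25, 4.0)

# Plain product: rotations #0 and #90.
assert fcmla_pair(a, b, 0, 90) == a * b
# Conjugated first operand: rotations #0 and #270, as in the CHECK lines.
assert fcmla_pair(a, b, 0, 270) == a.conjugate() * b
```

The test values are chosen so that every intermediate product is exactly representable, which is why plain `==` is safe here; with arbitrary inputs a tolerance-based comparison would be more appropriate.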

One thing worries me: the explosion of use cases. Please tell me if I am missing something or if I'm wrong, but at the LLVM level there is a use case

- for each combination of sub/add (4 cases) (example is mul_mul_with_fneg in `complex-deinterleaving-uniform-cases.ll`)
- for negated operands, similar to the previous example but represented differently (2 cases), e.g. `neg(a) x b => fcmla a, b, #180; fcmla a, b, #270` [+ potentially neg(a) x neg(b) => a x b]
- for conjugated operands (2 cases) (example above) [+ potentially conj(a) x conj(b) => conj(a x b)]

This leads us to 16 cases (4 x 2 x 2), multiplied by 2 if we handle commutativity. Maybe generating an intermediate, target-specific complex multiplication ISD node would help us hide the combinations of sub/adds.
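
The count above can be reproduced with a short enumeration. This is only a sketch: the labels are hypothetical placeholders for the three independent choices in the list (sub/add combination, negation, conjugation) plus operand order, and they do not correspond to any names in the pass.

```python
from itertools import product

# Hypothetical labels for the independent choices listed above.
add_sub_forms = ["add/add", "add/sub", "sub/add", "sub/sub"]  # 4 cases
negation = ["a", "neg(a)"]                                    # 2 cases
conjugation = ["a", "conj(a)"]                                # 2 cases
operand_order = ["a x b", "b x a"]                            # commutativity

# Every combination of sub/add form, negation and conjugation.
base_cases = list(product(add_sub_forms, negation, conjugation))
# The same combinations, doubled by the two possible operand orders.
all_cases = list(product(base_cases, operand_order))

print(len(base_cases))  # 16
print(len(all_cases))   # 32
```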

At the asm level, recognising complex multiplication and fusing other operations becomes quite cumbersome, because of the combinations of rotations and the need to recognise common operands between vcmlas, so we might want to avoid pattern matching there.
What do you think?

Out of curiosity, where do you generate your shuffles from?

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D148068/new/

https://reviews.llvm.org/D148068