[PATCH] D114174: [ARM][CodeGen] Add support for complex addition and multiplication

Tue Jun 28 10:14:06 PDT 2022

dmgreen added a comment.

In D114174#3601263 <https://reviews.llvm.org/D114174#3601263>, @NickGuy wrote:

> In D114174#3601236 <https://reviews.llvm.org/D114174#3601236>, @dmgreen wrote:
>
>> What happened to the ideas of starting from a shuffle and working through a worklist of pairs of [Real, Imaginary] Values that we match to complex nodes?
>
> While that would work "simply" enough for cases like `a * b`, more elaborate cases (e.g. `(a * b) * c`) would result in some ambiguity as to which `add(mul, mul)` pairs with which `sub(mul, mul)`. This complexity inflates further the more operands are involved. The approach I've implemented here considers each complex node in isolation (with the exception of the accumulator value).

Hmm, OK. I was thinking about a three way multiply when looking at that way of structuring the pass, but only one pattern and only ever on paper. I was worried that there were cases where it was ambiguous, but figured if it was it could always just try both possibilities. But I've not implemented it, it just sounded like an elegant way of treating this like slightly more complex pattern matcher, as opposed to all this findUnmatchedInstructions and looking through uses.

A broad overview of the algorithm would be useful, like Momchil mentioned, perhaps in the file description. I was trying to make a sketch before but got a bit lost in nested loops.

>> I presume the == 128 can be removed if we teach it how to split the vectors up?I presume the == 128 can be removed if we teach it how to split the vectors up?
>
> Yep, vector splitting is something I decided to push out of this initial patch, and will be implemented in a follow-up. (Due to each node being treated in isolation, the vector splitting from previous iterations got overly complicated and unwieldy). The ideal solution that I can see would be to teach the intrinsic how to split, rather than the pass (somewhere like `DAGTypeLegalizer::SplitVectorResult`).

A separate addition sounds good. Lets try and get something that we happy with and extend it out.

>> Is this cost necessary at the moment, or will it always be profitable for the simple cases?
>
> Complex add has a few cases where I've observed performance regressions, and that's the sort of thing this rudimentary cost-modelling is intended to catch.

Do you have a test-case where this happens? Cost modelling will probably be good to add in the long run, I'd just like to understand where it currently can make things worse.

================
Comment at: llvm/test/CodeGen/ARM/ComplexArithmetic/complex-arithmetic-f32-add.ll:99
+  %b.imag = shufflevector <8 x float> %b, <8 x float> zeroinitializer, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
+  %0 = fsub fast <4 x float> %b.real, %a.imag
+  %1 = fadd fast <4 x float> %b.imag, %a.real
----------------
chill wrote:
> chill wrote:
> > Shouldn't these be translated  to a couple of `vcadd.f32` instructions, like in the previous test?
> > And this amount of move instructions seems excessive.
> > And this amount of move instructions seems excessive.
> I guess MVE does not have sensible swizzling instructions.
> 
Yeah this is expected from shuffles that MVE cannot handle very well. It would look a lot better either with loads that could become interleaving loads, or under AArch64 where better shuffles are available.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D114174/new/

https://reviews.llvm.org/D114174