[PATCH] D114174: [ARM][CodeGen] Add support for complex addition and multiplication

Dave Green via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Wed Feb 9 13:02:52 PST 2022


dmgreen added a comment.

In D114174#3307400 <https://reviews.llvm.org/D114174#3307400>, @NickGuy wrote:

> In D114174#3307302 <https://reviews.llvm.org/D114174#3307302>, @dnsampaio wrote:
>
>> ...
>> I'm not exactly sure why to do target specific pattern matching here. We could simply add generic complex intrinsics and the different patterns could be matched at each archs ISEL, no? I do agree that it should check in the backend if it should generate a complex operation of a given vector type.
>> ...
>> Is there any strong reason to create target specific intrinsics instead of having generic intrinsics with polymorphism? Such as llvm.complex.add.v2f32?
>
> I wanted to avoid adding target-independent intrinsics, as doing so might require substantial upstream consensus (Though this is being worked on in D119287 <https://reviews.llvm.org/D119287>). My thinking was that we can work on targeting our own architectures first, and then match incoming target-independent intrinsics after they are implemented. Additionally, IR intrinsics for Arm and AArch64 are already implemented, further reducing the amount of work required to enable initial support.

My understanding was that this had very little to do with upstream consensus and more to do with generating optimal code. I probably didn't say this clearly enough, but as far as I see it, this pass shouldn't be matching "complex multiply". It should be matching "partial multiply with rotate".

AArch64 (under certain architecture options) has an instruction that is called fcmla. (Arm MVE has similar instructions, which this pass is targeting at the moment). It takes a vector and a "rotate" that can be 0, 90, 180 or 270. It performs something like `(d0=d0+s0*t0, d1=d1+s0*t1)` for the odd/even lanes with a rotate of #0. Or `(d0=d0-s1*t1, d1=d1+s1*t0)` with a rotate of #90. You can combine two fcmla with rotations of 0 and 90 to produce a single complex multiply, but they can also be combined in all kinds of other weird and wonderful ways. There is no single "complex multiply" instruction for AArch64/MVE, and if you limit the optimizer to just acting on complex multiply, you are always going to be leaving performance on the table from all the other patterns that could be selected.

>> These differences can't be detected at isel?
>
> Unless I'm missing something, there aren't any differences to be detected by isel. The distinction only applies when generating the relevant IR intrinsic, whereas isel is responsible for substituting the IR intrinsic with the instruction(s) (in my understanding).

This was written pre-ISel so that it could be shared between AArch64 and Arm MVE. It also makes looking at chunks of related instructions at the same time easier, in the same way as the MVE lane interleaving pass does (you look up to sources and down to sinks and whatnot).

>> In a more global view, how do you plan to generate such input patterns detected here?
>
> The patterns I've been focusing on are those emitted after loop vectorisation (which incidentally are also those emitted using std::complex with `-ffast-math`), because the MVE and Neon complex number instructions are vector instructions. Scalar patterns (like those in your linked snippet) are planned, but will require a bit more work to implement properly.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D114174/new/

https://reviews.llvm.org/D114174
