[PATCH] D114174: [ARM][CodeGen] Add support for complex addition and multiplication

Sun Mar 27 10:12:09 PDT 2022

dmgreen added a comment.

OK - we can keep this going - we know it should give some decent performance. We can see where we go from there if something else materializes that can give the same gains.

At a high level - what this pass is doing either seems to be too complicated or too simple. What it effectively seems to do is pattern match `shuffle(fadd/fmul/etc(shuffle, shuffle))` is a way that matches one of several patterns, and generates target intrinsics from that. If that is all that is needed then the "Graph" that gets created doesn't seem necessary. It only ever stores a single node, so isn't much of a graph. It is being more complicated than it needs to be.

But I imagine in practice that a lot of math routines will do more than a single complex multiply. There will be multiple parts, this should be building a graph that is made up of several nodes bit at a time and generating code from it only if it can create a full graph that looks profitable to generate.

There are a number of other comments below.

================
Comment at: llvm/include/llvm/CodeGen/ComplexArithmetic.h:1
+//===- ComplexArithmetic.h - Complex Arithmetic Pass --------*- C++ -*-===//
+//
----------------
ComplexArithmetic is a bit of a general name for what this pass is trying to achieve. What about something like ComplexDeinterleavingPass? It better explains that the pass is performing the deinterleaving from shuffles.

================
Comment at: llvm/include/llvm/CodeGen/ComplexArithmetic.h:20
+
+#ifndef ENUM_FLAG_DEF
+#define ENUM_FLAG_DEF(Type)                                                    \
----------------
Is this needed? Its strange to a macro like this and I'm not sure it's very useful for the places it is used.

================
Comment at: llvm/include/llvm/CodeGen/ComplexArithmetic.h:59
+
+ENUM_FLAG_DEF(ComplexArithmeticOperation)
+
----------------
It doesn't seem that it would be necessary to combine ComplexArithmeticOperation operations together. It doesn't make much sense to have something that is both an Add _and_ a Partial Mul.

================
Comment at: llvm/include/llvm/CodeGen/ComplexArithmetic.h:61
+
+struct ComplexArithmeticData {
+public:
----------------
Is this just needed to pass data between the pass and the TLI? It's probably simpler (and more like the existing methods) to just pass the parameters individually.

================
Comment at: llvm/include/llvm/CodeGen/Passes.h:85

+  //===----------------------------------------------------------------------===//
+  //
----------------
I think this line can be removed. The other comments here are /// doxygen comments.

================
Comment at: llvm/include/llvm/CodeGen/TargetLowering.h:2963

+  virtual bool supportsComplexArithmetic() const {
+    return false;
----------------
This isn't needed if the target adds the pass itself. There will probably need to be a way to specify _which_ patterns the target supports for a given type. For example MVE has both integer and floating point complex operations. If the subtarget has only MVE (not MVE.fp), then it needs to support the integer complex operations, without supporting floating point. Other differences could exist like one architecture supporting a different subset of operations.

================
Comment at: llvm/include/llvm/CodeGen/TargetLowering.h:2969
+                                           Value *InputA, Value *InputB,
+                                           int &GeneratedIntrinsicCount) const {
+    return nullptr;
----------------
Is GeneratedIntrinsicCount just for statistics? If so it doesn't seem to be worth complicating the interface for.

================
Comment at: llvm/lib/CodeGen/ComplexArithmetic.cpp:40
+static ArrayRef<int> createContiguousMask(int len, int offset = 0) {
+  int *Arr = new int[len];
+  for (int j = 0; j < len; j++)
----------------
This leaks memory. Same for the ones below. They can probably return a SmallVector<int>.

================
Comment at: llvm/lib/CodeGen/ComplexArithmetic.cpp:107
+
+    Real = 1,
+    Imaginary = 2,
----------------
Why are Real and Imaginary needed?

================
Comment at: llvm/lib/CodeGen/ComplexArithmetic.cpp:110
+    Load = 4,
+    Store = 8,
+    Shuffle = 16,
----------------
Loads and Stores don't appear in the graph.

================
Comment at: llvm/lib/CodeGen/ComplexArithmetic.cpp:112
+    Shuffle = 16,
+    AddOperand = 32,
+    Input = 64,
----------------
Why is AddOperand needed? What does it mean?

================
Comment at: llvm/lib/CodeGen/ComplexArithmetic.cpp:113-114
+    AddOperand = 32,
+    Input = 64,
+    Preserve = 128,
+
----------------
How is Preserve different to Input?

================
Comment at: llvm/lib/CodeGen/ComplexArithmetic.cpp:116
+
+    // Meta node types, defining additional behaviour upon node creation
+
----------------
What does this refer to?

================
Comment at: llvm/lib/CodeGen/ComplexArithmetic.cpp:119-120
+    /**
+     * Will cause the node to look at parents to try and identify the type.
+     * Parents must already be registered and identified.
+     */
----------------
What does this mean and refer to?

================
Comment at: llvm/lib/CodeGen/ComplexArithmetic.cpp:321
+
+static bool HasBeenDisabled = false;
+bool ComplexArithmetic::runOnFunction(Function &F) {
----------------
This looks like debugging code that should be removed. There is already an option to disable it.

================
Comment at: llvm/lib/CodeGen/ComplexArithmetic.cpp:660-661
+
+    //    if (TLI->matchComplexArithmeticIR(SVI, G))
+    //      return true;
+  }
----------------
Remove commented out code.

================
Comment at: llvm/lib/CodeGen/ComplexArithmetic.cpp:703
+
+  const unsigned MaxVectorWidth = 128;
+
----------------
This shouldn't be something that is hard-coded in the pass. It might make more sense to push the splitting up of larger types to the target. There are cases like fixed-length SVE where the width won't be limited to 128bit vectors, and pushing that out to the target will be more general.

================
Comment at: llvm/lib/CodeGen/ComplexArithmetic.cpp:804
+    }
+    if (Changed) {
+      LLVM_DEBUG(dbgs() << "Trying to substitute graph in block: \n";
----------------
This logic is all a bit strange. Why have Changed?
```
  for (auto &I : *B) {
    if (auto *SVI = dyn_cast<ShuffleVectorInst>(&I)) {
      if (isInterleaving(SVI))
        if (traverseAndPopulateGraph(TLI, SVI, Graph))
          Substituted |= substituteGraph(TLI, &I, Graph, DeadInsts);
```

================
Comment at: llvm/lib/CodeGen/ComplexArithmetic.cpp:820
+
+  // TODO clean up the dead instructions better. (Ask in review?)
+  unsigned iter = 0;
----------------
I think this is just RecursivelyDeleteTriviallyDeadInstructions

================
Comment at: llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp:15
 #include "llvm/Analysis/LoopInfo.h"
+#include "llvm/CodeGen/ComplexArithmetic.h"
 #include "llvm/CodeGen/CostTable.h"
----------------
These are probably unneeded?

================
Comment at: llvm/test/CodeGen/ARM/ComplexArithmetic/complex-arithmetic-f16-add.ll:2
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
+; RUN: llc < %s -o - | FileCheck %s
+; RUN: llc < %s -o /dev/null -stats -stats-json 2>&1 | FileCheck %s --check-prefix=STATS
----------------
I would use run lines that that are similar to the llvm/test/CodeGen/Thumb2/mve-xyz.ll tests. Something like:
```
; RUN: llc -mtriple=thumbv8.1m.main-none-none-eabi -mattr=+mve.fp -verify-machineinstrs %s -o - | FileCheck %s
```
It's best not to use a specific cpu, using the arch instead.

The tests can probably go in the same place too, under llvm/test/CodeGen/Thumb2/mve-xyz.ll for mve tests.

================
Comment at: llvm/test/CodeGen/ARM/ComplexArithmetic/complex-arithmetic-f16-add.ll:5
+
+; STATS: "complex-arithmetic.NumComplexIntrinsics": 3
+
----------------
This probably isn't worth testing if it is testing codegen already.

================
Comment at: llvm/test/CodeGen/ARM/ComplexArithmetic/complex-arithmetic-f16-add.ll:33
+vector.body:                                      ; preds = %vector.body, %entry
+  %wide.vec = load <2 x half>, <2 x half>* %a, align 4
+  %strided.vec = shufflevector <2 x half> %wide.vec, <2 x half> zeroinitializer, <1 x i32> <i32 0>
----------------
A lot of these tests are strange - they seem to have infinite loops?

I think you should be able to remove most of it to make a much cleaner test. It doesn't even need the loads/stores:
```
define <8 x half> @complex_add_v8f16(<8 x half> %a, <8 x half> %b) {
entry:
  %strided.vec = shufflevector <8 x half> %a, <8 x half> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
  %strided.vec21 = shufflevector <8 x half> %a, <8 x half> poison, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
  %strided.vec23 = shufflevector <8 x half> %b, <8 x half> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
  %strided.vec24 = shufflevector <8 x half> %b, <8 x half> poison, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
  %0 = fsub fast <4 x half> %strided.vec23, %strided.vec21
  %1 = fadd fast <4 x half> %strided.vec24, %strided.vec
  %interleaved.vec = shufflevector <4 x half> %0, <4 x half> %1, <8 x i32> <i32 0, i32 4, i32 1, i32 5, i32 2, i32 6, i32 3, i32 7>
  ret <8 x half> %interleaved.vec
}

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D114174/new/

https://reviews.llvm.org/D114174