[llvm] [AMDGPU] Introduce iglp_opt(2): Generalized exp/mfma interleaving for select kernels (PR #81342)

Mon Feb 19 11:50:51 PST 2024

================
@@ -902,6 +904,921 @@ void MFMASmallGemmOpt::applyIGLPStrategy(
         SchedGroupMask::MFMA, 1, PipelineSyncID, DAG, TII);
     SG->initSchedGroup(SyncedInstrs[SG->getSyncID()]);
   }
+
+  return true;
+}
+
+class MFMAExpInterleaveOpt final : public IGLPStrategy {
+private:
+  SmallVector<SUnit *, 4> MFMAChainSeeds;
+  // Compute the heuristics for the pipeline, returning whether or not the DAG
+  // is well formatted for the mutation
+  bool analyzeDAG(const SIInstrInfo *TII);
+
+  /// Whether or not the instruction is a transitive predecessor of an MFMA
+  /// instruction
+  class IsPipeExp final : public InstructionRule {
+  public:
+    bool apply(const SUnit *SU, const ArrayRef<SUnit *> Collection,
+               SmallVectorImpl<SchedGroup> &SyncPipe) override {
+
+      auto DAG = SyncPipe[0].DAG;
+      auto TII = SyncPipe[0].TII;
+
+      if (Cache->empty()) {
+        auto I = DAG->SUnits.rbegin();
+        auto E = DAG->SUnits.rend();
+        for (; I != E; I++) {
+          if (TII->isMFMAorWMMA(*(I->getInstr())))
+            Cache->push_back(&*I);
+        }
+      }
+
+      if (Cache->empty())
+        return false;
+
+      auto Reaches = (std::any_of(
+          Cache->begin(), Cache->end(), [&SU, &DAG](SUnit *TargetSU) {
+            return DAG->IsReachable(TargetSU, const_cast<SUnit *>(SU));
+          }));
+
+      return Reaches;
+    }
+    IsPipeExp(const SIInstrInfo *TII, unsigned SGID, bool NeedsCache = false)
+        : InstructionRule(TII, SGID, NeedsCache) {}
+  };
+
+  /// Whether or not the instruction enables the exact MFMA that is the \p
+  /// Number th MFMA in the chain starting with \p ChainSeed
+  class EnablesNthMFMA final : public InstructionRule {
+  private:
+    unsigned Number = 1;
+
+  public:
+    bool apply(const SUnit *SU, const ArrayRef<SUnit *> Collection,
+               SmallVectorImpl<SchedGroup> &SyncPipe) override {
+      bool FoundTrans = false;
+      unsigned Counter = 1;
+      auto DAG = SyncPipe[0].DAG;
+
+      if (Cache->empty()) {
+        auto TII = SyncPipe[0].TII;
+        SmallVector<SUnit *, 8> Worklist;
+
+        auto I = DAG->SUnits.begin();
+        auto E = DAG->SUnits.end();
+        for (; I != E; I++) {
+          if (!FoundTrans) {
+            if (TII->isTRANS(I->getInstr()->getOpcode()))
+              FoundTrans = true;
+            continue;
+          } else {
+            if (TII->isMFMAorWMMA(*I->getInstr())) {
----------------
jrbyrnes wrote:

The basic stages of the CK pipeline for these fused attention kernels are:
	1. A series of MFMA chains.
	2. A series of V_EXP on the outputs of 1.
	3. Another series of MFMA chains on the outputs of 2.

(see the IR in corresponding *.ll)

Since each EXP is dependent upon all MFMAs in stage 1, we cannot interleave those. However, each MFMA in stage 3 is only dependent on a subset of the V_EXPs in stage 2, so we can interleave those so long as we carefully pick which EXP / MFMA go into which "interleaving slot" (i.e. SchedGroup in the pipeline). In other words, the only MFMAs relevant to interleaving for the subset of kernels to which this is applied are the MFMAs that occur after the first V_EXP.
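
To make the slot idea concrete, here is a minimal toy sketch of the stage 2 / stage 3 interleaving. It is not the code in this PR: `Stage3MFMA`, `NeededExps`, and `interleaveExpWithMFMA` are hypothetical names, and the dependency sets are supplied by hand rather than read from the scheduling DAG. Each stage 3 MFMA gets one slot: first the EXPs it still needs, then the MFMA itself.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy model (hypothetical, not LLVM code): each stage 3 MFMA lists the
// indices of the stage 2 EXPs whose results it consumes.
struct Stage3MFMA {
  std::vector<std::size_t> NeededExps;
};

// Build one "interleaving slot" per stage 3 MFMA: emit the EXPs this MFMA
// needs that have not been emitted yet, then emit the MFMA itself.
std::vector<std::string>
interleaveExpWithMFMA(std::size_t NumExps,
                      const std::vector<Stage3MFMA> &MFMAs) {
  std::vector<bool> Emitted(NumExps, false);
  std::vector<std::string> Order;
  for (std::size_t M = 0; M < MFMAs.size(); ++M) {
    for (std::size_t E : MFMAs[M].NeededExps) {
      assert(E < NumExps && "EXP index out of range");
      if (!Emitted[E]) {
        Emitted[E] = true;
        Order.push_back("EXP" + std::to_string(E));
      }
    }
    Order.push_back("MFMA" + std::to_string(M));
  }
  return Order;
}
```

In this toy model, an EXP shared by two MFMAs is placed in the slot of the earlier MFMA, so every MFMA still sees all of its inputs before it issues.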

Some other contextual info:

In PreRA, we can analyze the dependency relationships to understand exactly which V_EXPs are needed for which stage 3 MFMAs. Thus, we use EnablesNthMFMAInChain to match an EXP that enables an exact MFMA (and IsExactMFMA to match the exact MFMA).

In PostRA, the dependency relationships are muddied via PhysReg, so we cannot do the exact analysis. However, we assume that the PreRA mutations have been successful, and thus the V_EXP and V_MFMA are already in the relative order we want. We just need to preserve that relative ordering; hence EnablesNthMFMA.
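
As a rough illustration of the PostRA counting idea, the sketch below walks a linear instruction sequence and finds the Nth MFMA that appears after the first TRANS (V_EXP) instruction. It only loosely follows the `FoundTrans` / `Counter` loop visible in the quoted hunk: `Kind` and `nthMFMAAfterFirstTrans` are hypothetical names, and a plain vector stands in for `DAG->SUnits`.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical instruction classification, standing in for the
// TII->isTRANS / TII->isMFMAorWMMA queries used in the patch.
enum class Kind { MFMA, TRANS, Other };

// Return the position of the Number-th (1-based) MFMA that occurs after the
// first TRANS instruction, or -1 if there is no such MFMA. MFMAs before the
// first TRANS (stage 1 in the description above) are never counted.
int nthMFMAAfterFirstTrans(const std::vector<Kind> &Instrs, unsigned Number) {
  bool FoundTrans = false;
  unsigned Counter = 0;
  for (std::size_t I = 0; I < Instrs.size(); ++I) {
    if (!FoundTrans) {
      if (Instrs[I] == Kind::TRANS)
        FoundTrans = true;
      continue;
    }
    if (Instrs[I] == Kind::MFMA && ++Counter == Number)
      return static_cast<int>(I);
  }
  return -1;
}
```

Because the sketch relies only on program order, it needs no dependency information, which is the property the PostRA rule depends on.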



https://github.com/llvm/llvm-project/pull/81342