[llvm] [AMDGPU] Introduce iglp_opt(2): Generalized exp/mfma interleaving for select kernels (PR #81342)
Jeffrey Byrnes via llvm-commits
llvm-commits at lists.llvm.org
Mon Feb 19 11:49:00 PST 2024
================
@@ -902,6 +904,921 @@ void MFMASmallGemmOpt::applyIGLPStrategy(
SchedGroupMask::MFMA, 1, PipelineSyncID, DAG, TII);
SG->initSchedGroup(SyncedInstrs[SG->getSyncID()]);
}
+
+ return true;
+}
+
+class MFMAExpInterleaveOpt final : public IGLPStrategy {
+private:
+ SmallVector<SUnit *, 4> MFMAChainSeeds;
+ // Compute the heuristics for the pipeline, returning whether or not the DAG
+ // is well formatted for the mutation
+ bool analyzeDAG(const SIInstrInfo *TII);
+
+ /// Whether or not the instruction is a transitive predecessor of an MFMA
+ /// instruction
+ class IsPipeExp final : public InstructionRule {
+ public:
+ bool apply(const SUnit *SU, const ArrayRef<SUnit *> Collection,
+ SmallVectorImpl<SchedGroup> &SyncPipe) override {
+
+ auto DAG = SyncPipe[0].DAG;
+ auto TII = SyncPipe[0].TII;
+
+ if (Cache->empty()) {
+ auto I = DAG->SUnits.rbegin();
+ auto E = DAG->SUnits.rend();
+ for (; I != E; I++) {
+ if (TII->isMFMAorWMMA(*(I->getInstr())))
+ Cache->push_back(&*I);
+ }
+ }
+
+ if (Cache->empty())
+ return false;
+
+ auto Reaches = (std::any_of(
+ Cache->begin(), Cache->end(), [&SU, &DAG](SUnit *TargetSU) {
+ return DAG->IsReachable(TargetSU, const_cast<SUnit *>(SU));
+ }));
+
+ return Reaches;
+ }
+ IsPipeExp(const SIInstrInfo *TII, unsigned SGID, bool NeedsCache = false)
+ : InstructionRule(TII, SGID, NeedsCache) {}
+ };
+
+ /// Whether or not the instruction enables the exact MFMA that is the \p
+ /// Number th MFMA in the chain starting with \p ChainSeed
----------------
jrbyrnes wrote:
The (stage 3, see below) MFMAs depend on one another: each chain of MFMAs accumulates a running sum, with every MFMA taking the previous MFMA's result as its accumulator operand.
e.g.
MFMA[0,0] = MFMA A00, B00, C
MFMA[0,1] = MFMA A01, B01, MFMA[0,0]
...
MFMA[1,0] = MFMA A10, B10, C
MFMA[1,1] = MFMA A11, B11, MFMA[1,0]
...
...
The MFMAChainSeeds would be those MFMAs that do not have an MFMA predecessor (MFMA[i,0] in this example). The chain length would be the max value of j + 1 in MFMA[0,j].
As another example, if there are 16 MFMAs and none of them depend on each other, we will have 16 MFMAChainSeeds.
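The seed and chain-length definitions above can be sketched on a toy model. This is only an illustration under stated assumptions, not the code from the PR: `ToyMFMA`, `ChainInfo`, and `findChains` are hypothetical names, each MFMA is assumed to have at most one MFMA predecessor (through its accumulator operand), and predecessors are assumed to appear earlier in the list.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an MFMA node; this is not an LLVM type.
struct ToyMFMA {
  // Index of the MFMA whose result feeds the accumulator operand,
  // or -1 if the accumulator does not come from another MFMA.
  int AccumPred;
};

struct ChainInfo {
  std::vector<std::size_t> Seeds; // MFMAs with no MFMA predecessor
  unsigned MaxChainLength = 0;    // longest running-sum chain
};

// Assumes MFMAs are listed so that every predecessor index is smaller
// than the index of the MFMA that uses it.
ChainInfo findChains(const std::vector<ToyMFMA> &MFMAs) {
  ChainInfo Info;
  std::vector<unsigned> Depth(MFMAs.size(), 0);
  for (std::size_t I = 0; I < MFMAs.size(); ++I) {
    int Pred = MFMAs[I].AccumPred;
    if (Pred < 0) {
      // No MFMA predecessor: this MFMA starts a new chain.
      Info.Seeds.push_back(I);
      Depth[I] = 1;
    } else {
      assert(static_cast<std::size_t>(Pred) < I && "predecessor must come first");
      Depth[I] = Depth[static_cast<std::size_t>(Pred)] + 1;
    }
    Info.MaxChainLength = std::max(Info.MaxChainLength, Depth[I]);
  }
  return Info;
}
```

With the two-chain example above encoded as `{-1, 0, -1, 2}`, the seeds are the MFMAs at indices 0 and 2 and the chain length is 2. With 16 independent MFMAs, all 16 are seeds and the chain length is 1.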
I think if there were documentation on this pattern, it would exist somewhere in the CK materials regarding the implementation of the fused_attention operation.
In terms of interleaving, this means that we must put MFMA[0,0] into an earlier MFMA slot than MFMA[0,1]. We must also be careful to put MFMA[0,1]'s V_EXP predecessors in the correct slots as well. Ideally, the greedy PipelineSolver would be able to figure all this stuff out, but it can't.
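The ordering constraint described above can be stated as a small check on a toy model. This is a sketch under assumptions, not the PipelineSolver or the strategy in the PR: `SlotDep` and `respectsDeps` are hypothetical names, and a "slot" is modeled simply as an integer position in the pipeline.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical toy model; this is not LLVM code.
struct SlotDep {
  std::size_t Pred; // instruction that must land in an earlier slot
  std::size_t Succ; // instruction that depends on Pred
};

// Slot[I] is the pipeline position assigned to instruction I.
// Returns true only if every predecessor sits in a strictly earlier
// slot than its successor.
bool respectsDeps(const std::vector<unsigned> &Slot,
                  const std::vector<SlotDep> &Deps) {
  for (const SlotDep &D : Deps)
    if (Slot[D.Pred] >= Slot[D.Succ])
      return false;
  return true;
}
```

For instance, numbering MFMA[0,0] as 0, a V_EXP that feeds MFMA[0,1] as 1, and MFMA[0,1] as 2, the dependencies are `{0, 2}` and `{1, 2}`. The assignment `{0, 1, 2}` passes, while any assignment that places the V_EXP or MFMA[0,0] at or after MFMA[0,1] fails.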
https://github.com/llvm/llvm-project/pull/81342