[PATCH] D124678: [AMDGPU] Allow for MFMA Inst Clustering

Wed May 4 09:41:47 PDT 2022

kerbowa added inline comments.

================
Comment at: llvm/test/CodeGen/AMDGPU/mfma-cluster.mir:154
+    ; BOTHSCHEDPASS-NEXT: $agpr0_agpr1_agpr2_agpr3 = V_MFMA_F32_4X4X1F32_e64 $vgpr1, $vgpr0, killed $agpr0_agpr1_agpr2_agpr3, 0, 0, 0, implicit $mode, implicit $exec
+    ; BOTHSCHEDPASS-NEXT: $vgpr2 = V_MOV_B32_e32 1, implicit $exec
+    ; BOTHSCHEDPASS-NEXT: $agpr4_agpr5_agpr6_agpr7 = V_MFMA_F32_4X4X1F32_e64 $vgpr3, $vgpr4, killed $agpr4_agpr5_agpr6_agpr7, 0, 0, 0, implicit $mode, implicit $exec
----------------
jrbyrnes wrote:
> rampitec wrote:
> > jrbyrnes wrote:
> > > rampitec wrote:
> > > > So the cluster does not really hold?
> > > Currently, clusters can be broken by (1) higher-priority instructions or (2) independent instructions.
> > > 
> > > Here we see an independent instruction filling the gap caused by a hardware hazard. I have tried disabling fillMFMAShadow, but this does not change the behavior. I think that if we want unbroken clusters using SDep::Cluster, we need to address this via a different SchedStrategy (specifically, the logic in tryCandidate and pickNode). Should I start looking into this?
> > > 
> > > In the context of CK, broken clusters will cause problems if clusters of different types blend together. However, I think dependencies will prevent this; it is hard to say without sample MIR.
> > If that is caused by a hazard, there is nothing we can really do about it; it will be broken one way or another. Thanks.
> Hey Stas -- thanks for your thoughts on this.
> 
> Based on your comment yesterday, I looked deeper into the broken-cluster issue and found a couple of flaws in this clustering algorithm that can result in avoidable broken clusters. I have addressed these and will upload an updated patch soon.
> 
> However, these changes will not affect the hazard issue identified here. I have just experimented with a feature that resolves MAI hazards in the scheduler before picking the next node. With this scheduler hack, we get perfect clustering for these tests. If we want unbroken clusters, I think we will need to expose a hacked scheduler. I have also considered bundling instead of clustering, but I don't think that will work.
I don't think it is necessarily a problem if they are not perfect clusters. The main idea is to have MAC clusters and VMEM/LDS clusters, not a specific number of perfectly sequential MFMAs.
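To make the two behaviours discussed above concrete (an independent instruction filling a hazard gap versus resolving the hazard before picking the next node), here is a toy model. It is a hypothetical sketch, not LLVM's GenericScheduler or the patch's logic: `Inst`, `schedule()`, `keep_clusters` and `HAZARD_WAIT` are made-up names, and the two-cycle MFMA-to-MFMA wait is an assumption chosen only for illustration.

```python
from dataclasses import dataclass, field

# Assumed (hypothetical) wait states required between two MFMAs.
HAZARD_WAIT = 2


@dataclass
class Inst:
    name: str
    is_mfma: bool
    deps: list = field(default_factory=list)  # names of predecessors


def schedule(insts, keep_clusters=False):
    """Greedy cycle-by-cycle list scheduler (toy model, not LLVM's).

    MFMAs are always preferred so they tend to cluster. If the next MFMA is
    blocked by the hazard, the default policy issues any other ready
    instruction into the gap; keep_clusters=True stalls instead.
    """
    done, order = set(), []
    pending = list(insts)
    cycle, last_mfma = 0, None

    def issuable(inst):
        # Non-MFMAs are never blocked; an MFMA must wait HAZARD_WAIT cycles.
        if not inst.is_mfma or last_mfma is None:
            return True
        return cycle - last_mfma >= HAZARD_WAIT

    while pending:
        ready = [i for i in pending if all(d in done for d in i.deps)]
        ready_mfma = [i for i in ready if i.is_mfma]
        if keep_clusters and ready_mfma:
            # Resolve the hazard first: only an MFMA may issue, else stall.
            candidates = [i for i in ready_mfma if issuable(i)]
        else:
            # Stable sort: MFMAs first, then everything else in input order.
            candidates = sorted((i for i in ready if issuable(i)),
                                key=lambda i: not i.is_mfma)
        if candidates:
            pick = candidates[0]
            order.append(pick.name)
            done.add(pick.name)
            pending.remove(pick)
            if pick.is_mfma:
                last_mfma = cycle
        cycle += 1  # a cycle with no pick models a hazard stall
    return order


insts = [Inst("MFMA0", True), Inst("MFMA1", True), Inst("V_MOV", False)]
print(schedule(insts))                      # ['MFMA0', 'V_MOV', 'MFMA1']
print(schedule(insts, keep_clusters=True))  # ['MFMA0', 'MFMA1', 'V_MOV']
```

The default policy reproduces the shape of the BOTHSCHEDPASS output quoted above (MFMA, V_MOV, MFMA); the `keep_clusters` variant trades an idle cycle for an unbroken MFMA pair.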

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D124678/new/

https://reviews.llvm.org/D124678