[PATCH] D75138: [WIP][AMDGPU] Add Scratch Wave Offset to Scratch Buffer Descriptor in entry functions

Wed Mar 4 16:53:37 PST 2020

scott.linder marked 7 inline comments as done.
scott.linder added a comment.

There are still a reasonable amount of FIXME/TODO in this patch, and I left some additional comments on each to highlight them and ask for feedback on them. I am not entirely comfortable with the way I went about implementing the special-casing for having no FP in the entry function. I would prefer not having all of the `isEntryFunction` checks everywhere, but I'm not sure how else to represent it?

I also would rather break this patch up more, but I don't think doing so will make it easier to understand or reduce the size of the test diffs. The only pieces I could break off naturally were some NFC changes in https://reviews.llvm.org/D75092 and switching to s33 for FP in https://reviews.llvm.org/D75657

================
Comment at: llvm/lib/Target/AMDGPU/SIFrameLowering.cpp:290
+  unsigned NumPreloadedSGPRs = MFI->getNumPreloadedSGPRs();
+  // FIXME: This is just lifted from AMDGPUAsmPrinter, because I'm not
+  // sure where/if we track InReg SGPR arguments otherwise.
----------------
arsenm wrote:
> This should not need to inspect the original IR. Why can't this just read it directly from MFI? They should be accounted there already?
@arsenm @nhaehnle I don't think I understand how `inreg` currently works relative to "preloaded" SGPRs; is/should `inreg` be recorded somewhere in the machine function info so this isn't necessary?

================
Comment at: llvm/lib/Target/AMDGPU/SIFrameLowering.cpp:309
+    // FIXME: The preloaded SGPR count doesn't seem to be completely accurate,
+    // SITargetLowering::allocateSystemSGPRs just picks the next free SGPR for
+    // the scratch wave offset. To work around this we ask the caller for the
----------------
Similar question here, should there be a change in `SITargetLowering` so the preloaded count is correct?

================
Comment at: llvm/lib/Target/AMDGPU/SIFrameLowering.cpp:602-603
+
+    // Save and restore SRSRC bits [48:63]. We only want to update the base
+    // address in bits [0:47].
+    BuildMI(MBB, I, DL, TII->get(AMDGPU::S_AND_B32), SavedWord)
----------------
scott.linder wrote:
> arsenm wrote:
> > arsenm wrote:
> > > arsenm wrote:
> > > > arsenm wrote:
> > > > > arsenm wrote:
> > > > > > Do we actually need these bits? I'm fairly confident these are always 0 in the HSA resource descriptor (or at least are a known constant we can just reproduce later)
> > > > > According to this it's hardcoded: https://github.com/RadeonOpenCompute/ROCR-Runtime/blob/master/src/core/runtime/amd_aql_queue.cpp#L1015
> > > > > 
> > > > > We just need to worry about SWIZZLE_ENABLE being set to 1. This is the high bit, so all it can do is trigger a carry on the second add. So I think that means you can get away with just doing the add, and then using s_bitset1_b32 to ensure it wasn't carried away
> > > > Actually, I don't think any add that fits in the 48-bit address space should ever touch the high bits (although I usually manage to be wrong about known bits optimizations with adds)
> > > I think this means it's OK to just not worry about the high bits: https://rise4fun.com/Alive/i24
> > > 
> > As long as we know bit 48 is 0, this seems fine. As this is hardcoded in the driver, this is probably OK https://rise4fun.com/Alive/KmH
> That make sense to me, and this would simplify things a lot. I don't quite understand if we need to ensure [48:62] are 0, though? If the addc carries into bit 48 is that an issue? I.e. https://rise4fun.com/Alive/qsv
> 
> At the very least, it seems like we can avoid the need to save anything and just mask in a constant, but if it is possible to avoid that too it removes a couple additional instructions from nearly every kernel prologue.
I went the route of just always doing the 64-bit add of the scratch wave offset into the SRsrc rather than saving anything or using known constants for some of the bits. From some other discussion this should always be correct.

================
Comment at: llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h:340
+  // function, this is 0.
   unsigned FrameOffsetReg = AMDGPU::FP_REG;

----------------
arsenm wrote:
> These should be switched to Register at some point
I haven't gotten around to this yet, but I'll do this in another NFC patch.

================
Comment at: llvm/test/CodeGen/AMDGPU/attr-amdgpu-num-sgpr.ll:8
+; seems to help GreedyRA avoid a spill, but somehow ends up with another
+; SGPR used.
+
----------------
Can anyone help me understand what we are trying to test here? It seems likely the amount of live SGPRs and the amount of available SGPRs needs to be adjusted to have this test continue to be meaningful, but in trying to correct it I realized I wasn't sure what it was testing in the first place.

================
Comment at: llvm/test/CodeGen/AMDGPU/scratch-simple.ll:100
+; for GFX9). What exactly is the test trying to verify, and is the change to
+; mark the scratch wave offset as "killed" by the new setup in the prologue OK?
----------------
@arsenm @nhaehnle Similar question as above wrt. how `inreg` should work. Is the `%swo` argument in these expected to actually be allowed to coincide with the scratch wave offset?

================
Comment at: llvm/test/CodeGen/AMDGPU/spill-offset-calculation.ll:52
+; FIXME: If we fail to scavenge an SGPR in a kernel we don't have a stack
+; pointer to temporarily update, so we just crash.

----------------
Is it OK for us to fail here? This is a consequence of not having a frame pointer in entry functions and not being able to e.g. restart RA after we realize we really need it in this case.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D75138/new/

https://reviews.llvm.org/D75138