[PATCH] D104730: [MCA] [AMDGPU] Adding CustomBehaviour implementation for AMDGPU.

Tue Jun 22 11:20:46 PDT 2021

holland11 created this revision.
holland11 added reviewers: andreadb, qcolombet, foad, arsenm.
Herald added subscribers: kerbowa, jfb, gbedwell, hiraditya, t-tye, tpr, dstuttard, yaxunl, nhaehnle, jvesely, kzhuravl.
holland11 requested review of this revision.
Herald added a subscriber: wdng.
Herald added a project: LLVM.

**TLDR** The `s_waitcnt` instructions do not currently have the proper behaviour within `llvm-mca` (due to the scheduling model itself not expressing what that proper behaviour is). This patch utilizes the `CustomBehaviour` class within mca to enforce that behaviour so that mca can simulate the instructions properly. The patch also makes some slight modifications to the AMDGPU scheduling model, but the side effects of these changes //should// be isolated to mca (the RetireOOO flag that is added to `s_waitcnt` instructions is specific to mca).

This patch is related to https://reviews.llvm.org/D104149. In that patch, I ended up pushing an empty implementation for the AMDGPUCustomBehaviour class. @foad then fixed some of the instruction flags and I was able to re-implement the waitcnt logic to be much more similar to `SIInsertWaitcnts::updateEventWaitcntAfter()`.

Some imperfections within this implementation that I'd like to point out:

- AFAIU, most waitcnt related instructions only increment and decrement the relevant CNTs by 1, however there are a handful of instructions that increment and decrement by more than 1. I do not have a good enough understanding of this, so the implementation treats them all as if they increment/decrement by only 1. Someone with a stronger understanding can feel free to make some changes to get this more accurate if they want.
- `AMDGPUCustomBehaviour::generateWaitCntInfo()` is the function that determines whether an instruction interacts with any of the CNTs and which CNTs it interacts with (if any). The logic is very similar to `SIInsertWaitcnts::updateEventWaitcntAfter()` however, there are function calls from that pass that I was unable to recreate so for now I'm just making conservative assumptions. Refer to the comments within `generateWaitcntInfo()` for an explanation. Two of the functions that I couldn't recreate are related to memory operands and I'm not entirely sure if it's possible to recreate those functions using `MCInst` rather than the `MachineInstr` that are used in the pass this is all based on. The third function that I was unable to recreate is just used to determine how old the specific subtarget is. There is probably a way to recreate this functionality, but I'm not familiar enough with the different subtargets so I didn't want to mess around with it.

Here is the relevant comment for the last two sentences above:

  // This should be:
  // if (GCNTarget.vmemWriteNeedsExpWaitcnt() &&
  //    (MCID.mayStore() || (MCID.TSFlags & SIInstrFlags::IsAtomicRet)))
  // where the GCNTarget::vmemWriteNeedsExpWaitcnt() function is
  // { return getGeneration() < SEA_ISLANDS; }
  // But I'm not sure how to get the subtarget's generation from here.
  // For now, conservatively assume that this is true, but maybe an
  // AMDGPU dev can suggest a solution.

So if anyone knows how I can recreate the `{ return getGeneration() < SEA_ISLANDS; }` behaviour using either a `AMDGPU::IsaVersion` or a `MCSubtargetInfo`, I'd be happy to update this diff before it gets committed, or you can just patch it in the future.

Here's a summary of what this implementation is doing:

- The `AMDGPUInstrPostProcess` class is used during the lowering from `MCInst` to `mca::Instruction`. `mca::Instruction` objects do not have any information about immediate operands (since they aren't normally relevant for hazard checking in mca), however the `s_waitcnt` instructions have important information contained within an immediate operand so we use the `InstrPostProcess` class to store those operands for any `s_waitcnt` instructions within the source.
- When the `AMDGPUCustomBehaviour` class is constructed (shortly before the pipeline is created within mca), we call the `generateWaitcntInfo` method. This method iterates over each of the `mca::Instruction` objects to determine (and store for later use) which instructions interact with which CNTs (`vmcnt`, `expcnt`, `lgkmcnt`, `vscnt`). This could also be done on the fly, but since in general, mca simulates the source for multiple iterations, it's more efficient to only have to run this logic once for each instruction. A debugger can be used to pause after this method is run and then you can inspect the `InstrWaitCntInfo` vector to check whether each instruction is being associated with the correct CNTs.
- During the pipeline simulation, whenever an `s_waitcnt` instruction is encountered, we check each of the instructions that are currently executing within the pipeline, and using the information stored within `InstrWaitCntInfo`, we determine if this `s_waitcnt` instruction should force a stall or not.

And three more issues worth noting:

- I'm not sure how the `s_waitcnt_depctr` instruction works. I couldn't find any information within the ISAs, so I just didn't implement anything for this instruction.
- I believe the `s_endpgm` instruction also behaves as if there is an implicit `s_waitcnt 0` instruction before it, however I did not implement this behaviour.
- Before this patch, the `s_waitcnt` instructions all used the `WriteSALU` scheduling class. To give the `s_waitcnt` instructions the `RetireOOO` flag, I created a new scheduling class `WriteSALUOOO` that should behave exactly like `WriteSALU` and should only be associated with the different `s_waitcnt` instructions. For some models, these classes have a latency of 2, and for others they have a latency of 1. The `WriteSALUOOO` class also uses the same hardware units as `WriteSALU`. Someone who knows more about the low level behaviour of `s_waitcnt` may be able to fine tune this a bit better (with respect to latency and hardware units).

Thank you for your time. If you have any questions or suggestions, I'd be happy to discuss further. If you want me to make any modifications, I'm happy to do so (as long they are explained well enough for me to actually make the changes myself). If you want to make any modifications to this class in the future, I encourage you to do so.

Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D104730

Files:
  llvm/lib/Target/AMDGPU/SISchedule.td
  llvm/tools/llvm-mca/lib/AMDGPU/AMDGPUCustomBehaviour.cpp
  llvm/tools/llvm-mca/lib/AMDGPU/AMDGPUCustomBehaviour.h

-------------- next part --------------
A non-text attachment was scrubbed...
Name: D104730.353701.patch
Type: text/x-patch
Size: 18660 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20210622/7c1b47b5/attachment.bin>