[PATCH] D104149: [MCA] Adding the CustomBehaviour class to llvm-mca

Patrick Holland via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Fri Jun 11 13:23:41 PDT 2021


holland11 created this revision.
holland11 added reviewers: arsenm, qcolombet, andreadb.
holland11 added a project: LLVM.
Herald added subscribers: foad, kerbowa, jfb, gbedwell, hiraditya, tpr, mgorny, nhaehnle, jvesely.
holland11 requested review of this revision.
Herald added a subscriber: wdng.

**TLDR**: 
Some instructions are not defined well enough within the target’s scheduling model for llvm-mca to be able to properly simulate their behaviour. The ideal solution to this situation is to modify the scheduling model, but that’s not always a viable strategy. Maybe other parts of the backend depend on that instruction being modelled the way that it is. Or maybe the instruction is quite complex and it’s difficult to fully capture its behaviour with tablegen. The CustomBehaviour class (which I will refer to as CB frequently) is designed to provide intuitive scaffolding for developers to implement the correct modelling for these instructions.

**Some quick notes before I get into some of the implementation details**: 
I built this class specifically for a downstream architecture, but I tried to design it in a way that would be useful for all targets. I wanted to be able to submit these changes upstream, so I found an instruction from the AMDGPU instruction set that is used quite frequently but isn’t currently modelled properly (resulting in significantly inaccurate analysis from llvm-mca). My knowledge of AMDGPU is limited, and there is also an issue with flags not being set correctly on many of the relevant AMDGPU instructions, so my implementation isn’t perfect. However, it should still provide an example use case for CB and an example of how someone could use the class. If any AMDGPU developers would like to improve my implementation, I encourage them to do so. And on that note, if the AMDGPU folks don’t want this at all, we can definitely remove it from the patch and stick to pushing CB without any upstream examples.

This patch only integrates CB into the in-order pipeline; however, it was designed to fit into both pipelines. I’m much more familiar with the in-order pipeline (and CB fits into it a bit more intuitively), but I plan on adding it to the out-of-order pipeline in the future. I would like some feedback on the best way to do that, but for now I want to keep the scope of this patch as limited as possible, so I will make a separate post soon to open a discussion on integrating CB into the out-of-order pipeline.

Also, I'm Canadian so I hope the 'u' in behaviour doesn't bother anyone :).

**Implementation details**:

llvm-mca does its best to extract relevant register, resource, and memory information from every `MCInst` when lowering it to an `mca::Instruction`. It then uses this information to detect dependencies and simulate stalls within the pipeline. For some instructions, the information that gets captured within the `mca::Instruction` is not enough for mca to simulate them properly. In these cases, there are two main possibilities:

1. The instruction has a dependency that isn’t detected by mca.
2. mca is incorrectly enforcing a dependency that shouldn’t exist.

For the rest of this discussion, I will be focusing on (1), but I have put some thought into (2) and I may revisit it in the future.

So we have an instruction with dependencies that aren’t picked up by mca. The basic idea for both pipelines in mca is that when an instruction wants to be dispatched, we first check for register hazards and then we check for resource hazards. This is where CB is injected. If no register or resource hazards have been detected, we make a call to `CustomBehaviour::checkCustomHazard()` to give the target-specific CB the chance to detect and enforce any custom dependencies.

The return value of `checkCustomHazard()` is an unsigned int representing the (minimum) number of cycles that the instruction needs to stall for. It’s fine to underestimate this value because when StallCycles gets down to 0, we’ll end up checking for all the hazards again before the instruction is actually dispatched. However, it’s important not to overestimate the value, and the more accurate your estimate is, the more efficient mca’s execution can be.
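
To make the check ordering and the stall-cycle contract concrete, here is a small self-contained C++ sketch. Every identifier below is a stand-in invented for illustration (the real hook is `CustomBehaviour::checkCustomHazard()` and the real checks live inside the issue stage), so treat it as a sketch of the idea rather than the patch’s code:

  // Hypothetical sketch of the dispatch-time hazard checks described above.
  // All names are placeholders, not the patch's actual identifiers.
  #include <cstdio>

  struct InstRefStub { unsigned Index; };

  // Stand-ins for the existing register/resource hazard checks.
  static bool hasRegisterHazard(const InstRefStub &) { return false; }
  static bool hasResourceHazard(const InstRefStub &) { return false; }

  // Stand-in for the custom hook: returns the minimum number of cycles the
  // instruction must stall (0 means no custom hazard was detected).
  static unsigned checkCustomHazardStub(const InstRefStub &) { return 2; }

  // Mirrors the order described above: register hazards, then resource
  // hazards, then the custom hazard hook. Underestimating the returned stall
  // is safe because every check runs again once StallCycles reaches 0.
  static unsigned computeStallCycles(const InstRefStub &IR) {
    if (hasRegisterHazard(IR) || hasResourceHazard(IR))
      return 1; // re-check on a later cycle
    return checkCustomHazardStub(IR);
  }

  int main() {
    InstRefStub IR{0};
    std::printf("stall cycles: %u\n", computeStallCycles(IR));
  }

The key point is the final return: the custom hook only runs once the ordinary hazard checks pass, and whatever it returns is treated as a minimum stall, not an exact one.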

In general, for `checkCustomHazard()` to be able to detect these custom dependencies, it needs information about the current instruction and also all of the instructions that are still executing within the pipeline. The mca pipeline uses `mca::Instruction` rather than `MCInst`, and the current information encoded within each `mca::Instruction` isn’t sufficient for my use cases. I had to add a few extra attributes to the `mca::Instruction` class and have them set from the `MCInst` during instruction building. For example, the current `mca::Instruction` doesn’t know its opcode, and it also doesn’t know anything about its immediate operands (both of which I had to add to the class).

With information about the current instruction, a list of all currently executing instructions, and some target-specific objects (`MCSubtargetInfo` and `MCInstrInfo`, which the base CB class has references to), developers should be able to detect and enforce most custom dependencies within `checkCustomHazard`. If you need more information than is present in the `mca::Instruction`, feel free to add attributes to that class and have them set during the lowering sequence from `MCInst`.
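
As a rough illustration of the kind of derived class a target might write, here is a hypothetical sketch with simplified stand-in types. The real base class holds references to `MCSubtargetInfo` and `MCInstrInfo` and the real hook receives the current instruction plus the executing instructions; the type names, the opcode constant, and the stall logic below are all made up for the example:

  #include <cstdint>
  #include <vector>

  // Minimal stand-in for the extra state described above (opcode and
  // immediate operands carried on mca::Instruction).
  struct InstrStub {
    unsigned Opcode;
    std::vector<int64_t> Immediates;
    bool IsExecuting;
  };

  // Stand-in for the CB base class: the default hook reports no hazard.
  struct CustomBehaviourStub {
    virtual ~CustomBehaviourStub() = default;
    virtual unsigned checkCustomHazard(const std::vector<InstrStub> &,
                                       const InstrStub &) {
      return 0;
    }
  };

  // Hypothetical target-specific CB: stall while anything is still in flight,
  // but only for one specific (made-up) opcode with a hidden dependency.
  struct MyTargetCustomBehaviour : CustomBehaviourStub {
    static constexpr unsigned WaitLikeOpcode = 42; // placeholder opcode

    unsigned checkCustomHazard(const std::vector<InstrStub> &Executing,
                               const InstrStub &IR) override {
      if (IR.Opcode != WaitLikeOpcode)
        return 0;
      for (const InstrStub &I : Executing)
        if (I.IsExecuting)
          return 1; // conservative one-cycle stall; checks re-run next cycle
      return 0;
    }
  };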

Fortunately, in the in-order pipeline, it’s very convenient for us to pass these arguments to `checkCustomHazard`. The hazard checking is taken care of within `InOrderIssueStage::canExecute()`. This function takes a `const InstRef` as a parameter (representing the instruction that currently wants to be dispatched) and the `InOrderIssueStage` class maintains a `SmallVector<InstRef, 4>` which holds all of the currently executing instructions. For the out-of-order pipeline, it’s a bit trickier to get the list of executing instructions and this is why I have held off on implementing it myself. This is the main topic I will bring up when I eventually make a post to discuss and ask for feedback.

CB is a base class from which targets implement their own derived classes. If a target-specific CB does not exist (or we pass in the -disable-cb flag), the base class is used. This base class trivially returns 0 from its `checkCustomHazard()` implementation (meaning that the current instruction needs to stall for 0 cycles, aka no hazard is detected). For this reason, targets or users who choose not to use CB shouldn’t see any negative impact on accuracy or performance (in comparison to pre-patch llvm-mca).
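
Conceptually, the fallback works like the sketch below (stand-in types again; `-disable-cb` is the real flag, everything else is invented for illustration):

  #include <memory>

  // Trivial base: never reports a custom hazard, so never forces a stall.
  struct CBStub {
    virtual ~CBStub() = default;
    virtual unsigned checkCustomHazard() { return 0; }
  };

  // Hypothetical target-specific override.
  struct TargetCBStub : CBStub {
    unsigned checkCustomHazard() override { return 0; /* target logic here */ }
  };

  // Use the target's CB when one exists and CB is not disabled; otherwise
  // fall back to the base class, matching pre-patch behaviour.
  std::unique_ptr<CBStub> createCB(bool DisableCB, bool TargetHasCB) {
    if (!DisableCB && TargetHasCB)
      return std::make_unique<TargetCBStub>();
    return std::make_unique<CBStub>();
  }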

**AMDGPU Example**:

In AMDGPU, there is a family of `s_waitcnt` instructions. The purpose of these instructions is to force a stall until certain memory operations have been completed. For example, if we are loading a value to `gpr5` and then using `gpr5`, one of the backend passes will detect this and place an `s_waitcnt` instruction before the use to ensure that `gpr5` has been written to. These instructions interact with 4 different ‘counters’ (vmcnt, expcnt, lgkmcnt, and vscnt), and `s_waitcnt` can force a wait on any number of them. Each memory instruction also interacts with 1 or more of these counters: (oversimplifying) the instruction increments those counters when it begins executing and then decrements them once it has written its result.

From a high level, when CB sees an `s_waitcnt` instruction, it will first decode its operand to determine what it’s waiting for. CB then needs to look at all of the currently executing instructions to determine whether we need to stall or not. This is where things get a little tricky with respect to AMDGPU.
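
Here is a hedged sketch of that decision with stand-in types rather than the patch’s actual AMDGPU code; the counter names are real AMDGPU concepts, but the decode function, bitmask layout, and everything else below are placeholders:

  #include <cstdint>
  #include <vector>

  enum Counter : uint8_t { VM = 1, EXP = 2, LGKM = 4, VS = 8 };

  struct InFlightInst {
    uint8_t CountersInUse; // counters this memory op has incremented
  };

  // Stand-in for decoding the s_waitcnt operand into the set of counters it
  // waits on (the real code decodes the immediate encoding).
  static uint8_t decodeWaitedCounters(uint64_t /*WaitcntImm*/) {
    return VM | LGKM; // placeholder result
  }

  // Stall if any in-flight instruction still holds a counter the wait targets.
  // Returning 1 is a safe underestimate: the hazard is re-checked every cycle.
  unsigned checkWaitcntHazard(uint64_t WaitcntImm,
                              const std::vector<InFlightInst> &InFlight) {
    uint8_t Waited = decodeWaitedCounters(WaitcntImm);
    for (const InFlightInst &I : InFlight)
      if (I.CountersInUse & Waited)
        return 1;
    return 0;
  }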

For each instruction within the pipeline, we need to determine if that instruction interacts with any of the counters (and which ones specifically). The backend pass that I mentioned earlier (the one that adds `s_waitcnt` instructions during codegen) has a function that does just that. However, that pass deals with ‘pseudo’ instructions, whereas in mca we are dealing with ‘real’ instructions. This is relevant because many of the flags that the logic depends on (such as mayLoad, mayStore, and even some target-specific flags) are not always shared between the ‘pseudo’ and ‘real’ instructions. This makes it really difficult to determine with certainty which instructions interact with which counters.

To combat this, I made some conservative assumptions (such as assuming mayLoad and mayStore will always be true). These assumptions result in some instructions being flagged as interacting with more counters than they actually do. I do not make any assumptions about the target-specific flags, though, and this results in certain instructions (such as `ds_read` and `ds_write` variants) not being detected at all. If the AMDGPU folks would like to use llvm-mca, I highly suggest they look into having the flags properly applied to the ‘real’ instructions and then adapt the logic in this stage from the `SIInsertWaitcnts::updateEventWaitcntAfter()` function.
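
To show the shape of those conservative assumptions, here is one more stand-in sketch; the flag-to-counter mapping is illustrative only (the real logic reads the instruction descriptions and is more involved):

  #include <cstdint>

  enum Counter : uint8_t { VM = 1, EXP = 2, LGKM = 4, VS = 8 };

  struct InstFlagsStub {
    bool MayLoad;  // unreliable on 'real' opcodes, so ignored and assumed true
    bool MayStore; // likewise ignored and assumed true
    bool IsLDS;    // target-specific flag; not assumed, so ds_* can be missed
  };

  uint8_t classifyCounters(const InstFlagsStub &F) {
    uint8_t Counters = 0;
    Counters |= VM; // as if mayLoad were always true (over-approximation)
    Counters |= VS; // as if mayStore were always true (over-approximation)
    if (F.IsLDS)
      Counters |= LGKM; // only counted when the flag is actually present
    return Counters;
  }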

Another issue that I had to face was that mca (properly) enforces in-order writes on all instructions (even ones that don’t write anything, such as `s_waitcnt`). This means that if there is a 300-cycle memory instruction currently executing, and then an `s_waitcnt` comes along (that isn’t meant for the 300-cycle instruction), we’ll still end up stalling for those ~300 cycles (because otherwise, the `s_waitcnt` instruction would finish before the memory instruction). However, mca supports a `RetireOOO` flag that can be set on scheduling classes within tablegen. For this reason, I modified the AMDGPU tablegen slightly to set that flag for the `s_waitcnt` variants. The flag is only used within mca, so (in theory) this shouldn’t have any side-effects. As mentioned earlier, if the AMDGPU folks do not want these changes, I’m happy to remove this example from the patch.

With all that background information out of the way, let’s take a look at a few examples.

  ds_read_u8 v1, v0
  s_waitcnt expcnt(0)
  v_mul_lo_u32 v5, v3, v4
  s_waitcnt lgkmcnt(0)

`ds_read_u8` doesn’t interact with `expcnt` at all, so the first `s_waitcnt` shouldn’t force a stall. I’m not certain how the `v_mul` instruction should behave as far as in-order writes are concerned (in AMDGPU, should the `v_mul` be allowed to complete before the `ds_read`?). Either way, the first wait should be dispatched and executed as soon as it’s seen.

Here’s the mca timeline BEFORE this patch:

  Timeline view:
                      0123456789          
  Index     0123456789          0123456789
  [0,0]     DeeeeeeeeeeeeeeeeeeeE    .      ds_read_u8 v1, v0
  [0,1]     .    .    .    .  DeE    .      s_waitcnt expcnt(0)
  [0,2]     .    .    .    .   DeeeeeeeE    v_mul_lo_u32 v5, v3, v4
  [0,3]     .    .    .    .    .    DeE    s_waitcnt lgkmcnt(0)

Notice how mca’s enforcement of the in-order writes causes the first `s_waitcnt` to stall.

Now here’s the mca timeline without CB, but with the `RetireOOO` flag set for all `s_waitcnt` instructions:

  Timeline view:
                      0123456789 
  Index     0123456789          0
  [0,0]     DeeeeeeeeeeeeeeeeeeeE   ds_read_u8 v1, v0
  [0,1]     .DeE .    .    .    .   s_waitcnt expcnt(0)
  [0,2]     .    .    . DeeeeeeeE   v_mul_lo_u32 v5, v3, v4
  [0,3]     .    .    .  DeE    .   s_waitcnt lgkmcnt(0)

And here’s with CB:

  Timeline view:
                      0123456789   
  Index     0123456789          012
  [0,0]     DeeeeeeeeeeeeeeeeeeeE .   ds_read_u8 v1, v0
  [0,1]     .DeE .    .    .    . .   s_waitcnt expcnt(0)
  [0,2]     .    .    . DeeeeeeeE .   v_mul_lo_u32 v5, v3, v4
  [0,3]     .    .    .    .    DeE   s_waitcnt lgkmcnt(0)

The first `s_waitcnt` correctly doesn’t get stalled. The second `s_waitcnt` correctly does get stalled.

Here’s a different example (the formatting breaks in this post due to the long lines, so I use screenshots for the timelines):

  s_load_dwordx4 s[0:3], s[4:5], 0x0
  s_waitcnt vmcnt(0) lgkmcnt(0)
  s_waitcnt_vscnt null, 0x0
  global_load_dword v1, v0, s[0:1] glc dlc
  s_waitcnt vmcnt(0)

Here is the timeline BEFORE this patch has been applied:
https://i.imgur.com/OZQtrE1.png

You’ll notice that the simulation is actually fairly accurate for this example, but it’s accurate for the wrong reasons. The `s_waitcnt` instructions aren’t waiting because they know they have to wait for the specific `s_load_dword` and `global_load_dword` instructions. They are waiting because mca is enforcing the in-order write-backs. While this may produce a reasonable output in this case, it’s not good in general.

Here is the timeline AFTER this patch has been applied, but with the `-disable-cb` flag (so basically what it would look like to run the example through mca with the `s_waitcnt` instructions having the `RetireOOO` flag):
https://i.imgur.com/s62cKYJ.png

You can see now that the `s_waitcnt` instructions are allowed to finish out-of-order, and without CB to enforce their behaviour, they don’t force a stall.

Now here’s the patch with CB enabled:
https://i.imgur.com/VMlG0DO.png

CB is able to detect that `s_waitcnt vmcnt(0) lgkmcnt(0)` needs to wait for `s_load_dwordx4` to finish. And `s_waitcnt vmcnt(0)` needs to wait for `global_load_dword` to finish.

Overall, due to the flags not being reliable, if the AMDGPU developers are not interested in improving / correcting my implementation, it would probably be best to remove it from the patch. I really just wanted an upstream example to show how CB could be used, but I didn’t anticipate all of the complications that arose from this specific example. llvm-mca can be a really helpful tool for many different purposes, but especially for finding mistakes in scheduling models. For example, I noticed that a sequence of `ds_read` instructions does not get simulated properly:

  ds_read_u8 v1, v0
  ds_read_u8 v2, v0 offset:1
  ds_read_u8 v3, v0 offset:2
  ds_read_u8 v4, v0 offset:3
  ds_read_u8 v5, v0 offset:4

  Timeline view:
                      0123456789          0123456789          0123456789          0123456789          0123456789 
  Index     0123456789          0123456789          0123456789          0123456789          0123456789          0
  [0,0]     DeeeeeeeeeeeeeeeeeeeE    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   ds_read_u8 v1, v0
  [0,1]     .    .    .    .    DeeeeeeeeeeeeeeeeeeeE    .    .    .    .    .    .    .    .    .    .    .    .   ds_read_u8 v2, v0 offset:1
  [0,2]     .    .    .    .    .    .    .    .    DeeeeeeeeeeeeeeeeeeeE    .    .    .    .    .    .    .    .   ds_read_u8 v3, v0 offset:2
  [0,3]     .    .    .    .    .    .    .    .    .    .    .    .    DeeeeeeeeeeeeeeeeeeeE    .    .    .    .   ds_read_u8 v4, v0 offset:3
  [0,4]     .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    DeeeeeeeeeeeeeeeeeeeE   ds_read_u8 v5, v0 offset:4

As far as I understand, these instructions shouldn’t have to wait for each other. However, `ds_read_u8` is modelled as having both a read and a write of the `M0` register, which mca detects as a register hazard. I looked in the ISA and I’m fairly confident that this isn’t accurate. If you start using mca, you will likely be able to find a lot of other discrepancies such as this one. And if you do want to start using the tool, the CB class can offer you the scaffolding to implement any instruction behaviour that you can’t / don’t want to model in the tablegen itself.


Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D104149

Files:
  llvm/docs/CommandGuide/llvm-mca.rst
  llvm/include/llvm/MCA/Context.h
  llvm/include/llvm/MCA/CustomBehaviour.h
  llvm/include/llvm/MCA/HWEventListener.h
  llvm/include/llvm/MCA/Instruction.h
  llvm/include/llvm/MCA/Stages/InOrderIssueStage.h
  llvm/lib/MCA/CMakeLists.txt
  llvm/lib/MCA/Context.cpp
  llvm/lib/MCA/CustomBehaviour.cpp
  llvm/lib/MCA/InstrBuilder.cpp
  llvm/lib/MCA/Stages/InOrderIssueStage.cpp
  llvm/lib/Target/AMDGPU/SISchedule.td
  llvm/test/tools/llvm-mca/AMDGPU/gfx10-waitcnt-custom-behaviour.s
  llvm/test/tools/llvm-mca/AMDGPU/gfx10-waitcnt-depctr-unsupported-warning.s
  llvm/tools/llvm-mca/CMakeLists.txt
  llvm/tools/llvm-mca/Views/DispatchStatistics.cpp
  llvm/tools/llvm-mca/lib/AMDGPU/AMDGPUCustomBehaviour.cpp
  llvm/tools/llvm-mca/lib/AMDGPU/AMDGPUCustomBehaviour.h
  llvm/tools/llvm-mca/lib/AMDGPU/CMakeLists.txt
  llvm/tools/llvm-mca/lib/CMakeLists.txt
  llvm/tools/llvm-mca/lib/X86/CMakeLists.txt
  llvm/tools/llvm-mca/lib/X86/X86CustomBehaviour.cpp
  llvm/tools/llvm-mca/lib/X86/X86CustomBehaviour.h
  llvm/tools/llvm-mca/llvm-mca.cpp
