[PATCH] D94928: [llvm-mca] Add support for in-order CPUs

Sun Jan 24 13:10:55 PST 2021

asavonic added inline comments.

================
Comment at: llvm/include/llvm/MCA/Stages/InOrderIssueStage.h:41-51
+  int NumExecuted;
+
+  /// Instructions that must be issued during the next cycle. If the front
+  /// instruction cannot execute due to an unmet register or resource
+  /// dependency, the whole queue is stalled for StallCyclesLeft.
+  std::queue<InstRef> InstQueue;
+  int StallCyclesLeft;
----------------
andreadb wrote:
> Please use unsigned quantities instead of signed integers.
Done.

================
Comment at: llvm/lib/MCA/CMakeLists.txt:20
   Stages/RetireStage.cpp
+  Stages/InOrderIssueStage.cpp
   Stages/Stage.cpp
----------------
RKSimon wrote:
> sorting
Thanks, fixed.

================
Comment at: llvm/lib/MCA/Context.cpp:31-33
 std::unique_ptr<Pipeline>
 Context::createDefaultPipeline(const PipelineOptions &Opts, SourceMgr &SrcMgr) {
   const MCSchedModel &SM = STI.getSchedModel();
----------------
andreadb wrote:
> andreadb wrote:
> > For readability reasons, I suggest to split this method into two methods.
> > 
> > For example, something like `if (!SM.isOutOfOrder()) createInOrderPipeline(opts, srcMgr);`.
> > 
> > You then need to move all the pipeline composition logic for in-order processors inside `createInOrderPipeline()`.
> > 
> > I think it would be much more readable (just my opinion though).
> I suggest to move the support for in-order pipeline composition into a separate function. I think it would help in terms of readability.
Done.

================
Comment at: llvm/lib/MCA/Context.cpp:31-40
 std::unique_ptr<Pipeline>
 Context::createDefaultPipeline(const PipelineOptions &Opts, SourceMgr &SrcMgr) {
   const MCSchedModel &SM = STI.getSchedModel();

   // Create the hardware units defining the backend.
-  auto RCU = std::make_unique<RetireControlUnit>(SM);
+  std::unique_ptr<RetireControlUnit> RCU;
+  if (SM.isOutOfOrder())
----------------
asavonic wrote:
> andreadb wrote:
> > andreadb wrote:
> > > For readability reasons, I suggest to split this method into two methods.
> > > 
> > > For example, something like `if (!SM.isOutOfOrder()) createInOrderPipeline(opts, srcMgr);`.
> > > 
> > > You then need to move all the pipeline composition logic for in-order processors inside `createInOrderPipeline()`.
> > > 
> > > I think it would be much more readable (just my opinion though).
> > I suggest to move the support for in-order pipeline composition into a separate function. I think it would help in terms of readability.
> Done.
Done.

================
Comment at: llvm/lib/MCA/Stages/InOrderIssueStage.cpp:34-37
+bool InOrderIssueStage::isAvailable(const InstRef &IR) const {
+  return Bandwidth > 0;
+}
+
----------------
andreadb wrote:
> We also need to check if IR "begins a group".
> Instructions that begin a group, must always be the first instructions to be dispatched in a cycle. See how the `isAvailable()` check is implemented by the DispatchStage.
> 
Thank you. I added a check for both BeginGroup and EndGroup.

================
Comment at: llvm/lib/MCA/Stages/InOrderIssueStage.cpp:35
+bool InOrderIssueStage::isAvailable(const InstRef &IR) const {
+  return Bandwidth > 0;
+}
----------------
andreadb wrote:
> Shouldn't we also add checks on `NumMicroOps` somewhere?
> One of the tests reports two loads dispatched within a same cycle. However one of the loads is 2 uOps, so - correct me if I am wrong - it shouldn't be possible to dispatch two of those in a same cycle.
> 
> Out of curiosity:
> 
> ``` ldr	w4, [x2], #4```
> 
> is the resource consumption info correct for that instruction?
Added a check for NumMicroOps.

Regarding the ldr: the optimization manual for Cortex-A55 states that throughput for most load instructions is 1, so I guess it is correct.

================
Comment at: llvm/lib/MCA/Stages/InOrderIssueStage.cpp:49
+/// instructions are met.
+static int checkRegisterHazard(const RegisterFile &PRF, const MCSchedModel &SM,
+                               const MCSubtargetInfo &STI, const InstRef &IR) {
----------------
andreadb wrote:
> As mentioned in another comment we should use unsigned quanitites whenever possible. So, int quantities that are only used to describe zero or more cycles should really be converted into unsigned.
> 
Done.

================
Comment at: llvm/lib/MCA/Stages/InOrderIssueStage.cpp:107-114
+  if (IS.isEliminated())
+    return;
+
+  for (ReadState &RS : IS.getUses())
+    PRF.addRegisterRead(RS, STI);
+
+  for (WriteState &WS : IS.getDefs())
----------------
andreadb wrote:
> I expect register renaming to only occur for out-of-order processors. So move elimination shouldn't ever happen in this context.
> 
> That being said, instructions that are fully eliminated at register renaming stage still have their writes propagated to the register file. So, if you want to correctly handle that case, then you cannot early exit; you still need to add writes.
> 
> Note that the only instructions that can be eliminated at register renaming stage are register moves.  See the definition and description of tablegen class `RegisterFile` in 
> https://github.com/llvm-mirror/llvm/blob/master/include/llvm/Target/TargetSchedule.td
> 
> I don't expect any of this to really affect in-order targets. So, personally I suggest to replace the initial if-stmt with an assert (i.e. check tha IS is NOT eliminated).
Thank you for the explanation! Replaced the if statement with an assert.

================
Comment at: llvm/lib/MCA/Stages/InOrderIssueStage.cpp:142
+
+  IS.dispatch(0);
+
----------------
andreadb wrote:
> Maybe add a comment here explaining why we are passing 0 as token-id.
> Since this is an in-order processor, the retire stage (and therefore the token id) are not really required in practice. We don't expect the retire stage to do anything in particular.
> 
The code is changed a bit to accommodate an RCU. Let me know if it still needs a comment.

================
Comment at: llvm/lib/MCA/Stages/InOrderIssueStage.cpp:148-150
+
+  --Bandwidth;
+  InstQueue.push(IR);
----------------
andreadb wrote:
> `Bandwidth` should be set to zero if IR `ends a group`.
> 
> You probaby need something semantically equivalent to what method DispatchStage::dispatch(InstRef IR) does:
>  
> ```
> // Check if this instructions ends the dispatch group.                                                                                                                      if (Desc.EndGroup)
>   AvailableEntries = 0;      
> ```
Done.

================
Comment at: llvm/test/tools/llvm-mca/AArch64/Cortex/A55-all-views.s:114-126
+# CHECK:      [0,0]     DeeeER    .    .    .   ldr	w4, [x2], #4
+# CHECK-NEXT: [0,1]     D=eeeER   .    .    .   ldr	w5, [x3]
+# CHECK-NEXT: [0,2]     .D===eeeeER    .    .   madd	w0, w5, w4, w0
+# CHECK-NEXT: [0,3]     .   DeeeER.    .    .   add	x3, x3, x13
+# CHECK-NEXT: [0,4]     .    DeeeER    .    .   subs	x1, x1, #1
+# CHECK-NEXT: [0,5]     .    D==eeeeER .    .   str	w0, [x21, x18, lsl #2]
+# CHECK-NEXT: [1,0]     .    . DeeeER  .    .   ldr	w4, [x2], #4
----------------
andreadb wrote:
> Interesting.
> According to this design, the "dispatch" event is still decoupled from the "issue" event. Is that expected?
> I am asking because in my mind, at least for in-order processors, the two events should coincide. Basically there is no reservation station, and uOPs are directly issued to the underlying execution unit at a IssueWidth maximum rate.
Fixed. Now a dispatch event and an issue event occur in the same cycle. 

================
Comment at: llvm/test/tools/llvm-mca/AArch64/Cortex/A55-all-views.s:116-117
+# CHECK-NEXT: [0,1]     D=eeeER   .    .    .   ldr	w5, [x3]
+# CHECK-NEXT: [0,2]     .D===eeeeER    .    .   madd	w0, w5, w4, w0
+# CHECK-NEXT: [0,3]     .   DeeeER.    .    .   add	x3, x3, x13
+# CHECK-NEXT: [0,4]     .    DeeeER    .    .   subs	x1, x1, #1
----------------
andreadb wrote:
> dmgreen wrote:
> > asavonic wrote:
> > > andreadb wrote:
> > > > asavonic wrote:
> > > > > andreadb wrote:
> > > > > > Why are these two executing out of order?
> > > > > Madd and add are issued in the same cycle, subs is issued next.
> > > > > However, they should not retire out-of-order. Some instructions can
> > > > > retire out-of-order, but not these.
> > > > > 
> > > > > I have to look into this. Probably an RCU is actually needed for the
> > > > > in-order pipeline.
> > > > > 
> > > > In theory, younger instructions should not be allowed to reach the write-back stage before older instructions because that would lead to out-of-order execution. 
> > > > In this case I was expecting a compulsory stall to artificially delay the issue of the `add` so that it can write-back in-order w.r.t. the madd.
> > > > What are those cases where it is allowed to write-back instructions out of order? Shouldn't architectural commits always happen in-order?
> > > > In theory, younger instructions should not be allowed to reach the write-back stage before older instructions because that would lead to out-of-order execution. 
> > > > In this case I was expecting a compulsory stall to artificially delay the issue of the `add` so that it can write-back in-order w.r.t. the madd.
> > > 
> > > I wonder how this works for instructions with early termination (sdiv, udiv). 
> > > @dmgreen, can you please comment on this?
> > > 
> > > > What are those cases where it is allowed to write-back instructions out of order? Shouldn't architectural commits always happen in-order?
> > > 
> > > From Cortex-A55 optimization manual, s3.5.1 "Instructions with out-of-order completion":
> > > > While the Cortex-A55 core only issues instructions in-order, due to the number of cycles required to complete more complex floating-point and NEON instructions, out-of-order retire is allowed on the instructions described in this section. The nature of the Cortex-A55 microarchitecture is such that NEON and floating-point instructions of the same type have the same timing characteristics.
> > Yeah, I was about to quote the same thing!
> > 
> > The Cortex-R52, another in-order AArch32 cpu similar to the A53 mentions:
> > https://developer.arm.com/documentation/100026/0103/Cycle-Timings-and-Interlock-Behavior/Pipeline-behavior/Dual-issuing
> > > If the Cortex-R52 processor determines that the pair must be dual-issued, it remains a pair until both instructions are retired.
> > 
> > So I believe it can depend on the CPU and possibly the instructions issued.
> I see.
> That unfortunately complicates the whole implementation.
> We do need a Retire Stage after all...
> 
> Is out-of-order execution limited in some ways in these processors?
> For example: do these processors perform register renaming to break false dependencies?
> If we don't need to worry about register renaming, then the implementation is simpler. I really hope that out-of-order execution is only really allowed in case of completely independent instructions , and just to avoid any bottlenecks caused by long latency instructions...
> 
> By the way, do we have a mechanism in place to identify instructions that can be executed out-of-order? I don't remember to have ever seen anything like that in the scheduling model.
> If not, then we need a way to mark those instructions somehow. In case, we might be and be able to reuse the MCSchedPredicate framework for that (fingers crossed).
I changed the RetireStage, RCU and the scheduler model to support this.
RetireControlUnit is now used if it is defined in the scheduler model. I ran a couple of tests on hardware, and MCA results seem to match it pretty well.

Some instructions require out-of-order retire. I added RetireOOO flag to MCSchedClass to handle this feature. Instructions with this flag do not block other instructions and may retire as soon as they complete execution.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D94928/new/

https://reviews.llvm.org/D94928