[PATCH] D94928: [llvm-mca] Add support for in-order CPUs

Wed Jan 20 09:38:25 PST 2021

asavonic added a comment.

Thanks for the review Andrea!

In D94928#2506972 <https://reviews.llvm.org/D94928#2506972>, @andreadb wrote:

> Your model assumes an unbounded queue of instructions (something like a rudimentary reservation station) where to store dispatched instructions.

If you mean `InstQueue`, then it is bounded by the `Bandwidth` variable: the
maximum number of instructions that can be issued in the next cycle.

> Correct me if I am wrong, but in-order processors don't use a reservation station.
> In the absence of structural hazards, if data dependencies are met, then uOPs are directly issued to the underlying execution units.
> So the dispatch event is not decoupled from the issue event.
>
> The fact that your patch adds an unbounded queue sounds a bit strange to me. Not sure what @dmgreen thinks about it.
> But this basically means that dispatch and issue are different events.

That is true. However, the problem is that the llvm-mca timeline view counts
stalls as the number of cycles between the dispatch and issue events. If
dispatch and issue always happen in the same cycle, no stalls are displayed:

  [0,3]     .  DeeER  .    .    add	w13, w30, #1
  [0,4]     .  DeeeER .    .    smulh	x30, x29, x28
  [0,5]     .     DeeeER   .    smulh	x27, x30, x28
  [0,6]     .        DeeeER.    smulh	xzr, x27, x26
  [0,7]     .    .    DeeeER    umulh	x30, x29, x28

To avoid this, the implementation emits a dispatch event for each instruction
that is scheduled to execute in the next cycle. If the instruction cannot
execute because of a hazard, its issue is delayed, and the stall is displayed
starting from the dispatch event:

  [0,3]     . DeeeER  .    .    add	w13, w30, #4095, lsl #12
  [0,4]     . DeeeeER .    .    smulh	x30, x29, x28
  [0,5]     .  D==eeeeER   .    smulh	x27, x30, x28
  [0,6]     .  D=====eeeeER.    smulh	xzr, x27, x26
  [0,7]     .    .  D=eeeeER    umulh	x30, x29, x28

I remember doing this intentionally, but now I'm not convinced that the
difference is worth the extra complexity. Let me know what you think.
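To make the difference between the two timelines concrete, here is a minimal sketch of the idea (not the code in this patch). `Inst`, `Event` and `simulate` are hypothetical names, and the model is deliberately simplified: single-issue, in order, with one read-after-write dependence per instruction. Issue cycles are identical in both modes; the only change is where the dispatch event is recorded, and that alone decides whether a timeline-style view reports any stall cycles.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical instruction descriptor (not an llvm-mca type).
struct Inst {
  int Latency;   // cycles until the result can be consumed
  int DependsOn; // index of the producing instruction, or -1 if none
};

// Dispatch/issue cycles recorded for one instruction.
struct Event {
  int DispatchCycle;
  int IssueCycle;
  // What a timeline-style view would report as stall cycles.
  int stalls() const { return IssueCycle - DispatchCycle; }
};

// Simplified single-issue, in-order model. An instruction becomes the issue
// candidate one cycle after its predecessor issued; a RAW hazard can delay its
// actual issue. With EarlyDispatch, dispatch is recorded when the instruction
// becomes the candidate; otherwise dispatch is recorded at issue.
std::vector<Event> simulate(const std::vector<Inst> &Prog, bool EarlyDispatch) {
  std::vector<Event> Events;
  std::vector<int> ReadyCycle(Prog.size(), 0); // cycle the result is available
  int NextCandidate = 0;
  for (std::size_t I = 0; I < Prog.size(); ++I) {
    int Producer = Prog[I].DependsOn;
    assert(Producer < static_cast<int>(I) && "producer must be older");
    int Issue = NextCandidate;
    if (Producer >= 0)
      Issue = std::max(Issue, ReadyCycle[static_cast<std::size_t>(Producer)]);
    ReadyCycle[I] = Issue + Prog[I].Latency;
    Events.push_back({EarlyDispatch ? NextCandidate : Issue, Issue});
    NextCandidate = Issue + 1; // in order: nothing younger issues earlier
  }
  return Events;
}
```

For a chain like the `smulh` sequence above (each multiply consuming the previous result, assuming a latency of 4), both modes issue the last multiply in cycle 8, but only the early-dispatch mode reports 3 stall cycles for each dependent multiply.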

> I also noticed how there are no checks on `NumMicroOps`. Is there a reason why you don't check for it?

Good point, I will fix that.

> In one of the tests, the target is dual-issue. However, there are cycles where three opcodes are dispatched.
> See for example the test where two loads are dispatched in the same cycle (with the first load decoded into two uOPs).

This should not happen. I will add a check for `NumMicroOps`.
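A minimal sketch of what such a check could look like, assuming the issue width is counted in micro-ops rather than instructions. `issueGroups` is a hypothetical helper, not llvm-mca code, and the handling of an instruction with more micro-ops than the issue width (let it start alone in an empty cycle) is an assumption made only so that the model cannot deadlock.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not llvm-mca code): splits a program-ordered list of
// per-instruction micro-op counts into issue groups, one group per cycle, for
// a core that can issue at most `Bandwidth` micro-ops per cycle.
// Assumption: an instruction with more micro-ops than `Bandwidth` may start
// alone in an otherwise empty cycle, so the model never deadlocks.
std::vector<unsigned> issueGroups(const std::vector<unsigned> &NumMicroOps,
                                  unsigned Bandwidth) {
  assert(Bandwidth > 0 && "issue width must be positive");
  std::vector<unsigned> Groups; // number of instructions started per cycle
  std::size_t I = 0;
  while (I < NumMicroOps.size()) {
    unsigned UsedSlots = 0, Count = 0;
    while (I < NumMicroOps.size()) {
      bool Fits = UsedSlots + NumMicroOps[I] <= Bandwidth;
      // In order: the first instruction that does not fit ends the group.
      if (!Fits && Count != 0)
        break;
      UsedSlots += NumMicroOps[I];
      ++Count;
      ++I;
      if (!Fits)
        break; // an oversized instruction takes the whole cycle
    }
    Groups.push_back(Count);
  }
  return Groups;
}
```

With a dual-issue width (`Bandwidth` = 2), two loads where the first decodes into two uOPs (`{2, 1}`) are split across two cycles instead of being issued together as three uOPs.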

================
Comment at: llvm/test/tools/llvm-mca/AArch64/Cortex/A55-all-views.s:116-117
+# CHECK-NEXT: [0,1]     D=eeeER   .    .    .   ldr	w5, [x3]
+# CHECK-NEXT: [0,2]     .D===eeeeER    .    .   madd	w0, w5, w4, w0
+# CHECK-NEXT: [0,3]     .   DeeeER.    .    .   add	x3, x3, x13
+# CHECK-NEXT: [0,4]     .    DeeeER    .    .   subs	x1, x1, #1
----------------
andreadb wrote:
> Why are these two executing out of order?
`madd` and `add` are issued in the same cycle, and `subs` is issued in the next
one. However, they should not retire out of order. Some instructions can retire
out of order, but not these.

I have to look into this. The in-order pipeline probably needs an RCU (retire
control unit) after all.
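For reference, here is a minimal sketch of the ordering property such a unit would enforce. `inOrderRetireCycles` is a hypothetical helper, not llvm-mca's `RetireControlUnit`, and it assumes an unlimited retire width: an instruction retires only once it has finished executing and every older instruction has retired. In the quoted timeline, `add` finishes one cycle before the older `madd`, so under this rule it would have to wait and retire together with `madd`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch (not the llvm-mca RetireControlUnit): given the cycle in
// which each instruction finishes executing, listed in program order, compute
// the cycle in which it retires. An instruction retires only once it has
// finished AND every older instruction has retired, so retirement is in order
// even when execution completes out of order. Retire width is unlimited here.
std::vector<int> inOrderRetireCycles(const std::vector<int> &ExecutedCycle) {
  std::vector<int> Retire(ExecutedCycle.size());
  int OldestAllowed = 0; // retire cycle of the previous (older) instruction
  for (std::size_t I = 0; I < ExecutedCycle.size(); ++I) {
    assert(ExecutedCycle[I] >= 0 && "cycles are non-negative");
    Retire[I] = std::max(ExecutedCycle[I], OldestAllowed);
    OldestAllowed = Retire[I];
  }
  return Retire;
}
```

For example, finish cycles `{3, 5, 4}` give retire cycles `{3, 5, 5}`: the third instruction finished early but waits for the second.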

Repository:
  rG LLVM Github Monorepo


https://reviews.llvm.org/D94928