[PATCH] D98604: [MCA] Ensure that writes occur in-order

Tue Mar 16 05:18:48 PDT 2021

andreadb added a comment.

Hi Andrew,

Thanks to your last comment, I think I know have a better idea of how this all should work.
I still have some questions though (see below).

> I don't think that Cortex-A55 can execute instructions out-of-order. Some FP instructions can commit/retire out-of-order,
> but they use a different set of registers and their timing characteristics are aligned (according to the documentation).

OK. So out-of-order commit doesn't necessarily imply out-of-order execution. Makes sense.

Back to your check:

  if (LastIssuedInst && !LastIssuedInst.getInstruction()->getDesc().RetireOOO) {
    // Delay the instruction to ensure that writes occur in program
    // order
    if (unsigned StallWritesOrder = checkWritesOrder(LastIssuedInst, IR)) {
      *StallCycles = StallWritesOrder;
    }
  }

Unless I've read the code wrongly (which is a possibility...), then the check is only on the predecessor.
This works for example where an ADD follows an FDIV. However, it doesn't work if the FDIV follows the ADD. Is this expected?

According to that check, a `RetiresOOO` instruction following an ADD still needs to check for any potential structural hazards. It means that we still want it to reach the write-back stage in-order, even though - technically speaking - it is marked as `RetiresOOO`.

Is there a reason why you don't allow the bypassing of the write-back hazard check for all `RetiresOOO` instructions?

NOTE: normally this may not be a problem in practice, since `RetiresOOO` instructions (for what I have read) tend to have a high write latency. Still, it sounds a bit odd to me that we lift the check in one case, but not the other.

  Can you elaborate why ADD needs to wait for the completion of FDIV?
  If FDIV is allowed to retire out-of-order, then it can retire after ADD, so ADD doesn't have to wait.

So, the current semantic allows an ADD (or any other instruction following a `RetireOOO` instruction) to ignore the structural hazard check on the write-back.

Example:

  [0,0]     DeeeeeeeeeeeER .    .    .   fdiv	s1, s2, s3
  [0,1]     DeeER.    .    .    .    .   add	w8, w8, #1

However - at least in theory - while the FDIV can commit out-of-order, the ADD is not "special" in any way. In normal circumstances, I would have expected the ADD to be subject to the "usual" structural hazard checks for the write-back.

Thankfully, your comment clarified this to me. You mentioned the following very important experiment:

  The model we have in MCA now seems to match the hardware (at least by IPC):

  [0,0]     DeeeeeeeeeeeER .    .    .   fdiv	s1, s2, s3
  [0,1]     DeeER.    .    .    .    .   add	w8, w8, #1
  [0,2]     .DeeER    .    .    .    .   add	w1, w2, w0
  [0,3]     .DeeER    .    .    .    .   add	w3, w4, #1
  [0,4]     . DeeER   .    .    .    .   add	w5, w6, w0
  [0,5]     .    .  DeeeeeeeeeeeER   .   fdiv	s4, s5, s6
  [0,6]     .    .  DeeER  .    .    .   add	w7, w9, w0
  [0,7]     .    .   DeeER .    .    .   add	w3, w1, w3
  [0,8]     .    .   DeeER .    .    .   add	w7, w2, w4
  [0,9]     .    .    DeeER.    .    .   add	w5, w5, w9

The above example is very important because it clearly contradicts my original assumption on the write-back checks.
If the write-back check was still performed, then the ADD would be in the critical path, and the block latency would be proportional to the length of the ADDs tail.
However, your comment suggests that benchmarks report a smaller latency, and therefore the ADD sequence latency is hidden by the FDIV latency. Meaning, that those ADDs are not in the critical path. It should become even more obvious if you make that tail of ADDs even longer (but still small enough to be fully hidden by the FDIV latency).

Back to the code:
I still think that you only need a latency value instead of storing `LastIssuedInst`. Also, shouldn't `LastIssuedInst` always point to the last non-`RetireOOO` instruction?

  Note that I had to fix the FDIV latency/resource-cycles in LLVM scheduler model.
  Original values match the ARM documentation, but I get different timings just by running FDIV in a loop.

Unrelated to this patch. However, can it be that the FDIV latency is somehow dependent on the values passed in input? It may be worthy to run the same micro-benchmark with a variety of different values in input, to make sure that the execution units are not taking shortcuts. Just an idea.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D98604/new/

https://reviews.llvm.org/D98604