[PATCH] D98604: [MCA] Ensure that writes occur in-order

Tue Mar 16 12:58:22 PDT 2021

asavonic added a comment.

In D98604#2628693 <https://reviews.llvm.org/D98604#2628693>, @andreadb wrote:

> Is there a reason why you don't allow the bypassing of the write-back hazard check for all `RetiresOOO` instructions?
>
> NOTE: normally this may not be a problem in practice, since `RetiresOOO` instructions (for what I have read) tend to have a high write latency. Still, it sounds a bit odd to me that we lift the check in one case, but not the other.

You're right, I missed that. The  issue does not show up in the tests because of high latency of FDIV. This is now fixed in the new revision of the patch.

> The above example is very important because it clearly contradicts my original assumption on the write-back checks.
> If the write-back check was still performed, then the ADD would be in the critical path, and the block latency would be proportional to the length of the ADDs tail.
> However, your comment suggests that benchmarks report a smaller latency, and therefore the ADD sequence latency is hidden by the FDIV latency. Meaning, that those ADDs are not in the critical path. It should become even more obvious if you make that tail of ADDs even longer (but still small enough to be fully hidden by the FDIV latency).

Right, this was the intention of the experiment.
I also found that the same feature is described differently in the Cortex-R52 manual:

  Floating-point divisions and square roots (VDIV and VSQRT instructions)
  operate out-of-order. These instructions always complete in a single cycle
  but there is a delay in the division or square root result being written to
  the register file. Subsequent instructions are allowed to issue, execute,
  and retire, provided they do not depend on the result of the VDIV, or VSQRT,
  and they are not VDIV or VSQRT themselves. If a dependency is detected, the
  pipeline interlocks and waits for the availability of the result before
  continuing.

I think this confirms our findings.

Although this part is a bit problematic: "Subsequent instructions are allowed to
issue, execute, and retire, provided they do not depend on the result of the
VDIV, or VSQRT, **and they are not VDIV or VSQRT themselves**". MCA does not handle
this, but this should never happen (at least for A55), because the FDIV/VSQRT
instructions go through the same FP-pipeline, and take `ResourceCycles`
according to their latency.

> Back to the code:
> I still think that you only need a latency value instead of storing `LastIssuedInst`. Also, shouldn't `LastIssuedInst` always point to the last non-`RetireOOO` instruction?

Right, both points should be addressed in the new revision of the patch.

>   Note that I had to fix the FDIV latency/resource-cycles in LLVM scheduler model.
>   Original values match the ARM documentation, but I get different timings just by running FDIV in a loop.
>
> Unrelated to this patch. However, can it be that the FDIV latency is somehow dependent on the values passed in input? It may be worthy to run the same micro-benchmark with a variety of different values in input, to make sure that the execution units are not taking shortcuts. Just an idea.

Documentation for Cortex-A55 does not mention this, and the manual for
Cortex-R52 states that latencies for FDIV are data-independent. But that is a
good point. I should set all registers to specific values, and then change them
to see the difference.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D98604/new/

https://reviews.llvm.org/D98604