[llvm-bugs] [Bug 49499] llvm-mca for cortex-a57 gets thrown off by SIMD loads with dependencies (negative latency?)

via llvm-bugs llvm-bugs at lists.llvm.org
Wed Mar 10 04:41:40 PST 2021


https://bugs.llvm.org/show_bug.cgi?id=49499

Martin Storsjö <martin at martin.st> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |INVALID
             Status|NEW                         |RESOLVED

--- Comment #2 from Martin Storsjö <martin at martin.st> ---
(In reply to Andrea Di Biagio from comment #1)
> tl;dr: there is no negative latency. WAW dependencies are effectively broken
> by the register renamer.
> 
> Cortex-a57 is an out-of-order processor. The default llvm-mca pipeline for
> out-of-order processors assumes the presence of a register renamer.
> 
> It means that false dependencies are effectively broken by the register
> renamer at the cost of consuming a physical register.
> 
> As far as I understand, each ADD has a latency of 3cy. Also, ADD instructions
> are in a dependency chain. When simulating multiple iterations, there is an
> implicit loop-carried dependency (i.e. the first ADD of an iteration must
> wait for the result from the last ADD of a previous iteration). That's why
> latency converges to 900cy for the first experiment.
> 
> In the second experiment, you have inserted a load which writes the same
> registers defined by the following ADD instructions.
> The LD1 introduces new definitions for v0.16b, v1.16b, v2.16b, v3.16b.
> There is a WAW dependency on each of those registers. In the absence of
> register renaming, that load would need to wait until those registers are
> written. In practice, however, the register renamer breaks those
> dependencies, so the LOAD doesn't need to wait on those definitions.
> 
> The throughput of LD1 is still limited (roughly one LD1 every 4 cycles).
> Therefore, every 4 cycles, the first ADD of a new iteration can start
> execution. That's how you end up with that low number of cycles.
> 
> The last example is just like the first, with the extra LD1. The LD1 is
> independent from the other instructions, so it can always execute as soon as
> the units are available.
> 
> NOTE: by default, llvm-mca assumes that register renaming is always
> successful (i.e. as if there is an unbounded number of phys registers
> available for renaming). Renaming can be limited by introducing an (optional)
> `RegisterFile` definition in the scheduling model. For an example of
> `RegisterFile`, see the definition of `JIntegerPRF` in
> X86/X86ScheduleBtver2.td.

Thanks for the thorough explanation! That does indeed explain it, and by
setting e.g. `--iterations 1`, I also see numbers that match up better with my
expectations.
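
As a rough illustration of the explanation above, here is a toy model (not
llvm-mca itself, and not the Cortex-A57 scheduling model) of how breaking WAW
dependencies changes the cycle count. The kernel shape (three chained ADDs on
v0), the 5cy load latency and the `simulate` helper are assumptions made up for
this sketch; only the 3cy ADD latency and the "one LD1 every 4 cycles" limit
come from the discussion.

```python
def simulate(kernel, iterations, renaming):
    """Return the cycle at which the last result is ready.

    kernel: list of (dst_regs, src_regs, latency, port, occupancy).
    port=None means the instruction's throughput is not modelled.
    """
    ready = {}      # register -> cycle when its latest definition is available
    port_free = {}  # port -> first cycle it can accept another instruction
    finish = 0
    for _ in range(iterations):
        for dst_regs, src_regs, latency, port, occupancy in kernel:
            # RAW: wait for every source register.
            start = max((ready.get(r, 0) for r in src_regs), default=0)
            if not renaming:
                # WAW: without a renamer, also wait for the previous
                # definition of every destination register.
                start = max([start] + [ready.get(r, 0) for r in dst_regs])
            if port is not None:
                start = max(start, port_free.get(port, 0))
                port_free[port] = start + occupancy
            done = start + latency
            for r in dst_regs:
                ready[r] = done
            finish = max(finish, done)
    return finish


# Hypothetical kernels: three dependent ADDs (3cy each) on v0 ...
ADD_CHAIN = [(("v0",), ("v0",), 3, None, 0)] * 3
# ... optionally preceded by an LD1 that redefines v0-v3
# (assumed 5cy latency, one load accepted every 4 cycles).
WITH_LOAD = [(("v0", "v1", "v2", "v3"), (), 5, "load", 4)] + ADD_CHAIN

print(simulate(ADD_CHAIN, 100, renaming=True))    # 900
print(simulate(WITH_LOAD, 100, renaming=True))    # 410
print(simulate(WITH_LOAD, 100, renaming=False))   # 1400
```

In this sketch the ADD-only kernel takes 3 x 3cy x 100 iterations = 900cy
because of the loop-carried chain. Adding the load lets every iteration start
a fresh chain as soon as a load slot is free when renaming is on, while with
renaming off the load has to wait for the previous iteration's last ADD. The
absolute numbers for the load cases depend entirely on the assumed latencies
and should not be read as llvm-mca output.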
