[PATCH] D129161: [MachinePipeliner] Consider only direct path successors when calculating circuit latency

Tue Jul 19 12:16:14 PDT 2022

hgreving added a comment.

In D129161#3663456 <https://reviews.llvm.org/D129161#3663456>, @barannikov88 wrote:

> In D129161#3663186 <https://reviews.llvm.org/D129161#3663186>, @JamesNagurne wrote:
>
>> This was the issue I commented on in both of your comments re: including the back edge of the circuit in the Latency calculations for NodeSet. Our backend 'skips' the PHI and calculates the operand latency across the backedge to the true use.
>
> I think I was talking about something different, most probably updatePhiDependences.
> As you can see there, the Anti dependence created in this function gets the latency 1, not considering the latency of the instruction which feeds the PHI. Similarly, instructions which depend on PHIs get latency 0. This roughly means that the latency computed between the real instructions connected through a PHI node is 1, which in many cases is far from accurate.
>
>> We recently resolved this by implementing a post-expansion insertion of scheduling barriers. Each modulo cycle is considered its own region and is, therefore, not reordered. There's some magic in the post-RA scheduler that undoes this so that we can actually schedule the whole kernel, but I digress.

> We recently resolved this by implementing a post-expansion insertion of scheduling barriers. Each modulo cycle is considered its own region and is, therefore, not reordered. There's some magic in the post-RA scheduler that undoes this so that we can actually schedule the whole kernel, but I digress.

Ideally, LLVM (I am assuming you're working on a VLIW architecture) should support bundling pre-RA. Until then, a downstream target needs to use workarounds like this (we're doing something similar). Neither pre-RA nor post-RA scheduling makes sense to run on blocks that were pipelined.

> Interesting approach, thanks for sharing this! We've been thinking about storing the computed schedule somewhere outside the pipeliner pass (e.g. metadata), disabling both pre- and post-RA schedulers for pipelined loops, and then using the recorded schedule post-RA to form instruction bundles. But we've never been able to resolve issues with copies / spills inserted by the register allocator. They are inserted in-between scheduled "regions", which adds extra cycles. While it is possible to try to fold the inserted instructions into the nearest bundle, in general it would require at least partial rescheduling to avoid hazards / stalls. Don't know how it all ended, it has been some time since I left the project.
>
>> It's a good workaround for sure, but I believe this is a viable and correct fix to a real bug that may impact non-Hexagon users. I am, however, willing to reconsider if there is a real issue with the change.
>
> I'll try to give it a closer look.

I think the fix is probably good as long as the assumption about the path order of Johnson's algorithm is correct. It's also better to underestimate RecII than to overestimate it. It's important to note that LLVM's upstream implementation of finding the cycle length also collects the nodes that are part of the recurrence, because swing scheduling relies on those sets for prioritization. Other pipelining algorithms only require the length of a recurrence, not the nodes that are part of it. Finding the MII for recurrences can be implemented much faster with a computed minimum distance matrix than with the upstream implementation (though again, swing does need the node sets).
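
To illustrate the minimum distance matrix remark, here is a rough sketch of how a recurrence-constrained MII could be computed without enumerating circuits. It is not the upstream MachinePipeliner code: DepEdge, recurrencesFit and recMII are hypothetical names made up for this example, and the dependence graph is assumed to be given as a plain edge list with a latency and an iteration distance per edge. For a candidate II, each edge is weighted Latency - II * Distance, and II satisfies all recurrences if no cycle has positive weight.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical edge type for this sketch; not an LLVM data structure.
struct DepEdge {
  int Src;      // producer node index
  int Dst;      // consumer node index
  int Latency;  // cycles from Src issue until Dst may issue
  int Distance; // iteration distance (0 = same iteration, 1 = loop-carried)
};

// Returns true if, at this II, no dependence cycle has positive weight,
// where each edge weighs Latency - II * Distance.
bool recurrencesFit(int NumNodes, const std::vector<DepEdge> &Edges, int II) {
  const int64_t NegInf = INT64_MIN / 4; // "no path"
  const int64_t Cap = INT64_MAX / 4;    // clamp to avoid overflow
  std::vector<std::vector<int64_t>> D(
      NumNodes, std::vector<int64_t>(NumNodes, NegInf));
  for (const DepEdge &E : Edges) {
    int64_t W = int64_t(E.Latency) - int64_t(II) * E.Distance;
    D[E.Src][E.Dst] = std::max(D[E.Src][E.Dst], W);
  }
  // Floyd-Warshall over (max, +): D[I][J] becomes the longest path weight.
  for (int K = 0; K < NumNodes; ++K)
    for (int I = 0; I < NumNodes; ++I)
      for (int J = 0; J < NumNodes; ++J) {
        if (D[I][K] == NegInf || D[K][J] == NegInf)
          continue;
        int64_t Sum = std::min(D[I][K] + D[K][J], Cap);
        D[I][J] = std::max(D[I][J], Sum);
      }
  // A positive diagonal entry means some recurrence does not fit in II.
  for (int I = 0; I < NumNodes; ++I)
    if (D[I][I] > 0)
      return false;
  return true;
}

// Smallest II that satisfies every recurrence, or -1 if some cycle has
// positive latency but zero total distance (no II can satisfy it).
int recMII(int NumNodes, const std::vector<DepEdge> &Edges) {
  int Upper = 1;
  for (const DepEdge &E : Edges) {
    assert(E.Latency >= 0 && E.Distance >= 0);
    Upper += E.Latency; // no simple cycle can be longer than this
  }
  for (int II = 1; II <= Upper; ++II)
    if (recurrencesFit(NumNodes, Edges, II))
      return II;
  return -1;
}
```

Each feasibility check is cubic in the number of nodes regardless of how many circuits the graph contains, and the linear scan over II could be replaced by a binary search. The sketch only yields the bound, not the node sets that swing scheduling needs.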

>>> Excuse my English
>>
>> Honestly? I didn't even notice :)
>
> Oh, really? This is inspiring to say the least, thank you!

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D129161/new/

https://reviews.llvm.org/D129161