[llvm] AMDGPU: Fix temporal divergence introduced by machine-sink and performance regression introduced by D155343 (PR #67456)

Mon Oct 2 17:18:30 PDT 2023

https://github.com/nhaehnle commented:

Thanks for the rework. I still think this needs to be improved to be done right.

The test case still feels way too large. It will suffer from random unrelated changes. It ought to be possible to write a more focused MIR test that really captures the essence of what's happening: A uniform value inside of a cycle is defined in an SGPR vreg, which is then used by a VALU instruction that would ordinarily be a candidate for sinking.

The solution itself also doesn't feel the it's From The Book. Consider:

* Recomputing DomTree, Cycle Info, and Machine Uniformity during the fixup feels like a lot of effort
* Machine Uniformity on wave-level MIR seems like a fundamentally dubious proposition in the first place
* The resulting code isn't great. In the new test case, it would almost certainly be better to not sink the `v_add` and leave it inside the loop.

Let's have some ambition to do better than this.

The old logic that prevented sinking of instructions was a convoluted way of preventing the VALU instruction from being moved past an EXEC change. But that didn't really capture the *essence* of what the bug was.

What I think is closer to the essence of the bug is that the VALU instruction reads from an SGPR, and sinking it changes the value that is being read.

Take a look at FindSuccToSinkTo. There's a piece of logic there that will refuse to sink an instruction that uses a physical register, unless that physical register is constant. That feels very similar to the kind of problem from the SWDEV-407790 test. Unfortunately, it's difficult to fit an analogous test for the SGPR condition into the same loop over operands, because we need to know the sink target in order to perform the check.

Idea: Let's have a `TII->isSafeToSinkTo` callback which, in AMDGPU, checks if there are SGPR uses. If so, and the sink target is outside of a cycle that contains the SGPR def, then deny the sinking.

This is a good first start:

* It will still allow the sinking where we want it to be allowed:
  * Sinking in sink-after-control-flow.mir is allowed because there are no cycles involved
  * Sinking in machine-sink-loop-var-out-of-divergent-loop-swdev407790.mir and .ll are allowed because the relevant SGPR def is outside of the cycle
* By contrast, sinking in machine-sink-temporal-divergence-swdev407790.ll is prevented
  * This is good for correctness
  * It is better for performance than adding the extra `v_mov`

It is not perfect, because there are potentially cases where sinking would have been beneficial:

* If adding an extra `v_mov` would allow multiple instructions to be sunk.
* If the sink candidate is more expensive than a `v_mov` (e.g., transcendental)

Now in general, we should not be relying on machine sinking that much anyway. For the most part, LLVM IR passes should clean that up. Machine sinking should only help cleanup "debris" from instruction selection. It seems unlikely to me that such debris would show up for the two cases above, so while the initial idea could perhaps be refined, I'd lean towards just not worrying about it for now.

A final thought: All of `blockPrologueInterferes` becomes wasteful as far as I can tell. Block prologue instructions are instructions that define EXEC. If we skip them because EXEC is an ignorable use, then the check isn't doing us any good. Let's just remove it.

With that in mind, can you please:

1. Change the main commit in this PR so that it adds the `TII->isSafeToSinkTo` callback and checks it at the end of FindSuccToSinkTo.
2. Add a final commit which removes all traces of `blockPrologueInterferes`. (Make it a separate commit so that it's easier to revert if it turns out I was wrong.)

https://github.com/llvm/llvm-project/pull/67456