[PATCH] D115747: [AMDGPU] Hoist waitcnt out of loops when they unnecessarily wait for stores

Wed Dec 15 04:25:34 PST 2021

foad added a comment.

Here's the big question: is it REALLY necessary for shouldFlush to examine every instruction in the loop? This is quite expensive, and I think it could cause quadratic behaviour if you have lots of nested loops, since each instruction would be re-examined once per enclosing loop. Could we instead make the same decision just by comparing the WaitcntBrackets at the end of the preheader with the WaitcntBrackets coming from the loop backedge(s)? I think this would be a much cleaner approach if it is possible, even if we have to track a bit more state in WaitcntBrackets to make it work.

Note that the loop over basic blocks in runOnMachineFunction currently looks like:

  for each block:
    get saved state for this block
    process block
    merge new state into saved state for each successor

But if necessary we could change it to:

  for each block:
    merge saved state from each predecessor
    process block
    save state for this block
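
To make the difference between the two loop shapes concrete, here is a minimal C++ sketch on a toy CFG. Everything in it is hypothetical: State is a stand-in for WaitcntBrackets that tracks a single saturating counter, and Block, process, runPush, runPull, makeExampleCFG and pushAndPullAgree are names invented for illustration, not code from the pass. It only shows that the "push to successors" and "pull from predecessors" drivers reach the same fixed point on such a lattice.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-in for WaitcntBrackets (assumption): one pending-operation count.
struct State {
  int Pending = 0;
  // Merge Other into this state; returns true if this state changed.
  bool merge(const State &Other) {
    int Old = Pending;
    Pending = std::max(Pending, Other.Pending);
    return Pending != Old;
  }
};

struct Block {
  int Issued;                     // operations the block adds (hypothetical)
  bool Waits;                     // block starts with a wait that clears Pending
  std::vector<std::size_t> Succs; // successor block indices
};

// Transfer function: entry state -> exit state. The counter saturates at 7
// so that the iteration below terminates on loops.
State process(const Block &B, State In) {
  if (B.Waits)
    In.Pending = 0;
  In.Pending = std::min(In.Pending + B.Issued, 7);
  return In;
}

// Current shape: the saved state is the block's ENTRY state, and the exit
// state is merged into each successor. Returns entry states.
std::vector<State> runPush(const std::vector<Block> &CFG) {
  std::vector<State> Entry(CFG.size());
  for (bool Changed = true; Changed;) {
    Changed = false;
    for (std::size_t I = 0; I < CFG.size(); ++I) {
      State Out = process(CFG[I], Entry[I]);
      for (std::size_t S : CFG[I].Succs)
        Changed |= Entry[S].merge(Out);
    }
  }
  return Entry;
}

// Alternative shape: the saved state is the block's EXIT state, and the entry
// state is rebuilt by merging the predecessors. Returns exit states.
std::vector<State> runPull(const std::vector<Block> &CFG) {
  std::vector<std::vector<std::size_t>> Preds(CFG.size());
  for (std::size_t I = 0; I < CFG.size(); ++I)
    for (std::size_t S : CFG[I].Succs)
      Preds[S].push_back(I);

  std::vector<State> Exit(CFG.size());
  for (bool Changed = true; Changed;) {
    Changed = false;
    for (std::size_t I = 0; I < CFG.size(); ++I) {
      State In;
      for (std::size_t P : Preds[I])
        In.merge(Exit[P]); // each predecessor's state is visible separately here
      State Out = process(CFG[I], In);
      if (Out.Pending != Exit[I].Pending) {
        Exit[I] = Out;
        Changed = true;
      }
    }
  }
  return Exit;
}

// Block 0: preheader, block 1: single-block loop, block 2: exit with a wait.
std::vector<Block> makeExampleCFG() {
  return {{1, false, {1}}, {2, false, {1, 2}}, {0, true, {}}};
}

// True if applying process to the push-style entry states reproduces the
// pull-style exit states for every block.
bool pushAndPullAgree(const std::vector<Block> &CFG) {
  std::vector<State> Entry = runPush(CFG);
  std::vector<State> Exit = runPull(CFG);
  for (std::size_t I = 0; I < CFG.size(); ++I)
    if (process(CFG[I], Entry[I]).Pending != Exit[I].Pending)
      return false;
  return true;
}
```

In the pull form, the preheader's exit state and the backedge's exit state are both still available, unmerged, when the loop header is processed. That is the point at which a preheader-versus-backedge comparison could be made, if the real WaitcntBrackets state turns out to carry enough information for it.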

Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D115747