[PATCH] D115747: [AMDGPU] Hoist waitcnt out of loops when they unnecessarily wait for stores

Thu Feb 17 09:31:04 PST 2022

bsaleil added a comment.

I agree that the code is a lot more complicated with this patch than it was with the previous one. I don't think this improvement can be implemented just by looking at the WaitcntBrackets; it does require all this refactoring.
So this means we have two choices:

1. Original patch: Before inserting the waitcnts, visit the loop instructions a single time (no fixed-point), stopping either when all of them have been visited or when we find an instruction that invalidates the optimization. Depending on what we found, flush in the preheader or not before generating the waitcnts. This is a lot simpler.
2. New patch with refactoring: No need to visit the instructions before inserting the waitcnts, but we need to compute two brackets for each block, and keep two waitcnt lists until we decide which one we want to generate. This is a lot more complicated.
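
To make choice 1 concrete, here is a rough, self-contained sketch of the single-pass scan. It is not the code from either patch: `Kind`, `ScanResult` and `scanLoop` are hypothetical names, the instructions are a toy model rather than MachineInstrs, and the rule that any load in the loop invalidates the optimization is an assumption made only for illustration.

```cpp
#include <vector>

// Hypothetical, simplified instruction kinds; not LLVM's MachineInstr.
enum class Kind { Store, Load, Other };

struct ScanResult {
  bool FlushInPreheader; // true if we would flush in the loop preheader
  unsigned Visited;      // number of instructions inspected before stopping
};

// Visit each loop instruction at most once (no fixed-point iteration).
// Assumption for this sketch: a Load invalidates the optimization, while
// stores on their own do not.
ScanResult scanLoop(const std::vector<Kind> &LoopBody) {
  unsigned Visited = 0;
  bool SawStore = false;
  for (Kind K : LoopBody) {
    ++Visited;
    if (K == Kind::Load)
      return {false, Visited}; // Early exit: the optimization is invalid.
    if (K == Kind::Store)
      SawStore = true;
  }
  // Flushing in the preheader only pays off if the loop contains a store.
  return {SawStore, Visited};
}
```

The cost of the scan is bounded by the number of instructions in the loop, and it stops at the first invalidating instruction, which is why I expect its compile-time impact to be small.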

The original motivation for working on 2. was concern about the compile-time impact of 1., but because 2. needs to compute two brackets for each block, I don't actually think it is more efficient. Both the GFX9 and GFX10 improvements can be implemented with either 1. or 2. So for the sake of simplicity, I think I should go back to the original patch.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D115747/new/

https://reviews.llvm.org/D115747