[PATCH] D115747: [AMDGPU] Hoist waitcnt out of loops when they unnecessarily wait for stores

Nicolai Hähnle via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Fri Dec 17 07:47:53 PST 2021


nhaehnle added a comment.

I quite agree with Jay's point about looking at waitcnt brackets. In addition, this isn't a vmcnt/vscnt-specific issue. Consider a loop like this:

  y = load(...);
  x = load(...);
  loop {
    // (A)
    use(x);
    ... lots of code ...
    // (B)
    use(y);
    y = load(...);
  }

The current algorithm inserts an `s_waitcnt vmcnt(0)` at (A). It would be better to insert an `s_waitcnt vmcnt(0)` in the preheader and at (B) instead.
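
For concreteness, that placement would look roughly like this (a sketch only; the counts the pass actually computes may be tighter than vmcnt(0)):

  y = load(...);
  x = load(...);
  s_waitcnt vmcnt(0)    // hoisted into the preheader: x and y are ready on entry
  loop {
    // (A) -- no wait needed here anymore
    use(x);
    ... lots of code ...
    // (B)
    s_waitcnt vmcnt(0)  // only waits for the y load issued in the previous iteration
    use(y);
    y = load(...);
  }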

It's not obvious to me what the best approach is here.

A brute-force approach is to rotate the overall processing loop in the way Jay suggests. At loop headers, instead of merging all predecessors immediately, we merge the edges coming into the loop and the latch edges separately, then process the basic block with both brackets in parallel and compare. We make a decision at the end of the block, or as soon as it becomes clear that merging all predecessors with a "flush" in the preheader would force insertion of an overly aggressive s_waitcnt. Except that this approach could be fooled by control flow inside the loop, so it's certainly not perfect.
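
To make the shape of that a bit more concrete, here is a standalone, heavily simplified sketch; `Bracket`, `Inst`, `countForcedWaits` and `preferPreheaderFlush` are made-up names, not the pass's actual data structures:

  // Standalone, simplified sketch: none of these names exist in SIInsertWaitcnts.
  // The idea is to walk the loop header once with the "merge everything" bracket
  // and once with the latch-only bracket (i.e. assuming a flush in the preheader),
  // count how many waits each would force, and pick the cheaper option.
  #include <algorithm>
  #include <cstdint>
  #include <map>
  #include <vector>

  // Per-register score of the youngest outstanding memory event writing it
  // (0 or absent = nothing outstanding). A stand-in for the real brackets.
  struct Bracket {
    std::map<unsigned, uint32_t> RegScore;

    static Bracket merge(const Bracket &A, const Bracket &B) {
      Bracket Out = A;
      for (const auto &Entry : B.RegScore)
        Out.RegScore[Entry.first] =
            std::max(Out.RegScore[Entry.first], Entry.second);
      return Out;
    }
  };

  // Minimal instruction model: just the registers an instruction reads.
  struct Inst {
    std::vector<unsigned> Uses;
  };

  // Count the s_waitcnt insertions a given incoming bracket would force when
  // walking the block once; a wait is modeled as a full flush.
  static unsigned countForcedWaits(Bracket B, const std::vector<Inst> &Block) {
    unsigned Waits = 0;
    for (const Inst &I : Block) {
      bool NeedWait = false;
      for (unsigned Reg : I.Uses)
        if (B.RegScore.count(Reg) && B.RegScore[Reg] != 0)
          NeedWait = true;
      if (NeedWait) {
        ++Waits;
        B.RegScore.clear();
      }
    }
    return Waits;
  }

  // Decision for a loop header: flush in the preheader only if processing the
  // header with just the latch bracket forces fewer waits than the full merge.
  bool preferPreheaderFlush(const Bracket &Preheader, const Bracket &Latch,
                            const std::vector<Inst> &Header) {
    Bracket Merged = Bracket::merge(Preheader, Latch);
    return countForcedWaits(Latch, Header) < countForcedWaits(Merged, Header);
  }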

An alternative idea is to record, for each GPR, whether it is live and, if so, how many instructions (clocks?) remain until its first use. That can then be compared against the scores in the brackets to figure out whether merging the preheader bracket would cause earlier s_waitcnt instructions. This determination could account for control flow internal to the loop, but of course it then becomes yet another fixed-point iteration, so...
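
Roughly, and again with made-up names (`firstUseDistance`, `mergeIsHarmless`, `RemainingLatency`), ignoring control flow inside the loop:

  // Standalone sketch assuming a straight-line loop body. Handling control flow
  // inside the loop is exactly the part that would turn this into another
  // fixed-point computation.
  #include <limits>
  #include <map>
  #include <vector>

  // Minimal instruction model: just the registers an instruction reads.
  struct Inst {
    std::vector<unsigned> Uses;
  };

  // Distance (in instructions) from the top of the loop body to the first use
  // of each register; registers never used in the body are simply absent.
  std::map<unsigned, unsigned> firstUseDistance(const std::vector<Inst> &Body) {
    std::map<unsigned, unsigned> Dist;
    for (unsigned Idx = 0; Idx < Body.size(); ++Idx)
      for (unsigned Reg : Body[Idx].Uses)
        Dist.try_emplace(Reg, Idx); // keeps the first (smallest) distance
    return Dist;
  }

  // Hypothetical profitability test: merging the preheader bracket is harmless
  // if every register it still has "in flight" is first used far enough into
  // the loop that the event would have retired by then. `RemainingLatency`
  // maps such a register to an assumed instruction count until completion.
  bool mergeIsHarmless(const std::map<unsigned, unsigned> &RemainingLatency,
                       const std::vector<Inst> &Body) {
    std::map<unsigned, unsigned> Dist = firstUseDistance(Body);
    for (const auto &Entry : RemainingLatency) {
      auto It = Dist.find(Entry.first);
      unsigned D = (It == Dist.end()) ? std::numeric_limits<unsigned>::max()
                                      : It->second;
      if (D < Entry.second)
        return false; // merging would force an earlier s_waitcnt
    }
    return true;
  }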


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D115747/new/

https://reviews.llvm.org/D115747


