[PATCH] D34716: [AMDGPU] Add pseudo "old" and "wqm_mode" source to all DPP instructions

Connor Abbott via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Wed Jun 28 16:29:35 PDT 2017


cwabbott added a comment.

In https://reviews.llvm.org/D34716#794465, @nhaehnle wrote:

> So, one thing that's not clear to me is the semantics of how the update.dpp intrinsic is supposed to enable WQM or WWM. In your sequence of instructions, if you just put a WQM/WWM flag on the update.dpp intrinsic, how does LLVM know whether the regular ALU intrinsics in between should run in WQM/WWM or not?
>
> Tim had an interesting proposal for that, which involved a pair of intrinsics:
>
> llvm.amdgcn.helpervalue(src, helpervalue) --> returns src for active lanes and helpervalue for other lanes


In https://reviews.llvm.org/D34719, I added llvm.amdgcn.set.inactive, which does exactly what you describe. I left that out of the example in my comment, but you can see it in the Mesa implementation I posted (in particular, look at ac_build_reduce(), ac_build_inclusive_scan(), and ac_build_exclusive_scan()).
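To make the lane semantics concrete, here is a toy Python model of what llvm.amdgcn.set.inactive does (a sketch of the semantics only; the list-of-lanes and exec-mask representation is invented for illustration and has nothing to do with the actual codegen):

```python
def set_inactive(src, helper, exec_mask):
    """Toy model of llvm.amdgcn.set.inactive: each active lane keeps its
    value from src, while every inactive lane is filled with helper."""
    return [s if active else helper for s, active in zip(src, exec_mask)]

# 4-lane wave for illustration; lanes 1 and 3 are inactive
exec_mask = [True, False, True, False]
src = [10, 20, 30, 40]
print(set_inactive(src, 0, exec_mask))  # [10, 0, 30, 0]
```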

> llvm.amdgcn.wwm(src) --> returns src for active lanes and undefined/poison (my choice of words, not Tim's) for other lanes, but guarantees that the computations leading to src are executed "as-if" in WWM.
> 
> llvm.amdgcn.wqm(src) --> analogous
> 
> I'm writing "as-if", because not **all** computations leading up to src actually need to be in WWM: llvm.amdgcn.helpervalue can act as a "barrier" to the propagation of WWM. So if you think of the graph of WWM computations, .helpervalue acts as a source, and .wwm acts as a sink.

Hmm, this might be an interesting approach. I think that setting wqm_ctrl to WQM on a DPP instruction is essentially equivalent to calling llvm.amdgcn.wqm on the result and then replacing all uses with the result of llvm.amdgcn.wqm (and similarly for llvm.amdgcn.wwm). I can see how having a separate pseudo-instruction might be a little cleaner though. And it would be nice for us to stop pretending that we can figure out what needs WWM/WQM based on the instruction itself, since it does very much depend on what you're using the instruction for and what the API demands.
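To illustrate the source/sink pairing in a self-contained toy model (names and the list-of-lanes representation are invented here; a real implementation would build the reduction out of DPP row/bank shifts, as in ac_build_reduce()): set.inactive acts as the source by seeding inactive lanes with an identity value, so a whole-wave reduction under wwm still computes the right answer for the active lanes.

```python
def set_inactive(src, helper, exec_mask):
    # Model of llvm.amdgcn.set.inactive: helper value in inactive lanes.
    return [s if active else helper for s, active in zip(src, exec_mask)]

def wwm_sum(values):
    # Model of a reduction done under llvm.amdgcn.wwm: it runs with all
    # lanes enabled, so it sums the whole vector unconditionally.
    return sum(values)

exec_mask = [True, False, True, True]
src = [1, 2, 3, 4]
# Seeding inactive lanes with the identity (0 for addition) makes the
# whole-wave sum equal the sum over active lanes only: 1 + 3 + 4 = 8.
total = wwm_sum(set_inactive(src, 0, exec_mask))
print(total)  # 8
```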

One thing that strikes me is that while your definition is sufficient for WWM, it isn't for WQM -- for derivatives, GL says that we actually do have to care about the values of things in helper invocations. The program has to behave as if it's always in WQM, except for loads and stores, so just assuming that helper lanes are undefined/poison isn't valid for any computation that doesn't involve loads from memory. I think we can just strengthen the definition of llvm.amdgcn.wqm a little, to say that helper lanes must have the correct value, as if everything were computed in WQM.
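To see why the stronger definition matters, consider a derivative, where a helper lane's value is actually read by an active lane. A toy model (the 2-lane pairing and all names here are invented for illustration):

```python
def ddx_coarse(values):
    # Model of a coarse x-derivative: each 2-lane pair computes the
    # difference with its neighbour, including lanes that belong to
    # helper invocations.
    out = []
    for i in range(0, len(values), 2):
        d = values[i + 1] - values[i]
        out += [d, d]
    return out

# Suppose lane 1 is a helper invocation. Under the strengthened
# llvm.amdgcn.wqm definition it must hold the value it would have had in
# WQM (3.0 here), not undef/poison -- otherwise lane 0's derivative
# would be garbage even though lane 0 is active.
values_as_if_wqm = [1.0, 3.0, 2.0, 5.0]
print(ddx_coarse(values_as_if_wqm))  # [2.0, 2.0, 3.0, 3.0]
```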

Also, with these two intrinsics, we still wouldn't be able to express that some computation must happen in exact mode. This matters for DPP instructions and store instructions with side effects. I'm not sure if we'll ever want to use DPP instructions in exact mode, but we definitely need to care about store instructions. I guess we can just keep the current logic for making sure that stores are executed in exact mode, although it certainly seems kinda hack-ish, especially if the goal is to get rid of special assumptions about instructions needing WQM/WWM/Exact.

> I think this proposal goes a long way towards clarifying which operations actually need WQM/WWM. One issue that occurred to me today is that the semantics are unclear when control flow is involved. Two basic examples to think about:
> 
>   v = some computation
>   if (cond) {
>      t1 = f(v)
>      r1 = wwm(t1)
>   } else {
>      t2 = f(v)
>      r2 = wwm(t2)
>   }
> 
> 
> I believe the desirable semantics here are clear, though they may require some compiler work. Basically, you want the entire vector of v to be equal at the start of both blocks. This requires ensuring that no part of it gets overwritten during the first block we go through.

I think the extra edge will already guarantee that that's the case. And we certainly already have similar problems with WQM, where you have to consider v live during the first block in case some WQM operation clobbers it.

> The much more problematic case is:
> 
>   if (cond) {
>     v1 = ...
>   } else {
>     v2 = ...
>   }
>   v = wwm(phi(v1, v2))
> 
> 
> What does v look like? Specifically, what's in the inactive lanes? Perhaps the best thing we can do is say that the active lanes come from the predecessor block they went through, and all the other lanes come from one of the two blocks, though it is undefined which one.

If you take the "as-if" semantics to heart, then the inactive lanes should have the value they would have if the whole program were executed in WWM -- that is, the block they come from should depend on what the value of cond would be if we executed the entire thing in WWM. In fact, if you replace "WWM" with "WQM" everywhere, then GL already mandates this behavior, and we implement it in the existing WQM pass. I chose not to implement it in WWM, since we're only ever generating WWM things ourselves with a matching llvm.amdgcn.set.inactive that tightly contains the "WWM-ness", and I doubt we'll ever need to care about these types of examples.
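The "as-if" rule for the phi example can be stated as a one-liner in a toy lane model (all names invented for illustration): running "as-if" in WWM just means every lane, active or not, evaluates cond for itself and takes the corresponding branch's value.

```python
def phi_as_if_wwm(cond_per_lane, v1_per_lane, v2_per_lane):
    # "As-if WWM" semantics for v = phi(v1, v2): each lane -- including
    # inactive ones -- selects the branch its own cond value would have
    # taken had the whole program run with all lanes enabled.
    return [v1 if c else v2
            for c, v1, v2 in zip(cond_per_lane, v1_per_lane, v2_per_lane)]

cond = [True, False, True, False]
v1 = [10, 11, 12, 13]   # values the "then" block would compute per lane
v2 = [-1, -2, -3, -4]   # values the "else" block would compute per lane
print(phi_as_if_wwm(cond, v1, v2))  # [10, -2, 12, -4]
```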


https://reviews.llvm.org/D34716




