[PATCH] D85603: IR: Add convergence control operand bundle and intrinsics

Sameer Sahasrabuddhe via Phabricator via llvm-commits <llvm-commits at lists.llvm.org>
Fri Nov 6 01:17:46 PST 2020


sameerds added a comment.

In D85603#2361168 <https://reviews.llvm.org/D85603#2361168>, @jlebar wrote:

> - Will this paint us into a corner wrt CUDA, and specifically sm70+?
>
> /me summons @wash, who is probably a better person to speak to this than me.
>
> My understanding is that the semantics of convergent on <sm70 are pretty similar to what is described in these examples.  But starting with sm70, each sync operation takes an arg specifying which threads in the warp participate in the instruction.
>
> I admit I do not fully understand what the purpose of this is.  At one point in time I thought it was to let humans write (or compilers generate) code like this, where the identity of the convergent instruction does not matter.
>
>   // Warning, does not seem to work on sm75
>   if (cond)
>     __syncwarp(FULL_MASK);
>   else
>     __syncwarp(FULL_MASK);
>
> but my testcase, https://gist.github.com/50d1b5fedc926c879a64436229c1cc05, dies with an illegal-instruction error (715) when I make `cond` have different values within the warp.  So, guess not?
>
> Anyway, clearly I don't fully understand the sm70+ convergence semantics.  I'd ideally like someone from nvidia (hi, @wash) to speak to whether we can represent their convergent instruction semantics using this proposal.  Then we should also double-check that clang can in fact generate the relevant LLVM IR.

To extrapolate from Vinod's answer, I would say that we can represent sm70+ convergence semantics with this proposal. The situation seems to be covered by the examples in the spec's section on hoisting and sinking. Consider the following example, copied from the spec:

  define void @example(...) convergent {
    %entry = call token @llvm.experimental.convergence.entry()
    %data = ...
    %id = ...
    if (condition) {
      %shuffled = call i32 @subgroupShuffle(i32 %data, i32 %id) [ "convergencectrl"(token %entry) ]
      ...
    }
  }

Here, hoisting the call to subgroupShuffle() out of the conditional is disallowed in general because its result depends on which threads are active when it executes. A CUDA builtin with a mask argument similarly names specific threads that must be active at the set of textually unaligned calls that synchronize with each other, so any change to the control flow surrounding those calls is likewise disallowed without further information. At the same time, the new representation doesn't seem to restrict a more informed optimizer that can predict how the set of active threads evolves. A fully expanded version of the example is sketched below.
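For concreteness, here is the spec example written out as complete LLVM IR. This is only a minimal sketch: the `i1 %cond` parameter, the explicit basic blocks, and the declaration of @subgroupShuffle are assumptions filling in the parts that the spec excerpt elides.

  declare token @llvm.experimental.convergence.entry()
  declare i32 @subgroupShuffle(i32, i32) convergent

  define void @example(i32 %data, i32 %id, i1 %cond) convergent {
  entry:
    ; The entry intrinsic sits in the entry block; its token anchors later
    ; convergent calls to the set of threads that entered @example.
    %tok = call token @llvm.experimental.convergence.entry()
    br i1 %cond, label %then, label %exit

  then:
    ; The shuffle communicates only among the threads that reach this block.
    ; Hoisting it into %entry would enlarge the set of communicating threads
    ; and change the result, which is why the transform is disallowed by
    ; default.
    %shuffled = call i32 @subgroupShuffle(i32 %data, i32 %id) [ "convergencectrl"(token %tok) ]
    br label %exit

  exit:
    ret void
  }

The %shuffled value is left unused only to keep the sketch short; what matters is the position of the call relative to the branch, and the token that ties it back to the entry intrinsic.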


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D85603/new/

https://reviews.llvm.org/D85603


