[PATCH] D152033: TTI: Pass function to hasBranchDivergence in a few passes

Sameer Sahasrabuddhe via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Tue Jun 6 06:13:22 PDT 2023


sameerds accepted this revision.
sameerds added a comment.
This revision is now accepted and ready to land.

LGTM, provided @arsenm agrees with the comments about the speculative execution pass.



================
Comment at: llvm/test/Transforms/SpeculativeExecution/single-lane-execution.ll:14
+; CHECK: a:
+; CHECK: %x = add i32 2, 3
+a:
----------------
tra wrote:
> sameerds wrote:
> > sameerds wrote:
> > > arsenm wrote:
> > > > sameerds wrote:
> > > > > Shouldn't this have been moved to the entry block??
> > > > No, the point is that it wasn't, because the pass is acting as if this were a non-divergent target.
> > > > 
> > > > The spec-exec-only-if-divergent-target flag doesn’t really make sense to me though
> > > From the pass implementation itself, it seems this pass was introduced specifically for "targets where branches are expensive", especially GPUs. But does this cost come from the branch instruction itself, or the EXEC masking that we have to do around divergent branches? If it is the former, then I am guessing it doesn't matter if only a single thread is running; the branch on a GPU is still expensive. If that is correct, then for this one optimization modelling a single thread as a "non-divergent target" is not useful, and we should always speculate if the raw target has divergence.
> > Oh, there's more in the implementation. It talks about how speculating a load is beneficial when the appropriate addressing mode is not available in the hardware. So essentially this pass is trying to help with hardware that lacks the usual CPU-like capabilities, approximating that as "target has divergence". It's not about divergence at all, but about the weak hardware typically found in GPUs.
> > But does this cost come from the branch instruction itself, or the EXEC masking that we have to do around divergent branches? 
> 
> Speaking for NVPTX back-end here.
> 
> Uniform branches are relatively expensive, but not prohibitively so (e.g., for small conditional blocks, predicated execution may be faster).
> Divergent branches, on the other hand, effectively serialize execution across threads in a warp and can result in almost two orders of magnitude slowdowns. We also must keep control flow structured around divergent branches to allow the threads to re-converge at some point. When we know that only one thread is running, then there's no possibility for any branch to diverge and that is equivalent to "we don't care about divergence here" which should give LLVM more freedom to optimize.
> 
> Potentially divergent branches will result in additional glue code to assist with scheduling execution and reconvergence of divergent threads, which will be more expensive even if we never actually diverge at runtime. Knowing that some code path never diverges allows using `bra.uni` which is just a branch w/o re-convergence glue and is cheaper.
> 
> I assume AMDGPU behaves similarly.
> 
> 
> When we know that only one thread is running, then there's no possibility for any branch to diverge and that is equivalent to "we don't care about divergence here" which should give LLVM more freedom to optimize.

I assume this means that when we know that only a single thread is running, all the optimizations that this pass exposes (like working around the lack of an addressing mode with offset calculations) are also possible with the rest of LLVM. In that case, it should be okay to disable this pass when the launch size is known to be 1.
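For context, here is a minimal sketch of the hoisting SpeculativeExecution performs. This is hypothetical IR modelled on the test quoted above, not code from the patch; on a divergent target the pass would hoist the cheap `add` from the conditional block into the preceding block, whereas the test checks that it stays put when the target reports no branch divergence:

```llvm
; Before: %x is computed only when %cond is true.
define i32 @example(i1 %cond) {
entry:
  br i1 %cond, label %a, label %b

a:                                  ; hoisting candidate
  %x = add i32 2, 3
  br label %b

b:
  %p = phi i32 [ %x, %a ], [ 0, %entry ]
  ret i32 %p
}

; After speculation, %x is computed unconditionally:
;   entry:
;     %x = add i32 2, 3
;     br i1 %cond, label %a, label %b
```

Speculating the `add` is safe because it has no side effects; the question in this thread is only whether doing so is profitable when the target is modelled as non-divergent.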


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D152033/new/

https://reviews.llvm.org/D152033


