[llvm] [NVPTX] Teach NVPTX about predicates (PR #67468)

Tue Oct 3 10:52:19 PDT 2023

Artem-B wrote:

> * We don't know what `ptxas` can reliably do regarding if-conversion et al, and having the ability to control this in llvm gives us more power to affect code generation.

We can certainly observe what it does. You can dump SASS with cuobjdump and nvdisasm, and the latter can even conveniently produce the control flow graph.
All I'm saying is that I have yet to see a practical case where the performance of LLVM-produced code was suboptimal due to predication vs. branches. I regularly get to poke at various issues with LLVM-generated code, and predication/jumps are never the culprit, except for the ancient ptxas bug with thread mis-convergence (https://bugs.llvm.org/show_bug.cgi?id=27738).

> * I _have_ seen ptxas if-convert very trivial cases in the wild, but llvm will have more information and can better reason about the control flow graph because it has more information.

Agreed that LLVM has more info. Do you have examples where ptxas should've used predicated execution but didn't?

> * In my testing I've seen that divergent control flow can still be very expensive. 

Yes, divergent execution is expensive.

Yet predication is not a universal win, either. For large enough branches, predication will not solve the problem, and for small branches, ptxas may already be doing a good enough job.

For what it's worth, NVCC appears to prefer generating jumps, but SASS ends up using predication: https://godbolt.org/z/4f5Phd4xq

I think it's fairly safe to assume that NVIDIA would be very interested in squeezing as much performance out of the GPUs as they can. The fact that NVCC is rather conspicuously *not* using predicates in PTX, even for such an obvious case as a ternary operator, suggests that there may be a good reason for it. I'll ask them.
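For anyone who wants to repeat this kind of comparison locally, here is a minimal sketch of the sort of code I mean. It is a hypothetical example, not the code behind the godbolt link above; the names `pick_ternary`, `pick_branch` and `pick_kernel` are made up for illustration. The two helpers are semantically identical, so any difference in the generated PTX or SASS comes from the compiler's choices and not from the source.

```cpp
// Hypothetical illustration only; not the code from the godbolt link.
#ifndef __CUDACC__
// Let the same file build with a plain host C++ compiler as well.
#define __host__
#define __device__
#define __global__
#endif

// Ternary form: a natural candidate for a select or a predicated instruction.
__host__ __device__ inline int pick_ternary(int x, int a, int b) {
    return (x > 0) ? a : b;
}

// Branch form: same semantics, written with explicit control flow.
__host__ __device__ inline int pick_branch(int x, int a, int b) {
    int r;
    if (x > 0) {
        r = a;
    } else {
        r = b;
    }
    return r;
}

#ifdef __CUDACC__
// One thread per element; the bounds check is itself a (usually uniform) branch.
__global__ void pick_kernel(const int *in, int *out_t, int *out_b,
                            int n, int a, int b) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out_t[i] = pick_ternary(in[i], a, b);
        out_b[i] = pick_branch(in[i], a, b);
    }
}
#endif
```

Assuming the file is saved as `pick.cu` (a made-up name), `nvcc -ptx pick.cu` shows the PTX, `nvcc -cubin pick.cu` followed by `cuobjdump -sass pick.cubin` shows the SASS, and `nvdisasm -cfg pick.cubin` emits the control flow graph in dot format. What you actually see will depend on the CUDA version and the target architecture, so treat the output as an observation and not a guarantee.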

> NVIDIA's marketing documentation for Ampere suggests the hardware can now eliminate most of this, but the issues I'm looking at indicate that this is for simple cases only. Old hardware still suffers.

Can you point me to more details? I'm not sure I understand what you have in mind by Ampere eliminating divergent branches. IIRC, Ampere allowed concurrent execution of all divergent branches (previously divergent branches ran sequentially), thus guaranteeing progress, which was impossible on older GPUs, but I don't think it removes the concept of branch divergence.

> * I'm guessing PTX exposes generalized predicates for a reason. 

That remains to be seen. Switching to predicates just because PTX syntax allows them is not, by itself, a very compelling argument.

> Not adding them limits what we can do in the backend.

Can you be more specific about what you need to do in the back-end that can't be done without predication?

Just to be clear -- I'm not against the patch. Being able to use predicates may well be useful. However, it appears to be a fundamentally invasive change (both to the NVPTX back-end and to the PTX we'll generate, with potential unforeseen consequences), and I want to have a better idea of what problems it solves, what it buys us, and whether the benefits outweigh the downsides.

https://github.com/llvm/llvm-project/pull/67468