[llvm] [AMDGPU] - Generate s_bitreplicate_b64_b32 (PR #69209)

Mon Oct 23 10:19:36 PDT 2023

nhaehnle wrote:

The *intended* reasoning for why this generic transform is forbidden when `bitreplicate` is `convergent` is as follows.

Here's ConvergentOperations.rst:

> A convergent operation involves inter-thread communication or synchronization
> that occurs outside of the memory model, where the set of threads which
> participate in communication is implicitly affected by control flow.

It seems we never state this explicitly, but what it means is that even if the operation is `memory(none)`, each participating thread can still observe the arguments passed by all the other participating threads.

Let's say we have two threads, with `val1 = 1` and `val2 = 2` in both of them, and `cond` is false in thread 0 and true in thread 1. Then in the original program,
```llvm
  %res1 = bitreplicate(%val1)
  %res2 = bitreplicate(%val2)
  %res = select %cond, %res1, %res2
```
Thread 0 participates in two `bitreplicate`s. Both times, it sees one other thread participating. In the first bitreplicate, both threads provide the value 1; in the second, both threads provide the value 2.

In the transformed program,
```llvm
  %val = select %cond, %val1, %val2
  %res = bitreplicate(%val)
```
Thread 0 participates only in a single `bitreplicate`. That in itself is not a problem since the bitreplicate can't have side effects. But since `cond` is false in thread 0, the value it provides is 2, while the other thread provides the value 1. This is something that never happened in the original program, so without further knowledge about what `bitreplicate` does, a generic transform must assume that the resulting value of `res` may be different, and therefore the transform is not allowed.
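
To make this concrete, here is a rough Python sketch that simulates the two threads. It is only an illustration, and the names in it are made up for the example: `cross_thread_sum` is a hypothetical convergent operation that returns the sum of all participants' arguments, and `thread_local_bitreplicate` is a stand-in that duplicates each bit of its own argument, which is my understanding of what the instruction does. Each list holds one entry per thread.

```python
from typing import Callable, List

# One list entry per thread; an "op" sees every participant's argument.
Op = Callable[[List[int]], List[int]]


def select(cond: List[bool], a: List[int], b: List[int]) -> List[int]:
    """Per-thread select: pick a[i] where cond[i] is true, else b[i]."""
    return [x if c else y for c, x, y in zip(cond, a, b)]


def cross_thread_sum(args: List[int]) -> List[int]:
    """Hypothetical convergent op: every thread gets the sum of all arguments."""
    total = sum(args)
    return [total] * len(args)


def thread_local_bitreplicate(args: List[int]) -> List[int]:
    """Assumed semantics: duplicate each of the low 32 bits, per thread."""
    def rep(x: int) -> int:
        out = 0
        for i in range(32):
            if (x >> i) & 1:
                out |= 0b11 << (2 * i)
        return out
    return [rep(x) for x in args]


def original(op: Op, cond: List[bool], val1: List[int], val2: List[int]) -> List[int]:
    # %res1 = op(%val1); %res2 = op(%val2); %res = select %cond, %res1, %res2
    return select(cond, op(val1), op(val2))


def transformed(op: Op, cond: List[bool], val1: List[int], val2: List[int]) -> List[int]:
    # %val = select %cond, %val1, %val2; %res = op(%val)
    return op(select(cond, val1, val2))


COND = [False, True]  # thread 0, thread 1
VAL1 = [1, 1]
VAL2 = [2, 2]

for op in (cross_thread_sum, thread_local_bitreplicate):
    before = original(op, COND, VAL1, VAL2)
    after = transformed(op, COND, VAL1, VAL2)
    print(op.__name__, before, after, "same" if before == after else "DIFFERENT")
```

With the cross-thread operation the two programs disagree, because in the transformed program the threads contribute different values to the same call. With the purely thread-local operation they agree. A generic transform can't tell the two cases apart from the `convergent` attribute alone, which is why it has to be conservative.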

https://github.com/llvm/llvm-project/pull/69209