[llvm] [AMDGPU] - Generate s_bitreplicate_b64_b32 (PR #69209)

Mon Oct 23 07:54:04 PDT 2023

jayfoad wrote:

> > ```
> >   %res1 = bitreplicate(%val1)
> >   %res2 = bitreplicate(%val2)
> >   %res = select %cond, %res1, %res2
> > ```
> 
> When I write a test like that, I get the following assembly output:
> 
> ```
> 	s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
> 	s_bitreplicate_b64_b32 s[0:1], 0x85fe3a92          ; %res1
> 	v_dual_mov_b32 v1, s0 :: v_dual_and_b32 v0, 1, v0
> 	v_mov_b32_e32 v2, s1
> 	s_bitreplicate_b64_b32 s[2:3], 0x3a9285fe         ; %res2
> 	v_cmp_eq_u32_e32 vcc_lo, 1, v0
> 	v_cndmask_b32_e32 v0, s2, v1, vcc_lo
> 	v_cndmask_b32_e32 v1, s3, v2, vcc_lo
> 	s_setpc_b64 s[30:31]
> ```
> 
> So, in the end `v1` has the value of `%res`, right? Is that not correct?

Yes, that output is fine. My point is that a generic IR optimization would be _allowed_ to transform your IR into the following IR (even if bitreplicate is marked as convergent):
```
  %val = select %cond, %val1, %val2
  %res = bitreplicate(%val)
```
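
To illustrate why the two forms compute the same value, here is a rough Python model. It is only a sketch: it assumes `bitreplicate` duplicates each of the 32 input bits into two adjacent bits of a 64-bit result, and the helper names `select_after` and `select_before` are made up for this example, not anything in LLVM.

```python
def bitreplicate(val: int) -> int:
    """Assumed semantics: bit i of the 32-bit input becomes bits 2*i and 2*i+1."""
    result = 0
    for i in range(32):
        bit = (val >> i) & 1
        result |= (bit * 0b11) << (2 * i)
    return result


def select_after(cond: bool, val1: int, val2: int) -> int:
    # Original form: replicate both values, then select between the results.
    res1 = bitreplicate(val1)
    res2 = bitreplicate(val2)
    return res1 if cond else res2


def select_before(cond: bool, val1: int, val2: int) -> int:
    # Transformed form: select the input first, then replicate once.
    val = val1 if cond else val2
    return bitreplicate(val)


if __name__ == "__main__":
    for cond in (True, False):
        a = select_after(cond, 0x85FE3A92, 0x3A9285FE)
        b = select_before(cond, 0x85FE3A92, 0x3A9285FE)
        print(cond, hex(a), a == b)
```

Under that model both forms agree for either value of `%cond`, which is why nothing stops a generic pass from choosing the second one.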

https://github.com/llvm/llvm-project/pull/69209