[llvm] [AMDGPU][MC] Disallow op_sel in v_dot4 and v_dot8 with 4- or 8-bit packed data (PR #100485)

Mon Oct 7 12:48:14 PDT 2024

DadSchoorse wrote:

> However, @DadSchoorse reports some semantics on how op_sel might work on these instructions. Is that claim from experimentation on hardware, or something else?

From experimentation. For example, on gfx10.3 and gfx11 the hw behavior of `v_dot4_u32_u8` is:
```
uint32_t v0x = (uint8_t)(src0 >> (opsel[0] * 16));
uint32_t v0y = (uint8_t)((src0 >> (opsel[0] * 16)) >>  8);
uint32_t v0z = (uint8_t)(src0 >> (opsel_hi[0] * 16));
uint32_t v0w = (uint8_t)((src0 >> (opsel_hi[0] * 16)) >> 8);
uint32_t v1x = (uint8_t)(src1 >> (opsel[1] * 16));
uint32_t v1y = (uint8_t)((src1 >> (opsel[1] * 16)) >>  8);
uint32_t v1z = (uint8_t)(src1 >> (opsel_hi[1] * 16));
uint32_t v1w = (uint8_t)((src1 >> (opsel_hi[1] * 16)) >> 8);
dst = (v0x * v1x) + (v0y * v1y) + (v0z * v1z) + (v0w * v1w) + src2;
```

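src2 opsel is ignored; src0/src1 opsel works as a 16-bit swizzle: op_sel picks which 16-bit half supplies the first two bytes, and op_sel_hi picks which half supplies the last two. (I have no gfx9 hw to test on.)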

The note in the RDNA3 doc saying that opsel is ignored for inline constants is not correct:

```
v_mov_b32 v2, 0x04030201
v_dot4_u32_u8 v0, 1, v2, 0 op_sel:[0,0,0] op_sel_hi:[1,1,1]
// v0 is 1
v_dot4_u32_u8 v0, 1, v2, 0 op_sel:[1,0,0] op_sel_hi:[1,1,1]
// v0 is 0
v_dot4_u32_u8 v0, 1, v2, 0 op_sel:[0,0,0] op_sel_hi:[0,1,1]
// v0 is 4
v_dot4_u32_u8 v0, 1, v2, 0 op_sel:[0,1,0] op_sel_hi:[1,1,1]
// v0 is 6
```
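
For anyone who wants to check the formula against those results without hardware, here is a minimal C sketch of it. The names `half16`, `dot4_u32_u8_model` and `check_reported_cases` are made up for illustration, and passing `op_sel`/`op_sel_hi` as bitmasks (bit i is element i of the assembler list) is an assumption of this sketch, not how the encoding is defined.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: pick the low (sel == 0) or high (sel == 1) 16-bit half. */
static uint32_t half16(uint32_t src, unsigned sel) {
    return (src >> (sel * 16)) & 0xFFFFu;
}

/* Model of the observed behavior. Bit i of op_sel / op_sel_hi corresponds to
 * element i of op_sel:[a,b,c] / op_sel_hi:[a,b,c]. The src2 bits are ignored. */
static uint32_t dot4_u32_u8_model(uint32_t src0, uint32_t src1, uint32_t src2,
                                  unsigned op_sel, unsigned op_sel_hi) {
    uint32_t a_lo = half16(src0, op_sel & 1u);           /* supplies v0x, v0y */
    uint32_t a_hi = half16(src0, op_sel_hi & 1u);        /* supplies v0z, v0w */
    uint32_t b_lo = half16(src1, (op_sel >> 1) & 1u);    /* supplies v1x, v1y */
    uint32_t b_hi = half16(src1, (op_sel_hi >> 1) & 1u); /* supplies v1z, v1w */

    uint32_t v0x = a_lo & 0xFFu, v0y = a_lo >> 8;
    uint32_t v0z = a_hi & 0xFFu, v0w = a_hi >> 8;
    uint32_t v1x = b_lo & 0xFFu, v1y = b_lo >> 8;
    uint32_t v1z = b_hi & 0xFFu, v1w = b_hi >> 8;

    return v0x * v1x + v0y * v1y + v0z * v1z + v0w * v1w + src2;
}

/* Replays the first three listed cases: src0 = inline constant 1,
 * src1 = 0x04030201, src2 = 0. */
static void check_reported_cases(void) {
    /* op_sel:[0,0,0] op_sel_hi:[1,1,1] */
    assert(dot4_u32_u8_model(1u, 0x04030201u, 0u, 0x0u, 0x7u) == 1u);
    /* op_sel:[1,0,0] op_sel_hi:[1,1,1] */
    assert(dot4_u32_u8_model(1u, 0x04030201u, 0u, 0x1u, 0x7u) == 0u);
    /* op_sel:[0,0,0] op_sel_hi:[0,1,1] */
    assert(dot4_u32_u8_model(1u, 0x04030201u, 0u, 0x0u, 0x6u) == 4u);
}
```

Only the first three cases are asserted. For the fourth (`op_sel:[0,1,0] op_sel_hi:[1,1,1]`) this transcription of the formula returns 3 rather than the listed 6, so that case is left out of the check.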

https://github.com/llvm/llvm-project/pull/100485