[PATCH] D124657: [AMDGPU][ISEL] Directly custom lower INSERT_SUBVECTOR instead of via INSERT_VECTOR_ELT

Fri Apr 29 03:30:32 PDT 2022

hsmhsm marked an inline comment as not done.
hsmhsm added inline comments.

================
Comment at: llvm/lib/Target/AMDGPU/SIISelLowering.cpp:5801
+      // instead of again taking it through INSERT_VECTOR_ELT.
+      Vec = InsertVecElt(Vec, Elt, CurIdx, SL, DAG);
+    } else {
----------------
hsmhsm wrote:
> foad wrote:
> > I don't understand this. CurIdx is a ConstantSDNode here, so InsertVecElt will not do anything useful - it just returns SDValue().
> Hmm, I did not realize it. 
> 
> But, I was told that returning SDValue() means treat it as legal.
> 
> And, hence, no default expansion takes place which otherwise would introduce stack access?
> 
> I myself do not understand it - I need to further explore it in detail.
Here is what happens - let's consider INSERT_VECTOR_ELT lowering using below example.

```
define amdgpu_kernel void @insertelement_v2f16_0(<2 x i16> addrspace(1)* %out, <2 x i16> %a) {
  %vecins = insertelement <2 x i16> %a, i16 100, i32 0
  store <2 x i16> %vecins, <2 x i16> addrspace(1)* %out, align 16
  ret void
}
```

Here, since the index is constant, we skip custom lowering (via bit manipulation), which trigger default expansion, thereby,  INSERT_VECTOR_ELT will expand to VECTOR_SHUFFLE which builds new vector with the required element being properly inserted. In this case, we land-up efficiently selecting  S_PACK_LL_B32_B16 instruction. So the final assembly looks like below for gfx90a.

```
        s_load_dword s2, s[4:5], 0x8
        s_load_dwordx2 s[0:1], s[4:5], 0x0
        v_mov_b32_e32 v0, 0
        s_waitcnt lgkmcnt(0)
        s_pack_lh_b32_b16 s2, 0x64, s2
        v_mov_b32_e32 v1, s2
        global_store_dword v0, v1, s[0:1]
        s_endpgm
``` 

When the index is not constant, we cannot take VECTOR_SHUFFLE path, since we cannot do scalar_to_vector of dynamic index. Hence we take custom lowering via bit manipulation, and we land-up getting below assembly for gfx90a which looks bit inefficient compare to earlier one.

```
        s_load_dword s2, s[4:5], 0x8
        s_load_dwordx2 s[0:1], s[4:5], 0x0
        v_mov_b32_e32 v0, 0
        s_waitcnt lgkmcnt(0)
        s_and_b32 s2, s2, 0xffff0000
        s_or_b32 s2, s2, 0x64
        v_mov_b32_e32 v1, s2
        global_store_dword v0, v1, s[0:1]
        s_endpgm
```

Now, coming to this change w.r.t direct custom lowering of INSERT_SUBVECTOR handling, I think, I have totally missed above reasoning. I need to relook into it.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D124657/new/

https://reviews.llvm.org/D124657