[PATCH] D124657: [AMDGPU][ISEL] Directly custom lower INSERT_SUBVECTOR instead of via INSERT_VECTOR_ELT
Mahesha S via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Fri Apr 29 03:49:34 PDT 2022
hsmhsm marked an inline comment as not done.
hsmhsm added inline comments.
================
Comment at: llvm/lib/Target/AMDGPU/SIISelLowering.cpp:5801
+ // instead of again taking it through INSERT_VECTOR_ELT.
+ Vec = InsertVecElt(Vec, Elt, CurIdx, SL, DAG);
+ } else {
----------------
hsmhsm wrote:
> hsmhsm wrote:
> > foad wrote:
> > > I don't understand this. CurIdx is a ConstantSDNode here, so InsertVecElt will not do anything useful - it just returns SDValue().
> > Hmm, I did not realize that.
> >
> > But I was told that returning SDValue() means treating it as legal.
> >
> > And hence no default expansion takes place, which otherwise would introduce stack access?
> >
> > I do not understand it myself - I need to explore it further in detail.
> Here is what happens - let's consider INSERT_VECTOR_ELT lowering using the below example.
>
> ```
> define amdgpu_kernel void @insertelement_v2f16_0(<2 x i16> addrspace(1)* %out, <2 x i16> %a) {
> %vecins = insertelement <2 x i16> %a, i16 100, i32 0
> store <2 x i16> %vecins, <2 x i16> addrspace(1)* %out, align 16
> ret void
> }
> ```
>
> Here, since the index is constant, we skip the custom lowering (via bit manipulation), which triggers the default expansion: INSERT_VECTOR_ELT is expanded to VECTOR_SHUFFLE, which builds a new vector with the required element properly inserted. In this case, we end up efficiently selecting the S_PACK_LH_B32_B16 instruction. The final assembly for gfx90a looks like the below.
>
> ```
> s_load_dword s2, s[4:5], 0x8
> s_load_dwordx2 s[0:1], s[4:5], 0x0
> v_mov_b32_e32 v0, 0
> s_waitcnt lgkmcnt(0)
> s_pack_lh_b32_b16 s2, 0x64, s2
> v_mov_b32_e32 v1, s2
> global_store_dword v0, v1, s[0:1]
> s_endpgm
> ```
>
> When the index is not constant, we cannot take the VECTOR_SHUFFLE path, since we cannot do scalar_to_vector with a dynamic index. Hence we take the custom lowering via bit manipulation, and we end up with the below assembly for gfx90a, which looks a bit less efficient than the earlier one.
>
> ```
> s_load_dword s2, s[4:5], 0x8
> s_load_dwordx2 s[0:1], s[4:5], 0x0
> v_mov_b32_e32 v0, 0
> s_waitcnt lgkmcnt(0)
> s_and_b32 s2, s2, 0xffff0000
> s_or_b32 s2, s2, 0x64
> v_mov_b32_e32 v1, s2
> global_store_dword v0, v1, s[0:1]
> s_endpgm
> ```
>
> Now, coming to this change w.r.t. direct custom lowering of INSERT_SUBVECTOR, I think I have entirely missed the above reasoning. I need to relook into it.
Nevertheless, for a dynamic index we need the above custom lowering; otherwise there will be stack access, as below.
```
define amdgpu_kernel void @foo(<2 x i16> addrspace(1)* %out, <2 x i16> %a, i32 %idx) {
%vecins = insertelement <2 x i16> %a, i16 100, i32 %idx
store <2 x i16> %vecins, <2 x i16> addrspace(1)* %out, align 16
ret void
}
```
With the default expansion of INSERT_VECTOR_ELT, we get the below assembly for gfx90a.
```
s_add_u32 s0, s0, s7
s_load_dword s6, s[4:5], 0xc
s_load_dword s7, s[4:5], 0x8
s_addc_u32 s1, s1, 0
v_mov_b32_e32 v0, 4
s_load_dwordx2 s[4:5], s[4:5], 0x0
s_waitcnt lgkmcnt(0)
s_and_b32 s6, s6, 1
v_mov_b32_e32 v1, s7
s_lshl_b32 s6, s6, 1
buffer_store_dword v1, off, s[0:3], 0 offset:4
v_or_b32_e32 v0, s6, v0
v_mov_b32_e32 v1, 0x64
buffer_store_short v1, v0, s[0:3], 0 offen
buffer_load_dword v0, off, s[0:3], 0 offset:4
v_mov_b32_e32 v1, 0
s_waitcnt vmcnt(0)
global_store_dword v1, v0, s[4:5]
s_endpgm
```
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D124657/new/
https://reviews.llvm.org/D124657