[PATCH] D124657: [AMDGPU][ISEL] Directly custom lower INSERT_SUBVECTOR instead of via INSERT_VECTOR_ELT
Mahesha S via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Fri Apr 29 03:49:34 PDT 2022
hsmhsm marked an inline comment as not done.
hsmhsm added inline comments.
================
Comment at: llvm/lib/Target/AMDGPU/SIISelLowering.cpp:5801
+ // instead of again taking it through INSERT_VECTOR_ELT.
+ Vec = InsertVecElt(Vec, Elt, CurIdx, SL, DAG);
+ } else {
----------------
hsmhsm wrote:
> hsmhsm wrote:
> > foad wrote:
> > > I don't understand this. CurIdx is a ConstantSDNode here, so InsertVecElt will not do anything useful - it just returns SDValue().
> > Hmm, I did not realize that.
> >
> > But I was told that returning SDValue() means treating it as legal.
> >
> > And hence no default expansion takes place, which otherwise would introduce stack access?
> >
> > I do not understand it myself - I need to explore it further in detail.
> Here is what happens - let's consider INSERT_VECTOR_ELT lowering using the below example.
>
> ```
> define amdgpu_kernel void @insertelement_v2f16_0(<2 x i16> addrspace(1)* %out, <2 x i16> %a) {
> %vecins = insertelement <2 x i16> %a, i16 100, i32 0
> store <2 x i16> %vecins, <2 x i16> addrspace(1)* %out, align 16
> ret void
> }
> ```
>
> Here, since the index is constant, we skip the custom lowering (via bit manipulation), which triggers the default expansion: INSERT_VECTOR_ELT is expanded to VECTOR_SHUFFLE, which builds a new vector with the required element properly inserted. In this case, we end up efficiently selecting the S_PACK_LH_B32_B16 instruction. The final assembly for gfx90a looks like the below.
>
> ```
> s_load_dword s2, s[4:5], 0x8
> s_load_dwordx2 s[0:1], s[4:5], 0x0
> v_mov_b32_e32 v0, 0
> s_waitcnt lgkmcnt(0)
> s_pack_lh_b32_b16 s2, 0x64, s2
> v_mov_b32_e32 v1, s2
> global_store_dword v0, v1, s[0:1]
> s_endpgm
> ```
>
> When the index is not constant, we cannot take the VECTOR_SHUFFLE path, since we cannot do scalar_to_vector with a dynamic index. Hence we take the custom lowering via bit manipulation, and we end up with the below assembly for gfx90a, which looks a bit less efficient than the earlier one.
>
> ```
> s_load_dword s2, s[4:5], 0x8
> s_load_dwordx2 s[0:1], s[4:5], 0x0
> v_mov_b32_e32 v0, 0
> s_waitcnt lgkmcnt(0)
> s_and_b32 s2, s2, 0xffff0000
> s_or_b32 s2, s2, 0x64
> v_mov_b32_e32 v1, s2
> global_store_dword v0, v1, s[0:1]
> s_endpgm
> ```
>
> Now, coming to this change w.r.t. direct custom lowering of INSERT_SUBVECTOR, I think I have entirely missed the above reasoning. I need to relook into it.
Nevertheless, for a dynamic index we need the above custom lowering; otherwise there will be stack access, as below.
```
define amdgpu_kernel void @foo(<2 x i16> addrspace(1)* %out, <2 x i16> %a, i32 %idx) {
%vecins = insertelement <2 x i16> %a, i16 100, i32 %idx
store <2 x i16> %vecins, <2 x i16> addrspace(1)* %out, align 16
ret void
}
```
With the default expansion of INSERT_VECTOR_ELT, we get the below assembly for gfx90a.
```
s_add_u32 s0, s0, s7
s_load_dword s6, s[4:5], 0xc
s_load_dword s7, s[4:5], 0x8
s_addc_u32 s1, s1, 0
v_mov_b32_e32 v0, 4
s_load_dwordx2 s[4:5], s[4:5], 0x0
s_waitcnt lgkmcnt(0)
s_and_b32 s6, s6, 1
v_mov_b32_e32 v1, s7
s_lshl_b32 s6, s6, 1
buffer_store_dword v1, off, s[0:3], 0 offset:4
v_or_b32_e32 v0, s6, v0
v_mov_b32_e32 v1, 0x64
buffer_store_short v1, v0, s[0:3], 0 offen
buffer_load_dword v0, off, s[0:3], 0 offset:4
v_mov_b32_e32 v1, 0
s_waitcnt vmcnt(0)
global_store_dword v1, v0, s[4:5]
s_endpgm
```
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D124657/new/
https://reviews.llvm.org/D124657