[PATCH] D134433: [AMDGPU][GISel] Enable Matching of V2S16 G_BUILD_VECTOR

Thu Sep 29 07:50:41 PDT 2022

Petar.Avramovic added inline comments.

================
Comment at: llvm/test/CodeGen/AMDGPU/GlobalISel/extractelement.i16.ll:9
 define amdgpu_ps i16 @extractelement_sgpr_v4i16_sgpr_idx(<4 x i16> addrspace(4)* inreg %ptr, i32 inreg %idx) {
-; GCN-LABEL: extractelement_sgpr_v4i16_sgpr_idx:
-; GCN:       ; %bb.0:
-; GCN-NEXT:    s_load_dwordx2 s[0:1], s[2:3], 0x0
-; GCN-NEXT:    s_lshr_b32 s2, s4, 1
-; GCN-NEXT:    s_cmp_eq_u32 s2, 1
-; GCN-NEXT:    s_waitcnt lgkmcnt(0)
-; GCN-NEXT:    s_cselect_b32 s0, s1, s0
-; GCN-NEXT:    s_and_b32 s1, s4, 1
-; GCN-NEXT:    s_lshl_b32 s1, s1, 4
-; GCN-NEXT:    s_lshr_b32 s0, s0, s1
-; GCN-NEXT:    ; return to shader part epilog
+; GFX9-LABEL: extractelement_sgpr_v4i16_sgpr_idx:
+; GFX9:       ; %bb.0:
----------------
arsenm wrote:
> Pierre-vh wrote:
> > arsenm wrote:
> > > Pierre-vh wrote:
> > > > arsenm wrote:
> > > > > Pierre-vh wrote:
> > > > > > foad wrote:
> > > > > > > Pierre-vh wrote:
> > > > > > > > foad wrote:
> > > > > > > > > Using buffer_store/load instead of shifts for non-constant indices is a big regression.
> > > > > > > > This seems to be coming from the change in the legalizer of G_EXTRACT_VECTOR_ELT at line 1445. Reverting it fixes it.
> > > > > > > > It's because with that change, it gets lowered to the stack, so a G_STORE is emitted and it causes what we're seeing here.
> > > > > > > > 
> > > > > > > > However if I revert the change, it becomes impossible to match mad_mix in `v_mad_mix_v2f32_clamp_postcvt_lo`, its G_INSERT_VECTOR_ELT/G_EXTRACT_VECTOR_ELT get lowered to the point that we can't even match the CLAMP.
> > > > > > > > 
> > > > > > > > Perhaps I can add a condition to not bitcast if it's a 2x16 vector (`vectorWiderThan(VecTypeIdx, 32)`) ? It works, but seems like a bandaid fix though, no?
> > > > > > > > @arsenm thoughts?
> > > > > > > I would hope that it only gets lowered to memory accesses if the index is //non-constant//, and for matching mad_mix you are only interested in //constant// indices. Does that give you a way to fix both cases?
> > > > > > `LegalityQuery` does not seem to contain any information about whether the indice is constant unfortunately, so it doesn't look like I can make that distinction in the legalizer unless I rewrite some of the legalization logic in a custom handler. I'd like to get @arsenm 's input first before implementing that though
> > > > > > 
> > > > > > > I would hope that it only gets lowered to memory accesses if the index is non-constant
> > > > > > 
> > > > > > That seems a bit orthogonal to the current issue though, the legalizer currently doesn't make a distinction based on the const-ness of the indice (I think) so maybe it's best to not start taking it into account here, as it may have other implications. I feel like the `vectorWiderThan(VecTypeIdx, 32)` solution would be fine for now even if it's a bit hacky
> > > > > > I'll take another look at the situation tomorrow and see if I can come up with another fix
> > > > > Legality cannot be context dependent. You can't have some cases be legal or not based on regular value operands.
> > > > > 
> > > > > I think treating the legalization different isn't the right strategy. In the DAG path we have combines to turn 16-bit vector operations into 32-bit vector operations with some shifts to extract of the elements. With optimizations, you wouldn't really ever use the 16->32 vector path
> > > > So the latest changes I made to the legalizer are good, and the remaining changes need to be in a pre-legalizer combine for V2S16 INSERT/EXTRACT_VECTOR_ELT ?
> > > > 
> > > So, the combine I was thinking of seems to be already done as the bitcast lowering for packed vectors (as is done in the old code). You should continue to use the bitcast lowering for vectors. You only want G_BUILD_VECTOR to be legal specifically for <2 x s16>
> > G_BUILD_VECTOR legalization isn't a pain point for now, the current rules seem to be working but I can surely restrict it to 2x16 only.
> > The current problem is with G_EXTRACT_VECTOR_ELT/G_INSERT_VECTOR_ELT, for matching the mad_mix stuff they need to be lowered (`.lowerFor({V2S16, S16})` does the trick), but it introduces huge regressions in `insertelement.i16` where some bitwise ops become stack operations with load/stores/etc. What can be done there?
> Yes, lower for these uses the stack (We could also add a compare and select path). The 16-bit vectors should continue to use the bitcast lowering (I'm not actually sure why you are touching these)
There are too many changes in this patch. I am stuck on legalizer changes.
Maybe something from this stack of patches D109242 would be helpful. 
Can you add test with G_EXTRACT_VECTOR_ELT/G_INSERT_VECTOR_ELT you mentioned and how you would like that to look like?

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D134433/new/

https://reviews.llvm.org/D134433