[llvm] [AMDGPU] Fix an issue that wrong index is used in calculation of byte provider when the op is extract_vector_elt (PR #91697)
Jeffrey Byrnes via llvm-commits
llvm-commits at lists.llvm.org
Mon May 20 13:18:46 PDT 2024
jrbyrnes wrote:
> > Is this one of the tests in the file? What do you mean by regression?
>
> What was this regression? Is this one of the existing tests?
Thanks for the ping -- this got buried in emails while I was away.
By regression, I mean quality of code, not correctness. Using the example provided
llc -mtriple=amdgcn -mcpu=gfx908 -verify-machineinstrs test.ll
git fetch origin pull/91697/head:fix-cbp
git checkout fix-cbp
(with PR)
s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
global_load_dword v6, v[2:3], off offset:12
global_load_dword v7, v[0:1], off offset:4
s_movk_i32 s4, 0xff00
s_waitcnt vmcnt(1)
v_lshlrev_b16_e32 v0, 8, v6
v_and_b32_sdwa v1, v6, s4 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:DWORD
v_or_b32_sdwa v0, v6, v0 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
s_waitcnt vmcnt(0)
v_or_b32_sdwa v1, v7, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_3 src1_sel:DWORD
v_or_b32_sdwa v0, v1, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
global_store_dword v[4:5], v0, off
s_waitcnt vmcnt(0)
s_setpc_b64 s[30:31]
git revert
(without PR)
s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
global_load_dword v6, v[0:1], off offset:4
global_load_dword v7, v[2:3], off offset:12
s_movk_i32 s4, 0x307
s_waitcnt vmcnt(0)
v_perm_b32 v0, v6, v7, s4
global_store_dword v[4:5], v0, off
s_waitcnt vmcnt(0)
s_setpc_b64 s[30:31]
I have seen a similar pattern in existing real code (CK). I don't think it is covered in existing lit tests.
SelectionDAG has 45 nodes:
t0: ch,glue = EntryToken
t2: i32,ch = CopyFromReg # D:1 t0, Register:i32 %0
t4: i32,ch = CopyFromReg # D:1 t0, Register:i32 %1
t13: i64 = build_pair # D:1 t2, t4
t18: v4i32,ch = load<(load (s128) from %ir.in0, align 4, addrspace 1)> # D:1 t0, t13, undef:i64
t6: i32,ch = CopyFromReg # D:1 t0, Register:i32 %2
t8: i32,ch = CopyFromReg # D:1 t0, Register:i32 %3
t14: i64 = build_pair # D:1 t6, t8
t19: v4i32,ch = load<(load (s128) from %ir.in1, align 4, addrspace 1)> # D:1 t0, t14, undef:i64
t25: i32 = extract_vector_elt # D:1 t19, Constant:i32<3>
t51: i16 = truncate # D:1 t25
t29: ch = TokenFactor t18:1, t19:1
t22: i32 = extract_vector_elt # D:1 t18, Constant:i32<1>
t36: i32 = srl # D:1 t22, Constant:i32<16>
t37: i16 = truncate # D:1 t36
t42: i16 = srl # D:1 t37, Constant:i16<8>
t52: i32 = srl # D:1 t25, Constant:i32<16>
t53: i16 = truncate # D:1 t52
t89: i16 = and # D:1 t53, Constant:i16<-256>
t74: i16 = or # D:1 t42, t89
t61: i32 = zero_extend # D:1 t74
t85: i16 = and # D:1 t51, Constant:i16<255>
t82: i16 = shl # D:1 t51, Constant:i16<8>
t83: i16 = or # D:1 t85, t82
t62: i32 = any_extend # D:1 t83
t63: i32 = shl # D:1 t62, Constant:i32<16>
t64: i32 = or # D:1 t61, t63
t10: i32,ch = CopyFromReg # D:1 t0, Register:i32 %4
t12: i32,ch = CopyFromReg # D:1 t0, Register:i32 %5
t15: i64 = build_pair # D:1 t10, t12
t33: ch = store<(store (s32) into %ir.out0, addrspace 1)> # D:1 t29, t64, t15, undef:i64
t31: ch = RET_GLUE t33
With the PR, we have source bytes: t22, t25, t25, and t19. Without the PR, I see source bytes: t22, t22, t25, t25. The issue here seems to be too many source nodes, and trying to combine them is not profitable.
It seems the answer is that if we are going to look through extract_vector_elt in calculateByteProvider, we should also look them in calculateSrcByte. But, doing such may cause your crash to reappear (I don't understand how this is causing a crash?)
https://github.com/llvm/llvm-project/pull/91697
More information about the llvm-commits
mailing list