[llvm] [AMDGPU] Fix an issue that wrong index is used in calculation of byte provider when the op is extract_vector_elt (PR #91697)

Mon May 20 13:18:46 PDT 2024

jrbyrnes wrote:

> > Is this one of the tests in the file? What do you mean by regression?
> 
> What was this regression? Is this one of the existing tests?

Thanks for the ping -- this got buried in emails while I was away.

By regression, I mean quality of code, not correctness. Using the example provided

llc -mtriple=amdgcn -mcpu=gfx908 -verify-machineinstrs test.ll

git fetch origin pull/91697/head:fix-cbp
git checkout fix-cbp

(with PR)

        s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
        global_load_dword v6, v[2:3], off offset:12
        global_load_dword v7, v[0:1], off offset:4
        s_movk_i32 s4, 0xff00
        s_waitcnt vmcnt(1)
        v_lshlrev_b16_e32 v0, 8, v6
        v_and_b32_sdwa v1, v6, s4 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:DWORD
        v_or_b32_sdwa v0, v6, v0 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
        s_waitcnt vmcnt(0)
        v_or_b32_sdwa v1, v7, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_3 src1_sel:DWORD
        v_or_b32_sdwa v0, v1, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
        global_store_dword v[4:5], v0, off
        s_waitcnt vmcnt(0)
        s_setpc_b64 s[30:31]

git revert

(without PR)

        s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
        global_load_dword v6, v[0:1], off offset:4
        global_load_dword v7, v[2:3], off offset:12
        s_movk_i32 s4, 0x307
        s_waitcnt vmcnt(0)
        v_perm_b32 v0, v6, v7, s4
        global_store_dword v[4:5], v0, off
        s_waitcnt vmcnt(0)
        s_setpc_b64 s[30:31]

I have seen a similar pattern in existing real code (CK).  I don't think it is covered in existing lit tests.

SelectionDAG has 45 nodes:
  t0: ch,glue = EntryToken
      t2: i32,ch = CopyFromReg # D:1 t0, Register:i32 %0
      t4: i32,ch = CopyFromReg # D:1 t0, Register:i32 %1
    t13: i64 = build_pair # D:1 t2, t4
  t18: v4i32,ch = load<(load (s128) from %ir.in0, align 4, addrspace 1)> # D:1 t0, t13, undef:i64
      t6: i32,ch = CopyFromReg # D:1 t0, Register:i32 %2
      t8: i32,ch = CopyFromReg # D:1 t0, Register:i32 %3
    t14: i64 = build_pair # D:1 t6, t8
  t19: v4i32,ch = load<(load (s128) from %ir.in1, align 4, addrspace 1)> # D:1 t0, t14, undef:i64
  t25: i32 = extract_vector_elt # D:1 t19, Constant:i32<3>
  t51: i16 = truncate # D:1 t25
      t29: ch = TokenFactor t18:1, t19:1
                  t22: i32 = extract_vector_elt # D:1 t18, Constant:i32<1>
                t36: i32 = srl # D:1 t22, Constant:i32<16>
              t37: i16 = truncate # D:1 t36
            t42: i16 = srl # D:1 t37, Constant:i16<8>
                t52: i32 = srl # D:1 t25, Constant:i32<16>
              t53: i16 = truncate # D:1 t52
            t89: i16 = and # D:1 t53, Constant:i16<-256>
          t74: i16 = or # D:1 t42, t89
        t61: i32 = zero_extend # D:1 t74
              t85: i16 = and # D:1 t51, Constant:i16<255>
              t82: i16 = shl # D:1 t51, Constant:i16<8>
            t83: i16 = or # D:1 t85, t82
          t62: i32 = any_extend # D:1 t83
        t63: i32 = shl # D:1 t62, Constant:i32<16>
      t64: i32 = or # D:1 t61, t63
        t10: i32,ch = CopyFromReg # D:1 t0, Register:i32 %4
        t12: i32,ch = CopyFromReg # D:1 t0, Register:i32 %5
      t15: i64 = build_pair # D:1 t10, t12
    t33: ch = store<(store (s32) into %ir.out0, addrspace 1)> # D:1 t29, t64, t15, undef:i64
  t31: ch = RET_GLUE t33

With the PR, we have source bytes: t22, t25, t25, and t19. Without the PR, I see source bytes: t22, t22, t25, t25. The issue here seems to be too many source nodes, and trying to combine them is not profitable.

It seems the answer is that if we are going to look through extract_vector_elt in calculateByteProvider, we should also look them in calculateSrcByte. But, doing such may cause your crash to reappear (I don't understand how this is causing a crash?)

https://github.com/llvm/llvm-project/pull/91697