[PATCH] D75909: [AMDGPU] Remove the gfx10 VALU register destination cache model

Nicolai Hähnle via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Wed Mar 11 03:47:30 PDT 2020


nhaehnle added a comment.

The number of test changes is massive, but they look pretty benign and rather point at weaknesses elsewhere.

There are three purposes to the destination cache:

1. Provide some slack if write-back to the register file is contested.
2. Re-use recently computed results to reduce register read bank conflicts.
3. Re-use recently computed results to save power.

We should probably just ignore point 1; by far the main impact should be point 3. Either way, the moral is: if we can schedule dependent instructions so that they issue *exactly* when the latency of the instructions they depend on has elapsed, that's a win -- but a weak one, and it should be deprioritized relative to almost all other concerns.

I also find it dubious that VALU and other instruction types are lumped together here: even if they do have similar mechanisms (and other than SALU they really don't), they are *separate* mechanisms and would presumably have to be modeled as separate resources.

The bigger problem is that I don't see how the existing HWRC modeling captures what we want in the first place, and the test changes aren't particularly encouraging from the point of view of keeping it. Unless there is a really good argument for it, I agree with Jay that erring on the side of simplicity is the right thing to do here.
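
For reference, the existing modeling is just a unit-sized resource that essentially every write type consumes. Roughly (paraphrasing SISchedule.td's GFX10 model from memory, so the latencies and the exact set of writes here are approximate):

  def HWRC : ProcResource<1>; // register destination cache

  // VALU, SALU and LDS results all occupy the same single HWRC slot,
  // even though these are separate write-back paths in hardware.
  def : HWWriteRes<Write32Bit, [HWVALU, HWRC], 5>;
  def : HWWriteRes<WriteSALU,  [HWSALU, HWRC], 5>;
  def : HWWriteRes<WriteLDS,   [HWLGKM, HWRC], 20>;

As far as I can tell, a single shared unit resource like this mostly models a one-result-per-cycle throughput limit across all pipes, which is neither the bank-conflict benefit (2) nor the power saving (3) above.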



================
Comment at: llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.update.dpp.ll:65-68
 ; GFX10-NEXT:    v_mov_b32_e32 v3, s1
 ; GFX10-NEXT:    v_mov_b32_e32 v2, s0
-; GFX10-NEXT:    v_mov_b32_e32 v5, s3
-; GFX10-NEXT:    v_mov_b32_e32 v4, s2
-; GFX10-NEXT:    v_add_co_u32_e64 v6, vcc_lo, v2, v0
-; GFX10-NEXT:    v_add_co_ci_u32_e32 v7, vcc_lo, v3, v1, vcc_lo
-; GFX10-NEXT:    global_load_dwordx2 v[2:3], v[6:7], off
+; GFX10-NEXT:    v_add_co_u32_e64 v0, vcc_lo, v2, v0
+; GFX10-NEXT:    v_add_co_ci_u32_e32 v1, vcc_lo, v3, v1, vcc_lo
----------------
This is a regression because the distance between dependent instructions is reduced too much. But before worrying about that, we need to worry about what looks like an unnecessary move from an SGPR.


================
Comment at: llvm/test/CodeGen/AMDGPU/idot2.ll:112-115
+; GFX10-DL-NEXT:    v_mov_b32_e32 v0, s3
+; GFX10-DL-NEXT:    s_load_dword s3, s[6:7], 0x0
+; GFX10-DL-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX10-DL-NEXT:    v_dot2_u32_u16 v2, s3, s2, v0
----------------
This is an improvement because it eliminates a stall.


================
Comment at: llvm/test/CodeGen/AMDGPU/idot2.ll:260-262
+; GFX10-DL-NEXT:    v_mul_u32_u24_e64 v0, s2, s5
+; GFX10-DL-NEXT:    s_load_dword s2, s[0:1], 0x0
+; GFX10-DL-NEXT:    v_mad_u32_u24 v0, s4, s3, v0
----------------
These should be scalar instructions in the first place, but the change eliminates a stall.


================
Comment at: llvm/test/CodeGen/AMDGPU/llvm.amdgcn.image.sample.a16.dim.ll:789-794
 ; GFX10-NEXT:    v_and_b32_e32 v3, v8, v3
-; GFX10-NEXT:    v_lshl_or_b32 v5, v6, 16, v5
+; GFX10-NEXT:    v_and_b32_e32 v1, v8, v1
+; GFX10-NEXT:    v_and_b32_e32 v5, v8, v5
+; GFX10-NEXT:    v_lshl_or_b32 v3, v4, 16, v3
 ; GFX10-NEXT:    v_lshl_or_b32 v1, v2, 16, v1
+; GFX10-NEXT:    v_lshl_or_b32 v6, v6, 16, v5
----------------
This change just doesn't matter.


================
Comment at: llvm/test/CodeGen/AMDGPU/shrink-add-sub-constant.ll:467-470
+; GFX10-NEXT:    v_add_co_u32_e64 v2, s0, s0, v2
 ; GFX10-NEXT:    v_add_co_ci_u32_e64 v1, s2, s3, 0, s2
-; GFX10-NEXT:    global_load_dword v3, v[0:1], off
-; GFX10-NEXT:    v_add_co_u32_e64 v0, s0, s0, v2
-; GFX10-NEXT:    v_add_co_ci_u32_e64 v1, s0, s1, 0, s0
+; GFX10-NEXT:    v_add_co_ci_u32_e64 v3, s0, s1, 0, s0
+; GFX10-NEXT:    global_load_dword v0, v[0:1], off
----------------
The fact that the computations of v2 and v3 are moved earlier is a regression, but I would consider it unrelated. It rather points to the fact that the scheduler doesn't understand just how ridiculously long VMEM latency is, and that squeezing those extra VALUs in between the load and the computation of its address doesn't actually help at all.
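
To put rough numbers on it (again quoting the model from memory, so treat the exact latencies as approximate): a plain 32-bit VALU op is modeled at around 5 cycles, while VMEM is in a different league entirely:

  def : HWWriteRes<WriteVMEM, [HWVMEM, HWRC], 320>;  // one VMEM load: ~320 cycles

  // Hoisting one or two ~5-cycle VALUs above the load hides at most ~10 of
  // those ~320 cycles, while (presumably) just lengthening the v2/v3 live ranges.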


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D75909/new/

https://reviews.llvm.org/D75909




