[PATCH] D87556: [amdgpu] Lower SGPR-to-VGPR copy in the final phase of ISel.

Fri Sep 11 22:37:08 PDT 2020

hliao added inline comments.

================
Comment at: llvm/lib/Target/AMDGPU/SIFoldOperands.cpp:1240-1244
+        // Skip updating literal use if it's used in the same REQ_SQUENCE as,
+        // if that literal could be inlined, it's just a single use.
+        if (NonInlineUse && NonInlineUse->getParent() == UseMI &&
+            UseMI->isRegSequence())
+          continue;
----------------
This change is added as CSE performances before operand folding. Afer replacing COPY with V_MOV, we have only one V_MOV to load that literal value. The result of that loading is re-used multiple times in the following REG_SEQUENCE. This's a change to fix wqm.ll regression test.

================
Comment at: llvm/lib/Target/AMDGPU/SIISelLowering.cpp:105-108
+static cl::opt<bool> EnableLowerSGPRToVGPRCopy(
+    "lower-sgpr-to-vgpr-copy", cl::Hidden,
+    cl::desc("Enable lowering copy from SGPR to VGPR"), cl::init(true));
+
----------------
Add one option to disable or enable such lowering. Because this lowering may increase register live ranges as well as register pressures, some tests may have lower performance. Having this option eases the performance evaluation for suspected benchmarks.

================
Comment at: llvm/lib/Target/AMDGPU/SIISelLowering.cpp:11357-11358
+                DstRC == &AMDGPU::VReg_64RegClass) &&
+               (SrcRC == &AMDGPU::SGPR_32RegClass ||
+                SrcRC == &AMDGPU::SGPR_64RegClass);
+      };
----------------
Only handle SGPR_(32|64) so far as SReg_(32|64) by default in instruction selection. So far operand folding performs in the pre-order direction, which makes the chain of MOV from the immediate value to SGPR and then to VGPR won't be folded efficiently. We fold the first MOV from the immediate value to SGPR into the second MOV from SGPR to VGPR. IMHO, we should change the folding order from pre-order to post-order one.

================
Comment at: llvm/test/CodeGen/AMDGPU/fabs.ll:14

-; SI: s_and_b32 s{{[0-9]+}}, s{{[0-9]+}}, 0x7fffffff
+; SI: s_bitset0_b32 s{{[0-9]+}}, 31
 ; VI: s_bitset0_b32 s{{[0-9]+}}, 31
----------------
The additional optimizations remove redundant instructions so that shrink pass could discover such patterns.

================
Comment at: llvm/test/CodeGen/AMDGPU/sgpr-copy-cse.ll:10
+; CHECK-NOT: v_mov_b32_e32 v{{[0-9]+}}, s[[PTR_HI]]
+define protected amdgpu_kernel void @t0(float addrspace(1)* %p, i32 %i0, i32 %j0, i32 %k0) {
+entry:
----------------
This is the test extracted from that benchmark.

================
Comment at: llvm/test/CodeGen/AMDGPU/waitcnt-vscnt.ll:156-157
 ; GCN:        flat_load_dword
-; GCN:        s_waitcnt vmcnt(0) lgkmcnt(0){{$}}
+; GFX8_9:     s_waitcnt lgkmcnt(0){{$}}
+; GFX8_9:     s_waitcnt vmcnt(0){{$}}
+; GFX10:      s_waitcnt vmcnt(0) lgkmcnt(0){{$}}
----------------
as instructions are reduced, the schedule is slightly different from the previous one. Those 2 `s_waitcnt` could not be merged into a single one.

================
Comment at: llvm/test/CodeGen/AMDGPU/wqm.ll:653
 ; CHECK-DAG: v_mov_b32_e32 [[CTR:v[0-9]+]], 0
-; CHECK-DAG: s_mov_b32 [[SEVEN:s[0-9]+]], 0x40e00000
+; CHECK-DAG: v_mov_b32_e32 [[SEVEN:v[0-9]+]], 0x40e00000

----------------
Here's the result due to the pre-order operand folding. Previously, we have one S_MOV loading immediate value followed by COPY from SGPR to VGPR. As COPY doesn't allow immediate operand. The fold only happens on the user of that COPY. As we lowering that COPY to V_MOV, that S_MOV will be folded firstly into that V_MOV. However, that's not a good one as one VGPR is used instead of one SGPR. IMHO, the proper way to fix that is to change the pre-order operand folding into a post-order one, i.e. we always fold from the user instruction. That folding site is a better place to decide how to fold, especially if more than one operands could be folded.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D87556/new/

https://reviews.llvm.org/D87556