[PATCH] D28782: [AMDGPU] Do not allow register coalescer to create big superregs

Mon Jan 16 17:07:34 PST 2017

arsenm added a comment.

I think a little more experimentation here might be worthwhile. It's not obvious to me that this is the right heuristic. Allowing 8 or wider might be beneficial. With subregister liveness tracking I would hope that there wouldn't be much difference for 2-4 register tuples. For the larger registers I could see there being more issues.

I see a very small improvement in shader-db with this as is:

34622 shaders in 21459 tests
Totals:
SGPRS: 1494589 -> 1494573 (-0.00 %)
VGPRS: 941553 -> 941353 (-0.02 %)
Spilled SGPRs: 1348 -> 1348 (0.00 %)
Spilled VGPRs: 109 -> 109 (0.00 %)
Private memory VGPRs: 1644 -> 1644 (0.00 %)
Scratch size: 3320 -> 3320 (0.00 %) dwords per thread
Code Size: 40831552 -> 40835224 (0.01 %) bytes
LDS: 3021 -> 3021 (0.00 %) blocks
Max Waves: 297982 -> 298015 (0.01 %)
Wait states: 0 -> 0 (0.00 %)

Totals from affected shaders:
SGPRS: 19168 -> 19152 (-0.08 %)
VGPRS: 15952 -> 15752 (-1.25 %)
Spilled SGPRs: 0 -> 0 (0.00 %)
Spilled VGPRs: 0 -> 0 (0.00 %)
Private memory VGPRs: 0 -> 0 (0.00 %)
Scratch size: 0 -> 0 (0.00 %) dwords per thread
Code Size: 890656 -> 894328 (0.41 %) bytes
LDS: 0 -> 0 (0.00 %) blocks
Max Waves: 2197 -> 2230 (1.50 %)
Wait states: 0 -> 0 (0.00 %)

If I increase the threshold to 8 I see slightly better results for SGPRs and code size, though the total VGPR reduction is a little smaller (176 vs. 200):

34622 shaders in 21459 tests
Totals:
SGPRS: 1494589 -> 1494549 (-0.00 %)
VGPRS: 941553 -> 941377 (-0.02 %)
Spilled SGPRs: 1348 -> 1348 (0.00 %)
Spilled VGPRs: 109 -> 109 (0.00 %)
Private memory VGPRs: 1644 -> 1644 (0.00 %)
Scratch size: 3320 -> 3320 (0.00 %) dwords per thread
Code Size: 40831552 -> 40834176 (0.01 %) bytes
LDS: 3021 -> 3021 (0.00 %) blocks
Max Waves: 297982 -> 298014 (0.01 %)
Wait states: 0 -> 0 (0.00 %)

Totals from affected shaders:
SGPRS: 10664 -> 10624 (-0.38 %)
VGPRS: 10624 -> 10448 (-1.66 %)
Spilled SGPRs: 0 -> 0 (0.00 %)
Spilled VGPRs: 0 -> 0 (0.00 %)
Private memory VGPRs: 0 -> 0 (0.00 %)
Scratch size: 0 -> 0 (0.00 %) dwords per thread
Code Size: 627904 -> 630528 (0.42 %) bytes
LDS: 0 -> 0 (0.00 %) blocks
Max Waves: 1111 -> 1143 (2.88 %)
Wait states: 0 -> 0 (0.00 %)

================
Comment at: lib/Target/AMDGPU/SIRegisterInfo.cpp:1484-1486
+  unsigned SrcSize = SrcRC->getSize();
+  unsigned DstSize = DstRC->getSize();
+  unsigned NewSize = NewRC->getSize();
----------------
This isn't being used for the spill size, so it should use getRegBitWidth rather than getSize.

================
Comment at: lib/Target/AMDGPU/SIRegisterInfo.cpp:1491-1493
+  // Always allow dword and sub-dword coalescing.
+  if (SrcSize <= 4 || DstSize <= 4)
+    return true;
----------------
We don't have sub-dword registers, so the < part of the <= comparison and the "sub-dword" wording in the comment are misleading.
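
A rough, standalone sketch of what the check could look like with widths expressed in bits and an exact dword comparison. This is not the code from the patch: shouldCoalesceSketch, DwordBits and MaxDwords are hypothetical names, the plain unsigned widths stand in for whatever a bit-width query on the register class would return, and the rules after the dword check are only an assumption about how a size limit might be applied.

```cpp
#include <cassert>

// Width of one 32-bit register, in bits.
const unsigned DwordBits = 32;

// Hypothetical sketch of a coalescing limit.
//   SrcBits, DstBits: widths of the two register classes being joined.
//   NewBits:          width of the class the coalesced register would get.
//   MaxDwords:        assumed tunable threshold, in dwords.
inline bool shouldCoalesceSketch(unsigned SrcBits, unsigned DstBits,
                                 unsigned NewBits, unsigned MaxDwords) {
  assert(MaxDwords >= 1 && "threshold must allow at least one dword");

  // Always allow coalescing when either side is a single dword.
  // There is nothing narrower than a dword, so compare with ==.
  if (SrcBits == DwordBits || DstBits == DwordBits)
    return true;

  // Assumption: a result no wider than an existing operand creates
  // no new large tuple, so allow it.
  if (NewBits <= SrcBits || NewBits <= DstBits)
    return true;

  // Assumption: otherwise only allow results up to the threshold.
  return NewBits <= MaxDwords * DwordBits;
}
```

Under these assumptions, joining two 64-bit tuples into a 128-bit one is rejected with MaxDwords set to 2 and allowed with MaxDwords set to 8, which is the kind of difference the threshold experiments above are probing.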

================
Comment at: test/CodeGen/AMDGPU/limit-coalesce.mir:53
+    FLAT_STORE_DWORDX2 killed %5, killed %vgpr0_vgpr1, 0, 0, 0, implicit %exec, implicit %flat_scr
+
+...
----------------
Can you add more tests for more register sizes?

Repository:
  rL LLVM

https://reviews.llvm.org/D28782