[PATCH] D18340: AMDGPU: allow specifying a workgroup size that needs to fit in a compute unit

Nicolai Hähnle via llvm-commits llvm-commits at lists.llvm.org
Mon Mar 21 21:06:10 PDT 2016


nhaehnle added a comment.

Thanks for looking at compute :)

I also think the function attribute is the better solution for workgroup size. We want to create the TargetMachine object only once per context (or possibly per compile thread in the future) in Mesa.

For your tests, it might be good to check the spilling stride as well, though I'm not sure how robust that is. Spill tests tend to be annoying.


================
Comment at: lib/Target/AMDGPU/AMDGPU.h:131-132
@@ -130,3 +130,4 @@
     GEOMETRY = 2,
-    COMPUTE = 3
+    COMPUTE = 3,
+    GL_COMPUTE = 4
   };
----------------
arsenm wrote:
> I think these should be distinguished by the ABI rather than introducing a new shader type. Also semi-related, I would like to replace shader types with calling conventions
I agree, the GL_COMPUTE thing should be redundant. For the graphics shader types, there might be a very small number of use cases (e.g. the WholeQuadMode pass is completely skipped unless we're compiling a pixel shader).

================
Comment at: lib/Target/AMDGPU/SIMachineFunctionInfo.cpp:208-211
@@ -204,3 +207,6 @@
   // is available.
-  return getShaderType() == ShaderType::COMPUTE ? 256 : ST.getWavefrontSize();
+  if(getShaderType() == ShaderType::COMPUTE ||
+     getShaderType() == ShaderType::GL_COMPUTE)
+    return MaximumWorkGroupSize;
+  return ST.getWavefrontSize();
 }
----------------
I think it would be nicer to just set the MaximumWorkGroupSize variable to the correct value during initialization.

================
Comment at: lib/Target/AMDGPU/SIRegisterInfo.cpp:28-31
@@ +27,6 @@
+
+unsigned getMinWaveCount(const MachineFunction &MF) {
+  const SIMachineFunctionInfo& MFI = *MF.getInfo<SIMachineFunctionInfo>();
+  return (MFI.getMaximumWorkGroupSize(MF) + 255) / 256;
+}
+
----------------
I find the min/max naming here confusing. Surely this function should be named getMaxWaveCount?

Also, what about getMaxWaveCountPerSIMD?
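
Whatever the name ends up being, the arithmetic in the patch is a ceiling division: a workgroup of N work-items occupies ceil(N / 256) waves per SIMD, since a compute unit has 4 SIMDs each running 64-lane wavefronts (4 * 64 = 256). A minimal sketch, with an illustrative name rather than the patch's final API:

```cpp
// Illustrative helper mirroring the patch's computation: number of waves each
// SIMD must hold to fit a workgroup of the given size. The divisor 256 is
// 4 SIMDs/CU * 64 lanes/wave.
unsigned wavesPerSIMD(unsigned WorkGroupSize) {
  return (WorkGroupSize + 255) / 256; // ceiling division
}
```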

================
Comment at: lib/Target/AMDGPU/SIRegisterInfo.cpp:43
@@ +42,3 @@
+
+  AllowedSGPRCount = (AllowedSGPRCount / MinWaveCount) & ~7;
+
----------------
VI+ rounds SGPR allocations to multiples of 16.
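In other words, the `& ~7` above assumes the SI/CI allocation granule of 8 SGPRs; on VI+ the granule is 16, so the mask would be `~15`. A hedged sketch of the generalization (the helper name is mine, not the patch's):

```cpp
// Illustrative only: round a per-wave SGPR budget down to the hardware
// allocation granule when dividing the SGPR file among waves.
// SI/CI allocate SGPRs in blocks of 8; VI+ allocates in blocks of 16.
unsigned alignDownToGranule(unsigned SGPRCount, unsigned Granule) {
  return SGPRCount & ~(Granule - 1); // Granule must be a power of two
}
```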

================
Comment at: lib/Target/AMDGPU/SIRegisterInfo.cpp:45-53
@@ +44,11 @@
+
+  if (ST.getGeneration() >= AMDGPUSubtarget::VOLCANIC_ISLANDS) {
+    if (ST.hasSGPRInitBug())
+      AllowedSGPRCount = std::min<unsigned>(AllowedSGPRCount,
+                            AMDGPUSubtarget::FIXED_SGPR_COUNT_FOR_INIT_BUG) - 6;
+    else
+      AllowedSGPRCount = std::min(AllowedSGPRCount, 102U) - 6;
+  } else
+    AllowedSGPRCount = std::min(AllowedSGPRCount, 104U) - 2;
+  return AllowedSGPRCount;
+}
----------------
As Marek has pointed out in the past, the calculations surrounding SGPR allocations are a bit questionable.

The 102/104 limit comes from the instruction encoding, which can encode 104 SGPRs in <= CIK and 102 SGPRs in >= VI. The subtraction of 2/6 comes from VCC/XNACK/FLAT_SCR - but we don't have to address those as regular SGPRs!

So on VI+, it's perfectly fine to use 102 SGPRs, you just then have to tell the hardware to reserve >= 108 slots in the SGPR register bank. Due to rounding up to multiples of 16, this ends up allocating 112 SGPRs, which does waste four slots per wave, but whatever.

In practice, what this means for your calculation is that it should be std::min(AllowedSGPRCount - 2/6, 104/102) (except for the InitBug case, where you should indeed clamp AllowedSGPRCount **before** taking space for VCC & friends).
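
To make the suggested order of operations concrete, here is a sketch (names and signature are illustrative, and the InitBug special case is left out for brevity): subtract the reserved registers from the per-wave budget first, then clamp to the operand-encoding limit, rather than subtracting from the already-clamped limit.

```cpp
#include <algorithm>

// Sketch of the suggested calculation: VCC costs 2 SGPRs everywhere; on VI+
// XNACK and FLAT_SCR add another 4, and the encoding limit drops from 104
// addressable SGPRs to 102.
unsigned maxAddressableSGPRs(unsigned AllowedSGPRCount, bool IsVIPlus) {
  unsigned Reserved = IsVIPlus ? 6 : 2;          // VCC (+ XNACK/FLAT_SCR on VI+)
  unsigned EncodingLimit = IsVIPlus ? 102 : 104; // scalar operand encoding limit
  return std::min(AllowedSGPRCount - Reserved, EncodingLimit);
}
```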


http://reviews.llvm.org/D18340
