[PATCH] D83674: [AMDGPU] Calculate minimum allowed occupancy based on threads per lane

Thu Jul 16 13:05:40 PDT 2020

rampitec added inline comments.

================
Comment at: llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp:200
+  // - 16 waves per SIMD for GFX10 wave32
+  unsigned MinOccupancy = ST.getTotalNumVGPRs() / 64;
+  return std::min(Occupancy, MinOccupancy);
----------------
nhaehnle wrote:
> rampitec wrote:
> > nhaehnle wrote:
> > > rampitec wrote:
> > > > This is not expressable in a number of VGPRs. The limiter is about memory and not about register budget at all. Meanwhile it looks like 8 waves might be more reasonable for wave32, but wave64 shall still use the same value across the targets, because at the end of the day it boils down to a number of concurrent memory operations.
> > > I agree that using number of VGPRs here is an awkward roundabout way of doing it. The point isn't really about VGPRs, it's about how many threads are alive per lane in order to allow a good mix of VALU and VMEM instruction issue in the face of certain (IMHO underspecified) memory pressure situations.
> > > 
> > > I also disagree that 8 is the right answer for Wave32, for precisely the same reason: it is about the number of concurrent memory operations. 16x Wave32 gives you the same amount of memory operations per CU (2 SIMD * 16 waves * 32 = 1024 workitems) as 4 waves on <=gfx9 (4 SIMD * 4 waves * 64 = 1024 workitems).
> > I can understand the logic behind it, but using of TotalNumVGPRs() is not really what we want here. It probably needs to use getMaxWavesPerEU() and wavefront size to estimate a number of threads in flight. Besides the same logic will give answer 16 even on gfx1030 where 16 is the absolute maximum. The latter does not sound right.
> I agree that gfx10 can probably drop the requirement from the 16 that would be equivalent to gfx9. I've heard numbers in the 10-12 range being thrown around as where dropping lower than that hurts performance. On gfx9, the consensus seemed to be 4 waves for the same kind of folklore. I don't think we really have enough experience and thorough performance studies with gfx10 yet to decide this finally. Maybe it should be a cl::opt... (**not** a function attribute, since this is specifically about the default heuristic).
Probably, and I do not think we can use simple divide x by y logic here.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D83674/new/

https://reviews.llvm.org/D83674