[PATCH] D124385: AMDGPU: Special case divergence analysis for wave ID computation

Sun May 1 14:42:41 PDT 2022

nhaehnle added a comment.

I have a bunch of questions about this.

I don't understand why / how this code is trying to compute a wave ID. My understanding is that `workitem.id.{x,y,z}` is the workgroup-local ID of a thread/lane, so the first thread always has `(0,0,0)`, the next one has `(1,0,0)` (assuming the x-size of the workgroup is at least 2) and so on. We can compute the wave ID using implicit knowledge about how the hardware unrolls the threads of a workgroup into waves, but I would not expect to see the workgroup ID as part of the calculation. In other words, something looks very suspicious about the incoming code.

The logic of the uniformity check also looks wrong. We can't just treat X, Y, and Z as all the same.

Here's what I would expect a wave ID calculation to look like -- this uses the knowledge that the HW uses a "typewriter" unrolling of threads:

  # workitem.id.linear is LocalInvocationIndex in SPIR-V
  workitem.id.linear = workitem.id.x + workgroup.size.x * (workitem.id.y + workgroup.size.y * workitem.id.z)
  waveid = workitem.id.linear / wavesize
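
As a sanity check of the formula above, here is a minimal Python sketch. The function names (`linear_id`, `wave_id`) and the 8x8x2 workgroup shape are illustrative assumptions, not anything from the patch, and it only models the arithmetic under the "typewriter" assumption, not the hardware itself.

```python
def linear_id(ident, size):
    """Typewriter (x-fastest) linearization of a workgroup-local ID."""
    x, y, z = ident
    sx, sy, _sz = size
    return x + sx * (y + sy * z)


def wave_id(ident, size, wavesize):
    """Wave index within the workgroup, assuming waves are filled in linear order."""
    return linear_id(ident, size) // wavesize


# Hypothetical 8x8x2 workgroup with wavesize 32: collect the lanes of each wave.
size = (8, 8, 2)
waves = {}
for z in range(size[2]):
    for y in range(size[1]):
        for x in range(size[0]):
            waves.setdefault(wave_id((x, y, z), size, 32), []).append((x, y, z))
```

For this shape, both x and y vary within a single wave while z does not, which is one reason the three components can't be treated interchangeably.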

Also, there's a caveat: if partial workgroups are used, then the workgroups "on the fringes" of the grid use a different workgroup size, which must be derived by combining the base workgroup size with the grid size.
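
To make that caveat concrete, here is a rough sketch of how the effective size along one dimension could be derived. It assumes the grid is tiled with the base size and the last workgroup gets the remainder; the function name and parameters are hypothetical and not taken from the patch or from any existing intrinsic.

```python
def effective_workgroup_size(group_id, base_size, grid_size):
    """Size of workgroup `group_id` along one dimension.

    Assumption: the grid is tiled with `base_size` work-items per workgroup
    and the last workgroup along the dimension gets whatever is left over.
    """
    remaining = grid_size - group_id * base_size
    if remaining <= 0:
        raise ValueError("group_id is outside the grid")
    return min(base_size, remaining)
```

With a base size of 64 and a grid size of 100, workgroup 0 gets 64 work-items and workgroup 1 gets 36, so any linearization for the fringe workgroup would have to use 36 rather than 64.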

Assuming no partial workgroups, the calculation can be simplified if we know (based on kernel attributes) that `workgroup.size.x` or `(workgroup.size.x * workgroup.size.y)` is a multiple of the wavesize.
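
For illustration, a small Python check of that simplification for the case where `workgroup.size.x` is a multiple of the wavesize: each row along x then splits into whole waves, so the division can be applied to `workitem.id.x` alone. All names here are made up for the sketch.

```python
def wave_id_general(x, y, z, sx, sy, wavesize):
    """Wave ID from the full linearized work-item ID."""
    return (x + sx * (y + sy * z)) // wavesize


def wave_id_x_multiple(x, y, z, sx, sy, wavesize):
    """Simplified form; only valid when sx is a multiple of wavesize."""
    assert sx % wavesize == 0
    waves_per_row = sx // wavesize
    return x // wavesize + waves_per_row * (y + sy * z)


def forms_agree(sx, sy, sz, wavesize):
    """Exhaustively compare both forms over one full (non-partial) workgroup."""
    return all(
        wave_id_general(x, y, z, sx, sy, wavesize)
        == wave_id_x_multiple(x, y, z, sx, sy, wavesize)
        for z in range(sz)
        for y in range(sy)
        for x in range(sx)
    )
```

In the simplified form, the only per-lane varying input is `x // wavesize`; the y and z terms are the same for every lane of a wave.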

To be honest, it would probably be best in the long run if we did something along the lines of:

- Introduce a bunch of new intrinsics: wave.id, workitem.id.linear, workgroup.size.{x,y,z}.
- Add InstCombine patterns to canonicalize known code patterns to these intrinsics.
- The divergence analysis improvement is now trivial.

(We would also do well to have an intrinsic as the canonical form for the lane ID instead of `mbcnt(-1)` as a first step to improving codegen for some related branching patterns, e.g. there's a bunch of code out there that branches on `laneid == 0` or `workitem.id.linear == 0`.)

================
Comment at: llvm/test/Analysis/DivergenceAnalysis/AMDGPU/wave-id-computation.ll:27
+  %i14 = add i32 %i12, %i13
+  %i15 = sdiv exact i32 %i14, 32
+  ret i32 %i15
----------------
Why is this division exact?

https://reviews.llvm.org/D124385