[PATCH] D103225: [AMDGPU] Replace non-kernel function uses of LDS globals by pointers.

Wed Jun 9 11:55:16 PDT 2021

rampitec added a comment.

In D103225#2808671 <https://reviews.llvm.org/D103225#2808671>, @rampitec wrote:

> In D103225#2808121 <https://reviews.llvm.org/D103225#2808121>, @hsmhsm wrote:
>
>> Implemented approach (2). Here we actually do not need __builtin_amdgcn_mbcnt_hi(~0u, __builtin_amdgcn_mbcnt_lo(~0u, 0u)). Irrespective of wave64 or wave32, __builtin_amdgcn_mbcnt_lo(~0u, 0u) is enough. The reason is that we only want to identify lane 0. On the other hand, for wave64, if we wanted to identify any lane greater than 31, then we would need __builtin_amdgcn_mbcnt_hi(~0u, __builtin_amdgcn_mbcnt_lo(~0u, 0u)).
>
> As far as I understand mbcnt_lo will return 0 for any thread >= 32, so you still need to use mbcnt_hi.
>
> @b-sumner why do you suggest nesting the hi and lo calls? I think it should be (__builtin_amdgcn_mbcnt_lo(~0u, 0u) + __builtin_amdgcn_mbcnt_hi(~0u, 0u)) == 0.

Anyhow, the ISA manual says:

  Example to compute each thread's position in 0..63:
  v_mbcnt_lo_u32_b32 v0, -1, 0
  v_mbcnt_hi_u32_b32 v0, -1, v0
  // v0 now contains ThreadPosition

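So it says to nest the two calls: the result of mbcnt_lo is passed as the addend of mbcnt_hi. That is probably to save one instruction, since summing two independent results would need a separate add.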
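
For reference, here is a small host-side C model of the two forms being compared, the nested one from the manual and the summed one. It is only a sketch: the sim_* and lane_id_* names are hypothetical, the lane index is passed explicitly because this is plain CPU code, and the bit-counting semantics are my reading of the ISA description (count the mask bits that belong to lanes below the current one, restricted to the low or high 32 lanes, then add the second operand).

```c
#include <assert.h>
#include <stdint.h>

/* Host-side model only, not GPU code. All names here are hypothetical. */

/* Number of set bits in x. */
unsigned popcount32(uint32_t x) {
    unsigned n = 0;
    while (x) {
        x &= x - 1; /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* 64-bit mask with one bit set for every lane below `lane`. */
uint64_t lanes_below(unsigned lane) {
    return (lane >= 64) ? ~0ull : ((1ull << lane) - 1);
}

/* Assumed model of v_mbcnt_lo_u32_b32: counts mask bits among lanes 0..31
 * that are below `lane`, then adds src. */
uint32_t sim_mbcnt_lo(uint32_t mask, uint32_t src, unsigned lane) {
    uint32_t below = (uint32_t)lanes_below(lane);
    return popcount32(mask & below) + src;
}

/* Assumed model of v_mbcnt_hi_u32_b32: same, but for lanes 32..63. */
uint32_t sim_mbcnt_hi(uint32_t mask, uint32_t src, unsigned lane) {
    uint32_t below = (uint32_t)(lanes_below(lane) >> 32);
    return popcount32(mask & below) + src;
}

/* Nested form from the manual: two operations, hi accumulates onto lo. */
uint32_t lane_id_nested(unsigned lane) {
    return sim_mbcnt_hi(~0u, sim_mbcnt_lo(~0u, 0u, lane), lane);
}

/* Summed form: two independent counts plus an explicit add. */
uint32_t lane_id_summed(unsigned lane) {
    return sim_mbcnt_lo(~0u, 0u, lane) + sim_mbcnt_hi(~0u, 0u, lane);
}

/* Both forms should yield the lane's position for every lane of a wave64. */
void check_all_lanes(void) {
    for (unsigned lane = 0; lane < 64; lane++) {
        assert(lane_id_nested(lane) == lane);
        assert(lane_id_summed(lane) == lane);
    }
}
```

Under this model the two forms agree on every lane, so the difference is only that the nested form lets mbcnt_hi do the addition.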

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D103225/new/

https://reviews.llvm.org/D103225