[PATCH] D146523: [AMDGPU]: Add new intrinsic llvm.amdgcn.convergent.copy

Wed Mar 22 06:12:30 PDT 2023

foad added a comment.

In D146523#4212482 <https://reviews.llvm.org/D146523#4212482>, @foad wrote:

> In D146523#4212396 <https://reviews.llvm.org/D146523#4212396>, @pravinjagtap wrote:
>
>> In D146523#4210812 <https://reviews.llvm.org/D146523#4210812>, @foad wrote:
>>
>>> I don't think this patch solves any real problem, it just raises a bunch of questions about what you're trying to do.
>>>
>>> If you want to read values from inactive lanes of a vgpr robustly then you need something like WWM - but I guess you don't trust the WWM implementation, so you're back to square one.
>>>
>>> However... why are you doing readlane //inside// the part of the code that only has a single active lane? You can write a reduction across all active lanes like this without changing EXEC. (This is the unrolled version but you can put it in a loop if you want; that's irrelevant.)
>>>
>>>   s_mov s0, 0 ; initialize accumulator
>>>   ; conditionally add in lane 0
>>>   v_readlane s1, v0, 0
>>>   s_bitcmp1 exec, 0
>>>   s_cselect s1, s1, 0
>>>   s_add s0, s0, s1
>>>   ; conditionally add in lane 1
>>>   v_readlane s1, v0, 1
>>>   s_bitcmp1 exec, 1
>>>   s_cselect s1, s1, 0
>>>   s_add s0, s0, s1
>>>   ...
>>>   ; conditionally add in lane 31
>>>   v_readlane s1, v0, 31
>>>   s_bitcmp1 exec, 31
>>>   s_cselect s1, s1, 0
>>>   s_add s0, s0, s1
>>>   ; result is in s0
>>>
>>> You should be able to generate code like this from regular IR using the readlane instrinsic, which is already marked as Convergent. Once you've done the reduction you can do your atomic operation with only one lane active by generating regular IR like the AtomicOptimizer pass does.
>>
>> Hello Jay, Can I extend this ISA as follows to store the intermediate results of scan for each active lane using writelane? There will be cases where we may need this intermediate scan results of each active lane later in the kernel.
>>
>>   s_mov s0, 0 ; initialize accumulator
>>   v_mov v1, 0 ; initialize a vgpr to store scan results of each active lane
>>   ; conditionally add in lane 0
>>   v_readlane s1, v0, 0
>>   s_bitcmp1 exec, 0
>>   s_cselect s1, s1, 0
>>   s_add s0, s0, s1
>>   v_writelane v1, s0, 0
>>   ; conditionally add in lane 1
>>   v_readlane s1, v0, 1
>>   s_bitcmp1 exec, 1
>>   s_cselect s1, s1, 0
>>   s_add s0, s0, s1
>>   v_writelane v1, s0, 1
>>   ...
>>   ; conditionally add in lane 31
>>   v_readlane s1, v0, 31
>>   s_bitcmp1 exec, 31
>>   s_cselect s1, s1, 0
>>   s_add s0, s0, s1
>>   v_writelane v1, s0, 31
>
> Yes, that seems reasonable to me.

Actually on second thougths, that will write a value to inactive lanes of v1, so it seems dangerous. I think you might have to use a conditional branch to skip the writelane for inactive lanes.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D146523/new/

https://reviews.llvm.org/D146523