Missing TargetPrefix for NVVM intrinsics

Wed Jul 6 13:30:58 PDT 2016

Justin Holewinski <jholewinski at nvidia.com> writes:
> On 7/6/2016 1:49 PM, Justin Lebar wrote:
>>> Subject: [PATCH 3/4] NVPTX: Remove the legacy ptx intrinsics
>>
>> Looks like we're losing a bunch of intrinsics here.  Some of them like
>> reading tid.w I don't care about (does that even work??).  But most of
>> the others seem to be providing real functionality that afaict isn't
>> available elsewhere.  Why are we removing them?
>
> Any llvm.ptx intrinsic should have a matching llvm.nvvm intrinsic. If
> not, then its a bug.

I've gone ahead and committed the ptx_shfl -> nvvm_shfl patches and the
syncthread patch in r274662-r274664. I'll wait on the ones that drop the
ptx intrinsics and builtins until I hear back about the ones that don't
seem to have an nvvm equivalent.

For completeness, here's the list:

__builtin_ptx_read_laneid
__builtin_ptx_read_warpid
__builtin_ptx_read_nwarpid
__builtin_ptx_read_smid
__builtin_ptx_read_nsmid
__builtin_ptx_read_gridid
__builtin_ptx_read_lanemask_eq
__builtin_ptx_read_lanemask_le
__builtin_ptx_read_lanemask_lt
__builtin_ptx_read_lanemask_ge
__builtin_ptx_read_lanemask_gt
__builtin_ptx_read_clock
__builtin_ptx_read_clock64
__builtin_ptx_read_pm0
__builtin_ptx_read_pm1
__builtin_ptx_read_pm2
__builtin_ptx_read_pm3
__builtin_ptx_bar_sync

>>> Subject: [PATCH 1/2] NVPTX: Rename __builtin_ptx_shfl -> __nvvm_shfl
>> Please run clang-format over the changes to __clang_cuda_intrinsics.h,
>> if you wouldn't mind.
>>
>>> Subject: [PATCH 3/4] NVPTX: Remove the legacy ptx intrinsics
>> Looks like we're losing a bunch of intrinsics here.  Some of them like
>> reading tid.w I don't care about (does that even work??).  But most of
>> the others seem to be providing real functionality that afaict isn't
>> available elsewhere.  Why are we removing them?
>>
>> On Wed, Jul 6, 2016 at 4:32 AM, Justin Holewinski
>> <jholewinski at nvidia.com> wrote:
>>> Yeah, this is probably better handled on llvm-commits. Swapping mailing
>>> lists. Any particular reason these aren't up on Phabricator?
>>>
>>> Overall, the changes look good. A few comments:
>>>
>>>
>>> --- a/lib/Target/NVPTX/NVPTXInstrInfo.cpp
>>> +++ b/lib/Target/NVPTX/NVPTXInstrInfo.cpp
>>> @@ -112,8 +112,6 @@ bool NVPTXInstrInfo::isStoreInstr(const MachineInstr
>>> &MI,
>>>
>>>   bool NVPTXInstrInfo::CanTailMerge(const MachineInstr *MI) const {
>>>     unsigned addrspace = 0;
>>> -  if (MI->getOpcode() == NVPTX::INT_CUDA_SYNCTHREADS)
>>> -    return false;
>>>     if (isLoadInstr(*MI, addrspace))
>>>       if (addrspace == NVPTX::PTXLdStInstCode::SHARED)
>>>
>>> This should be replaced with the bar0 intrinsic. Looks like a bug that bar0
>>> is missing anyway.
>>>
>>> --- a/include/clang/Basic/BuiltinsNVPTX.def
>>> +++ b/include/clang/Basic/BuiltinsNVPTX.def
>>> @@ -15,50 +15,21 @@
>>>   // The format of this database matches clang/Basic/Builtins.def.
>>>
>>>   // Builtins retained from previous PTX back-end
>>>
>>> I would just remove the comment about being retained from the previous
>>> back-end. The intent is to remove all legacy compatibility.
>>>
>>> --- a/test/Analysis/DivergenceAnalysis/NVPTX/diverge.ll
>>> +++ b/test/Analysis/DivergenceAnalysis/NVPTX/diverge.ll
>>> @@ -93,20 +93,20 @@ merge:
>>>   ; int i = 0;
>>>   ; do {
>>>   ;   i++;                  // i here is uniform
>>> -; } while (i < laneid);
>>> +; } while (i < tid);
>>>   ; return i == 10 ? 0 : 1; // i here is divergent
>>>   ;
>>>   ; The i defined in the loop is used outside.
>>>   define i32 @loop() {
>>>   ; CHECK-LABEL: Printing analysis 'Divergence Analysis' for function 'loop'
>>>   entry:
>>> -  %laneid = call i32 @llvm.ptx.read.laneid()
>>> +  %tid = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
>>>     br label %loop
>>>   loop:
>>>     %i = phi i32 [ 0, %entry ], [ %i1, %loop ]
>>>   ; CHECK-NOT: DIVERGENT: %i =
>>>     %i1 = add i32 %i, 1
>>> -  %exit_cond = icmp sge i32 %i1, %laneid
>>> +  %exit_cond = icmp sge i32 %i1, %tid
>>>     br i1 %exit_cond, label %loop_exit, label %loop
>>>   loop_exit:
>>>     %cond = icmp eq i32 %i, 10
>>>
>>> What is the intent of the change from laneid to tid?
>>>
>>>
>>>
>>>
>>> On 7/5/2016 5:56 PM, Justin Bogner wrote:
>>>> Justin Lebar <jlebar at google.com> writes:
>>>>>> @jlebar, any issues on your end with removing all of the non-nvvm
>>>>>> intrinsics?
>>>>> I don't think so, if the functionality is all there from the nvvm
>>>>> intrinsics.  Some if it may involve clang changes, which I'm not
>>>>> exactly volunteering to do, but they should be simple.
>>>> Okay, so if I've understood you both correctly, the attached six patches
>>>> should do the trick, for both LLVM and clang. First, they rename the
>>>> shfl intrinsics from __builtin_ptx_shfl to __nvvm_shfl. Second,
>>>> they replace uses of cuda.syncthreads with nvvm.barrier0. Third, all of
>>>> the ptx intrinsics and builtins are removed. Finally, I wrap the nvvm
>>>> intrinsics with a TargetPrefix.
>>>>
>>>> Let me know if you want me to break any of these out into specific
>>>> review threads on -commits.
>>>>
>>>
>>> -----------------------------------------------------------------------------------
>>> This email message is for the sole use of the intended recipient(s) and may
>>> contain
>>> confidential information.  Any unauthorized review, use, disclosure or
>>> distribution
>>> is prohibited.  If you are not the intended recipient, please contact the
>>> sender by
>>> reply email and destroy all copies of the original message.
>>> -----------------------------------------------------------------------------------