[llvm-dev] NVPTX codegen for llvm.sin (and friends)

Johannes Doerfert via llvm-dev llvm-dev at lists.llvm.org
Wed Mar 10 16:56:40 PST 2021


On 3/10/21 6:22 PM, Artem Belevich wrote:
> On Wed, Mar 10, 2021 at 3:44 PM Johannes Doerfert <
> johannesdoerfert at gmail.com> wrote:
>
>> On 3/10/21 4:38 PM, Artem Belevich wrote:
>>> On Wed, Mar 10, 2021 at 1:55 PM Johannes Doerfert <
>>> johannesdoerfert at gmail.com> wrote:
>>>
>>>> On 3/10/21 3:25 PM, Artem Belevich wrote:
>>>>> On Wed, Mar 10, 2021 at 12:57 PM Johannes Doerfert <
>>>>> johannesdoerfert at gmail.com> wrote:
>>>>>
>>>>>> Right. We could keep the definition of __nv_cos and friends
>>>>>> around. Right now, -ffast-math might just crash on the user,
>>>>>> which is arguably a bad thing. I can also see us benefiting
>>>>>> in various other ways from llvm.cos uses instead of __nv_cos
>>>>>> (assuming precision is according to the user requirements but
>>>>>> that is always a condition).
>>>>>>
>>>>>> It could be as simple as introducing __nv_cos into
>>>>>> "llvm.used" and a backend matching/rewrite pass.
>>>>>>
>>>>>> If the backend knew the libdevice location it could even pick
>>>>>> the definitions from there. Maybe we could link libdevice late
>>>>>> instead of eager?
>>>>>>
>>>>> It's possible, but it would require plumbing in CUDA SDK awareness into
>>>>> LLVM. While clang driver can deal with that, LLVM currently can't. The
>>>>> bitcode library path would have to be provided by the user.
>>>> The PTX backend could arguably be CUDA SDK aware; IMHO, it would
>>>> even be fine if the middle-end does the remapping to get inlining
>>>> and folding benefits also after __nv_cos is used. See below.
>>>>
>>>>
>>>>> The standard library as bitcode raises some questions.
>>>> Which standard library? CUDA's libdevice is a bitcode library, right?
>>>>
>>> It's whatever LLVM will need to lower libcalls to. libdevice bitcode is
>>> the closest approximation of that we have at the moment.
>>>
>>>
>>>>> * When do we want to do the linking? If we do it at the beginning,
>>>>> then the question is how to make sure unused functions are not
>>>>> eliminated before we may need them, as we don't know a priori what's
>>>>> going to be needed. We also do want the unused functions to be gone
>>>>> after we're done. Linking it in early would allow optimizing the code
>>>>> better at the expense of having to optimize a lot of code we'll throw
>>>>> away. Linking it in late has less overhead, but leaves the linked-in
>>>>> bitcode unoptimized, though it's probably in the ballpark of what
>>>>> would happen with a real library call, i.e. no inlining, etc.
>>>>>
>>>>> * It incorporates linking into LLVM, which is not LLVM's job.
>>>>> Arguably, the line should be drawn at the lowering to libcalls, as
>>>>> it's done for other back-ends. However, we're also constrained by the
>>>>> need to have the linking done before we generate PTX, which prevents
>>>>> doing it after LLVM is done generating an object file.
>>>> I'm confused. Clang links in libdevice.bc early.
>>> Yes. Because that's where it has to happen if we want to keep LLVM
>>> unaware of CUDA SDK.
>>> It does not have to be the case if/when LLVM can do the linking itself.
>>>
>>>
>>>> If we make sure
>>>> `__nv_cos` is not deleted early, we can at any point "lower" `llvm.cos`
>>>> to `__nv_cos` which is available. After the lowering we can remove
>>>> the artificial uses of `__nv_XXX` functions that we used to keep the
>>>> definitions around in order to remove them from the final result.
>>>>
>>> This is the 'link early' approach, I should've been explicit that it's
>>> 'link early *everything*' as opposed to linking only what's needed at the
>>> beginning.
>>> It would work at the expense of having to process/optimize 500KB worth of
>>> bitcode for every compilation, whether it needs it or not.
>>>
>>>
>>>> We get the benefit of having `llvm.cos` for some of the pipeline,
>>>> we know it does not have all the bad effects while `__nv_cos` is defined
>>>> with inline assembly. We also get the benefit of inlining `__nv_cos`
>>>> and folding the implementation based on the arguments. Finally,
>>>> this should work with the existing pipeline, the linking is the same
>>>> as before, all we do is to keep the definitions alive longer and
>>>> lower `llvm.cos` to `__nv_cos` in a middle end pass.
>>>>
>>> Again, I agree that it is doable.
>>>
>>>
>>>
>>>> This might be similar to the PTX solution you describe below but I feel
>>>> we get the inline benefit from this without actually changing the
>>>> pipeline at all.
>>>>
>>> So, to summarize:
>>> * link the library as bitcode early, add artificial placeholders for
>>> everything, compile, remove placeholders and DCE unused stuff away.
>>>     Pros:
>>>        - we're already doing most of it before clang hands off IR to
>>> LLVM, so it just pushes it a bit lower in the compilation.
>>>     Cons:
>>>        - runtime cost of optimizing libdevice bitcode,
>>>        - libdevice may be required for all NVPTX compilations?
>>>
>>> * link the library as bitcode late.
>>>      Pros:
>>>        - lower runtime cost than link-early approach.
>>>      Cons:
>>>        - We'll need to make sure that the NVVMReflect pass processes
>>> the library.
>>>        - fewer optimizations on the library functions. Some of the code
>>> gets DCE'ed away after NVVMReflect and the rest could be optimized
>>> better.
>>>        - libdevice may be required for all NVPTX compilations?
>>> * 'link' with the library as PTX appended as text to LLVM's output and
>>> let ptxas do the 'linking'.
>>>     Pros:  LLVM remains agnostic of CUDA SDK installation details. All
>>> it does is allow lowering libcalls and leave their resolution to the
>>> external tools.
>>>     Cons: Need to have the PTX library somewhere and need to integrate the
>>> 'linking' into the compilation process somehow.
>>>
>>> Neither is particularly good. If the runtime overhead of link-early is
>>> acceptable, then it may be a winner here, by a very small margin.
>>> Link-as-PTX may be better conceptually as it keeps linking and
>>> compilation separate.
>>>
>>> As for the practical steps, here's what we need:
>>> - allow libcall lowering in NVPTX, possibly guarded by a flag. This is
>>> needed for all of the approaches above.
>>> - teach LLVM how to link in bitcode (and, possibly, control early/late
>>> mode)
>>> - teach clang driver to delegate libdevice linking to LLVM.
>>>
>>> This will allow us to experiment with all three approaches and see what
>>> works best.
>> I think if we embed knowledge about the nv_XXX functions we can
>> even get away without the cons you listed for early linking above.
>>
> WDYM by `embed knowledge about the nv_XXX functions`? By linking those
> functions in? Or do you mean that we should just declare them
> before/instead of linking libdevice in?
I mean by providing the "libcall lowering" pass, i.e., the knowledge
that llvm.cos maps to __nv_cos.
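
To make that concrete, here is a rough new-pass-manager sketch of what I
have in mind (my illustration only, not an existing pass; it handles just
the f64 llvm.cos -> __nv_cos case, and a real pass would of course need a
full intrinsic-to-__nv_* mapping table plus the f32/vector variants):

```cpp
// Minimal sketch of a "libcall lowering" pass: rewrite llvm.cos.f64 calls
// into calls to libdevice's __nv_cos. Untested illustration only.
#include "llvm/ADT/STLExtras.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/IntrinsicInst.h"
#include "llvm/IR/Module.h"
#include "llvm/IR/PassManager.h"

using namespace llvm;

struct NVPTXLibcallLoweringPass : PassInfoMixin<NVPTXLibcallLoweringPass> {
  PreservedAnalyses run(Module &M, ModuleAnalysisManager &) {
    LLVMContext &Ctx = M.getContext();
    // Declare (or reuse) the libdevice function we lower to.
    FunctionCallee NVCos = M.getOrInsertFunction(
        "__nv_cos", Type::getDoubleTy(Ctx), Type::getDoubleTy(Ctx));
    bool Changed = false;
    for (Function &F : M)
      for (BasicBlock &BB : F)
        for (Instruction &I : make_early_inc_range(BB))
          if (auto *II = dyn_cast<IntrinsicInst>(&I))
            if (II->getIntrinsicID() == Intrinsic::cos &&
                II->getType()->isDoubleTy()) {
              IRBuilder<> B(II);
              CallInst *NewCall =
                  B.CreateCall(NVCos, {II->getArgOperand(0)});
              NewCall->copyFastMathFlags(II); // keep -ffast-math semantics
              II->replaceAllUsesWith(NewCall);
              II->eraseFromParent();
              Changed = true;
            }
    return Changed ? PreservedAnalyses::none() : PreservedAnalyses::all();
  }
};
```

Running something like this late keeps llvm.cos visible to the optimizer
for most of the pipeline while still ending up with __nv_cos for the
backend.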

>
>
>> For early link I'm assuming an order similar to [0] but I also discuss
>> the case where we don't link libdevice early for a TU.
>>
> That link just describes the steps needed to use libdevice. It does not
> deal with how/where it fits in the LLVM pipeline.
> The gist is that NVVMReflect replaces some conditionals with constants.
> libdevice uses that as a poor man's IR preprocessor, conditionally enabling
> different implementations and relying on DCE and constant folding to remove
> unused parts and eliminate the now useless branches.
> While running NVVMReflect alone will make libdevice code valid and usable, it
> would still benefit from further optimizations. I do not know to what
> degree, though.
>
>
>> Link early:
>> 1) clang emits module.bc and links in libdevice.bc but with the
>>      `optnone`, `noinline`, and "used" attribute for functions in
>>      libdevice. ("used" is not an attribute but could as well be.)
>>      At this stage module.bc might call __nv_XXX or llvm.XXX freely
>>      as defined by -ffast-math and friends.
>>
> That could work. Just carrying extra IR around would probably be OK.
> We may want to do NVVMReflect as soon as we have it linked in and, maybe,
> allow optimizing the functions that are explicitly used already.

Right. NVVMReflect can be run twice, and with `alwaysinline` on the
call sites of __nv_XXX functions we will actually inline and optimize
them, while the definitions are just "dragged along" in case we need
them later.


>> 2) Run some optimizations in the middle end, maybe till the end of
>>      the inliner loop, unsure.
>> 3) Run a libcall lowering pass and another NVVMReflect pass (or the
>>      only instance thereof). We effectively remove all llvm.XXX calls
>>      in favor of __nv_XXX now. Note that we haven't spent (much) time
>>      on the libdevice code as it is optnone and most passes are good
>>      at skipping those. To me, it's unclear whether the used parts
>>      should be optimized before we inline them anyway, to avoid redoing
>>      the optimizations over and over (per call site). That needs
>>      measuring, I guess. Also note that we can still retain the current
>>      behavior for direct calls to __nv_XXX if we mark the call sites
>>      as `alwaysinline`, or at least the behavior is almost the same as
>>      the current one.
>> 4) Run an always inliner pass on the __nv_XXX calls because it is
>>      something we would do right now. Alternatively, remove `optnone`
>>      and `noinline` from the __nv_XXX calls.
>> 5) Continue with the pipeline as before.
>>
>>
> SGTM.
>
>
>> As mentioned above, `optnone` avoids spending time on the libdevice
>> until we "activate" it. At that point (globals) DCE can be scheduled
>> to remove all unused parts right away. I don't think this is (much)
>> more expensive than linking libdevice early right now.
>>
>> Link late, aka. translation units without libdevice:
>> 1) clang emits module.bc but does not link in libdevice.bc, it will be
>>      made available later. We still can mix __nv_XXX and llvm.XXX calls
>>      freely as above.
>> 2) Same as above.
>> 3) Same as above.
>> 4) Same as above but effectively a no-op, no __nv_XXX definitions are
>>      available.
>> 5) Same as above.
>>
>>
>> I might misunderstand something about the current pipeline but from [0]
>> and the experiments I ran locally it looks like the above should cover all
>> the cases. WDYT?
>>
>>
> The `optnone` trick may indeed remove much of the practical difference
> between the early/late approaches.
> In principle it should work.
>
> The next question is: is libdevice sufficient to satisfy LLVM's
> assumptions about the standard library?
> While it does provide most of the equivalents of libm functions, the set is
> not complete and some of the functions differ from their libm counterparts.
> The differences are minor, so we should be able to deal with them by
> generating a few wrapper functions for the odd cases.
> Here's what clang does to provide math functions using libdevice:
> https://github.com/llvm/llvm-project/blob/main/clang/lib/Headers/__clang_cuda_math.h

Right now, clang will generate calls to any LLVM intrinsic and we crash, so
anything else is probably a step in the right direction. Eventually, we
should "lower" all intrinsics that the NVPTX backend can't handle, or at
least emit a nice error message. Preferably, clang would know what we can't
deal with and not generate intrinsic calls for those in the first place.
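
For the "nice error message" part, something along these lines could work
(a rough sketch, not existing NVPTX code; the set of intrinsics checked is
just an example):

```cpp
// Sketch of a "nice error" fallback: before instruction selection, report
// leftover math intrinsics the NVPTX backend cannot lower instead of
// crashing during ISel. Untested illustration.
#include "llvm/IR/DiagnosticInfo.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/InstIterator.h"
#include "llvm/IR/IntrinsicInst.h"

using namespace llvm;

void diagnoseUnsupportedIntrinsics(Function &F) {
  for (Instruction &I : instructions(F))
    if (auto *II = dyn_cast<IntrinsicInst>(&I))
      switch (II->getIntrinsicID()) {
      case Intrinsic::sin:
      case Intrinsic::cos:
      case Intrinsic::pow: {
        DiagnosticInfoUnsupported D(
            F,
            "math intrinsic is not supported by the NVPTX backend; "
            "lower it to a __nv_* libdevice call first",
            II->getDebugLoc());
        F.getContext().diagnose(D);
        break;
      }
      default:
        break;
      }
}
```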


>
> The most concerning aspect of libdevice is that we don't know when we'll no
> longer be able to use the libdevice bitcode. My understanding is that IR
> does not guarantee binary stability and at some point we may just be unable
> to use it. Ideally we need our own libm for GPUs.

For OpenMP I did my best to avoid writing libm (code) for GPUs by
piggybacking on CUDA and libc++ implementations, and I hope it will stay
that way.
That said, if the need arises we might really have to port libc++ to the
GPUs.

Back to the problem with libdevice. I agree that NVIDIA's solution of
shipping a .bc library is suboptimal, but with the existing, or an extended,
auto-upgrader we might be able to make that work reasonably well for the
foreseeable future. That problem is orthogonal to what we are discussing
above, I think.

~ Johannes


>
> --Artem
>
>
>> ~ Johannes
>>
>>
>> P.S. If the rewrite capability (aka libcall lowering) is generic we could
>>        use the scheme for many other things as well.
>>
>>
>> [0] https://llvm.org/docs/NVPTXUsage.html#linking-with-libdevice
>>
>>
>>> --Artem
>>>
>>>
>>>> ~ Johannes
>>>>
>>>>
>>>>> One thing that may work within the existing compilation model is to
>>>>> pre-compile the standard library into PTX and then textually embed
>>>>> relevant functions into the generated PTX, thus pushing the 'linking'
>>>>> phase past the end of LLVM's compilation and making it look closer to
>>>>> the standard compile/link process. This way we'd only enable libcall
>>>>> lowering in NVPTX, assuming that the library functions will be
>>>>> magically available out there.
>>>>> Injection of PTX could be done with an external script outside of
>>>>> LLVM and it could be incorporated into the clang driver. Bonus points
>>>>> for the fact that this scheme is compatible with -fgpu-rdc out of the
>>>>> box -- assemble the PTX with `ptxas -rdc` and then actually link with
>>>>> the library, instead of injecting its PTX before invoking ptxas.
>>>>>
>>>>> --Artem
>>>>>
>>>>>> Trying to figure out a good way to have the cake and eat it too.
>>>>>> ~ Johannes
>>>>>>
>>>>>>
>>>>>> On 3/10/21 2:49 PM, William Moses wrote:
>>>>>>> Since clang (and arguably any other frontend that uses it) should
>>>>>>> link in libdevice, could we lower these intrinsics to the libdevice
>>>>>>> code?
>>>>> The linking happens *before* LLVM gets to work on IR.
>>>>> As I said, it's a workaround, not the solution. It's possible for
>>>>> LLVM to still attempt lowering something in the IR into a libcall and
>>>>> we would not be able to deal with that. It happens to work well
>>>>> enough in practice.
>>>>>
>>>>> Do you have an example where you see the problem with -ffast-math?
>>>>>
>>>>>
>>>>>
>>>>>>> For example, consider compiling the simple device function below:
>>>>>>>
>>>>>>> ```
>>>>>>> // /mnt/sabrent/wmoses/llvm13/build/bin/clang tmp.cu -S -emit-llvm
>>>>>>>      --cuda-path=/usr/local/cuda-11.0 -L/usr/local/cuda-11.0/lib64
>>>>>>> --cuda-gpu-arch=sm_37
>>>>>>> __device__ double f(double x) {
>>>>>>>         return cos(x);
>>>>>>> }
>>>>>>> ```
>>>>>>>
>>>>>>> The LLVM module for it is as follows:
>>>>>>>
>>>>>>> ```
>>>>>>> ...
>>>>>>> define dso_local double @_Z1fd(double %x) #0 {
>>>>>>> entry:
>>>>>>>       %__a.addr.i = alloca double, align 8
>>>>>>>       %x.addr = alloca double, align 8
>>>>>>>       store double %x, double* %x.addr, align 8
>>>>>>>       %0 = load double, double* %x.addr, align 8
>>>>>>>       store double %0, double* %__a.addr.i, align 8
>>>>>>>       %1 = load double, double* %__a.addr.i, align 8
>>>>>>>       %call.i = call contract double @__nv_cos(double %1) #7
>>>>>>>       ret double %call.i
>>>>>>> }
>>>>>>>
>>>>>>> define internal double @__nv_cos(double %a) #1 {
>>>>>>>       %q.i = alloca i32, align 4
>>>>>>> ```
>>>>>>>
>>>>>>> Obviously we would need to do something to ensure these functions
>>>>>>> don't get deleted prior to their use in lowering from intrinsic to
>>>>>>> libdevice.
>>>>>>> ...
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Mar 10, 2021 at 3:39 PM Artem Belevich <tra at google.com> wrote:
>>>>>>>> On Wed, Mar 10, 2021 at 11:41 AM Johannes Doerfert <
>>>>>>>> johannesdoerfert at gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Artem, Justin,
>>>>>>>>>
>>>>>>>>> I am running into a problem and I'm curious if I'm missing
>>>>>>>>> something or if the support is simply missing.
>>>>>>>>> Am I correct to assume the NVPTX backend does not deal with
>>>>>>>>> `llvm.sin` and friends?
>>>>>>>>>
>>>>>>>> Correct. It can't deal with anything that may need to lower to a
>>>>>>>> standard library call.
>>>>>>>>
>>>>>>>>> This is what I see, with some variations:
>>>>>>>>> https://godbolt.org/z/PxsEWs
>>>>>>>>> If this is missing in the backend, is there a plan to get this
>>>>>>>>> working? I'd really like to have the intrinsics in the middle end
>>>>>>>>> rather than __nv_cos, not to mention that -ffast-math does emit
>>>>>>>>> intrinsics and crashes.
>>>>>>>>>
>>>>>>>> It all boils down to the fact that PTX does not have the standard
>>>>>>>> libc/libm which LLVM could lower the calls to, nor does it have a
>>>>>>>> 'linking' phase where we could link such a library in, if we had
>>>>>>>> it.
>>>>>>>>
>>>>>>>> Libdevice bitcode does provide the implementations for some of the
>>>>>>>> functions (though with a __nv_ prefix) and clang links it in in
>>>>>>>> order to avoid generating IR that LLVM can't handle, but that's a
>>>>>>>> workaround that does not help LLVM itself.
>>>>>>>>
>>>>>>>> --Artem
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> ~ Johannes
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> ───────────────────
>>>>>>>>> ∽ Johannes (he/his)
>>>>>>>>>
>>>>>>>>>
>>>>>>>> --
>>>>>>>> --Artem Belevich
>>>>>>>>
>

