<div dir="ltr"><div dir="ltr"><div class="gmail_default" style="font-family:verdana,sans-serif"><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Mar 10, 2021 at 3:44 PM Johannes Doerfert <<a href="mailto:johannesdoerfert@gmail.com">johannesdoerfert@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><br>

On 3/10/21 4:38 PM, Artem Belevich wrote:<br>

> On Wed, Mar 10, 2021 at 1:55 PM Johannes Doerfert <<br>

> <a href="mailto:johannesdoerfert@gmail.com" target="_blank">johannesdoerfert@gmail.com</a>> wrote:<br>

><br>

>> On 3/10/21 3:25 PM, Artem Belevich wrote:<br>

>>> On Wed, Mar 10, 2021 at 12:57 PM Johannes Doerfert <<br>

>>> <a href="mailto:johannesdoerfert@gmail.com" target="_blank">johannesdoerfert@gmail.com</a>> wrote:<br>

>>><br>

>>>> Right. We could keep the definition of __nv_cos and friends<br>

>>>> around. Right now, -ffast-math might just crash on the user,<br>

>>>> which is arguably a bad thing. I can also see us benefiting<br>

>>>> in various other ways from llvm.cos uses instead of __nv_cos<br>

>>>> (assuming precision is according to the user requirements but<br>

>>>> that is always a condition).<br>

>>>><br>

>>>> It could be as simple as introducing __nv_cos into<br>

>>>> "llvm.used" and a backend matching/rewrite pass.<br>

>>>><br>

>>>> If the backend knew the libdevice location it could even pick<br>

>>>> the definitions from there. Maybe we could link libdevice late<br>

>>>> instead of eager?<br>

>>>><br>

>>> It's possible, but it would require plumbing in CUDA SDK awareness into<br>

>>> LLVM. While clang driver can deal with that, LLVM currently can't. The<br>

>>> bitcode library path would have to be provided by the user.<br>

>> The PTX backend could arguably be CUDA SDK aware, IMHO, it would<br>

>> even be fine if the middle-end does the remapping to get inlining<br>

>> and folding benefits also after __nv_cos is used. See below.<br>

>><br>

>><br>

>>> The standard library as bitcode raises some questions.<br>

>> Which standard library? CUDAs libdevice is a bitcode library, right?<br>

>><br>

> It's whatever LLVM will need to lower libcalls to. libdevice bitcode is the<br>

> closest approximation of that we have at the moment.<br>

><br>

><br>

>>> * When do we want to do the linking? If we do it at the beginning, then<br>

>> the<br>

>>> question is how to make sure unused functions are not eliminated before<br>

>> we<br>

>>> may need them, as we don't know a priori what's going to be needed. We<br>

>> also<br>

>>> do want the unused functions to be gone after we're done. Linking it in<br>

>>> early would allow optimizing the code better at the expense of having to<br>

>>> optimize a lot of code we'll throw away. Linking it in late has less<br>

>>> overhead, but leaves the linked in bitcode unoptimized, though it's<br>

>>> probably in the ballpark of what would happen with a real library call.<br>

>>> I.e. no inlining, etc.<br>

>>><br>

>>> * It incorporates linking into LLVM, which is not LLVM's job. Arguably,<br>

>> the<br>

>>> line should be drawn at the lowering to libcalls as it's done for other<br>

>>> back-ends. However, we're also constrained by the need to have the<br>

>>> linking done before we generate PTX which prevents doing it after LLVM is<br>

>>> done generating an object file.<br>

>> I'm confused. Clang links in libdevice.bc early.<br>

><br>

> Yes. Because that's where it has to happen if we want to keep LLVM unaware<br>

> of CUDA SDK.<br>

> It does not have to be the case if/when LLVM can do the linking itself.<br>

><br>

><br>

>> If we make sure<br>

>> `__nv_cos` is not deleted early, we can at any point "lower" `llvm.cos`<br>

>> to `__nv_cos` which is available. After the lowering we can remove<br>

>> the artificial uses of `__nv_XXX` functions that we used to keep the<br>

>> definitions around in order to remove them from the final result.<br>

>><br>

> This is the 'link early' approach, I should've been explicit that it's<br>

> 'link early *everything*' as opposed to linking only what's needed at the<br>

> beginning.<br>

> It would work at the expense of having to process/optimize 500KB worth of<br>

> bitcode for every compilation, whether it needs it or not.<br>

><br>

><br>

>> We get the benefit of having `llvm.cos` for some of the pipeline,<br>

>> we know it does not have all the bad effects while `__nv_cos` is defined<br>

>> with inline assembly. We also get the benefit of inlining `__nv_cos`<br>

>> and folding the implementation based on the arguments. Finally,<br>

>> this should work with the existing pipeline, the linking is the same<br>

>> as before, all we do is to keep the definitions alive longer and<br>

>> lower `llvm.cos` to `__nv_cos` in a middle end pass.<br>

>><br>

> Again, I agree that it is doable.<br>

><br>

><br>

><br>

>> This might be similar to the PTX solution you describe below but I feel<br>

>> we get the inline benefit from this without actually changing the pipeline<br>

>> at all.<br>

>><br>

> So, to summarize:<br>

> * link the library as bitcode early, add artificial placeholders for<br>

> everything, compile, remove placeholders and DCE unused stuff away.<br>

>    Pros:<br>

>       - we're already doing most of it before clang hands off IR to<br>

> LLVM, so it just pushes it a bit lower in the compilation.<br>

>    Cons:<br>

>       - runtime cost of optimizing libdevice bitcode,<br>

>       - libdevice may be required for all NVPTX compilations?<br>

><br>

> * link the library as bitcode late.<br>

>     Pros:<br>

>       - lower runtime cost than link-early approach.<br>

>     Cons:<br>

>       - We'll need to make sure that NVVMReflect pass processes the library.<br>

>       - less optimizations on the library functions. Some of the code gets<br>

> DCE'ed away after NVVMReflect and the rest could be optimized better.<br>

>       - libdevice may be required for all NVPTX compilations?<br>

> * 'link' with the library as PTX appended as text to LLVM's output and let<br>

> ptxas do the 'linking'<br>

>    Pros:  LLVM remains agnostic of CUDA SDK installation details. All it<br>

> does is allow lowering libcalls and leave their resolution to the<br>

> external tools.<br>

>    Cons: Need to have the PTX library somewhere and need to integrate the<br>

> 'linking' into the compilation process somehow.<br>

><br>

> Neither is particularly good. If the runtime overhead of link-early is<br>

> acceptable, then it may be a winner here, by a very small margin.<br>

> link-as-PTX may be better conceptually as it keeps linking and compilation<br>

> separate.<br>

><br>

> As for the practical steps, here's what we need:<br>

> - allow libcall lowering in NVPTX, possibly guarded by a flag. This is<br>

> needed for all of the approaches above.<br>

> - teach LLVM how to link in bitcode (and, possibly, control early/late mode)<br>

> - teach clang driver to delegate libdevice linking to LLVM.<br>

><br>

> This will allow us to experiment with all three approaches and see what<br>

> works best.<br>

<br>

I think if we <span class="gmail_default" style="font-family:verdana,sans-serif"></span>embed knowledge about the nv_XXX functions we can<br>

even get away without the cons you listed for early linking above.<br></blockquote><div><br></div><div><div class="gmail_default" style="font-family:verdana,sans-serif">WDYM by `<span class="gmail_default"></span><span style="font-family:Arial,Helvetica,sans-serif">embed knowledge about the nv_XXX functions</span>`? By linking those functions in? Or do you mean that we should just declare them before/instead of linking libdevice in?</div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<br>

For early link I'm assuming an order similar to [0] but I also discuss<br>

the case where we don't link libdevice early for a TU.<br></blockquote><div><br></div><div><div class="gmail_default" style="font-family:verdana,sans-serif">That link just describes the steps needed to use libdevice. It does not deal with how/where it fits in the LLVM pipeline.</div><div class="gmail_default" style="font-family:verdana,sans-serif">The gist is that NVVMReflect replaces some conditionals with constants. libdevice uses that as a poor man's IR preprocessor, conditionally enabling different implementations and relying on DCE and constant folding to remove unused parts and eliminate the now-useless branches.</div><div class="gmail_default" style="font-family:verdana,sans-serif">While running NVVMReflect alone will make libdevice code valid and usable, it would still benefit from further optimizations. I do not know to what degree, though.</div></div><div> <br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<br>

Link early:<br>

1) clang emits module.bc and links in libdevice.bc but with the<br>

    `optnone`, `noinline`, and "used" attribute for functions in<br>

    libdevice. ("used" is not an attribute but could as well be.)<br>

    At this stage module.bc might call __nv_XXX or llvm.XXX freely<br>

    as defined by -ffast-math and friends.<br></blockquote><div><br></div><div><div class="gmail_default" style="font-family:verdana,sans-serif">That could work. Just carrying extra IR around would probably be OK.</div><div class="gmail_default" style="font-family:verdana,sans-serif">We may want to do NVVMReflect as soon as we have it linked in and, maybe, allow optimizing the functions that are explicitly used already.</div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"></blockquote><div class="gmail_default" style="font-family:verdana,sans-serif"><span style="font-family:Arial,Helvetica,sans-serif"> </span><br></div></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

2) Run some optimizations in the middle end, maybe till the end of<br>

    the inliner loop, unsure.<br>

3) Run a libcall lowering pass and another NVVMReflect pass (or the<br>

    only instance thereof).<span class="gmail_default" style="font-family:verdana,sans-serif"> </span>We effectively remove all llvm.XXX calls</blockquote><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

    in favor of __nv_XXX now. Note that we haven't spent (much) time<br>

    on the libdevice code as it is optnone and most passes are good<br>

    at skipping those. To me, it's unclear if the used parts should<br>

    not be optimized before we inline them anyway to avoid redoing<br>

    the optimizations over and over (per call site). That needs<br>

    measuring I guess. Also note that we can still retain the current<br>

    behavior for direct calls to __nv_XXX if we mark the call sites<br>

    as `alwaysinline`, or at least get behavior that is almost like the<br>

    current one.<br>

4) Run an always inliner pass on the __nv_XXX calls because it is<br>

    something we would do right now. Alternatively, remove `optnone`<br>

    and `noinline` from the __nv_XXX calls.<br>

5) Continue with the pipeline as before.<br>
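<br>
As a rough sketch of the steps above, the module right after step 1 could look like the IR below. It assumes typed-pointer IR as used elsewhere in this thread. The function name f, the attribute group #0, and the placeholder body of __nv_cos are made up for illustration; none of this is what clang emits today.<br>
<pre>
```llvm
; Hypothetical module state after step 1 of the link-early scheme.

; Artificial use: keeps the libdevice definition alive although nothing
; calls it yet.
@llvm.used = appending global [1 x i8*]
    [i8* bitcast (double (double)* @__nv_cos to i8*)],
    section "llvm.metadata"

declare double @llvm.cos.f64(double)

; User code under -ffast-math: the front end emitted the intrinsic.
define double @f(double %x) {
entry:
  %r = call fast double @llvm.cos.f64(double %x)
  ret double %r
}

; libdevice definition, parked as noinline/optnone until the lowering pass
; "activates" it. The body is a placeholder, NOT the real implementation.
define internal double @__nv_cos(double %a) #0 {
  ret double %a
}

attributes #0 = { noinline optnone }

; Step 3 (libcall lowering) would then rewrite the call in @f to
;   %r = call fast double @__nv_cos(double %x)
; after which the @llvm.used entry and the noinline/optnone attributes can
; be dropped so that step 4 can inline and global DCE can clean up.
```
</pre>
Whether the artificial use is expressed through @llvm.used or some other mechanism is an open design choice; @llvm.used is only the option mentioned earlier in this thread.<br>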

<br></blockquote><div><br></div><div><div class="gmail_default" style="font-family:verdana,sans-serif">SGTM.</div></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<br>

As mentioned above, `optnone` avoids spending time on the libdevice<br>

until we "activate" it. At that point (globals) DCE can be scheduled<br>

to remove all unused parts right away. I don't think this is (much)<br>

more expensive than linking libdevice early right now.<br>

<br>

Link late, aka. translation units without libdevice:<br>

1) clang emits module.bc but does not link in libdevice.bc, it will be<br>

    made available later. We still can mix __nv_XXX and llvm.XXX calls<br>

    freely as above.<br>

2) Same as above.<br>

3) Same as above.<br>

4) Same as above but effectively a no-op, no __nv_XXX definitions are<br>

    available.<br>

5) Same as above.<br>

<br>

<br>

I might misunderstand something about the current pipeline but from [0]<br>

and the experiments I ran locally it looks like the above should cover all<br>

the cases. WDYT?<br>

<br></blockquote><div><br></div><div><div class="gmail_default" style="font-family:verdana,sans-serif">The `optnone` trick may indeed remove most of the practical differences between the early/late approaches.</div><div class="gmail_default" style="font-family:verdana,sans-serif">In principle it should work.</div><div class="gmail_default" style="font-family:verdana,sans-serif"><br></div><div class="gmail_default" style="font-family:verdana,sans-serif">The next question is whether libdevice is sufficient to satisfy LLVM's assumptions about the standard library.</div><div class="gmail_default" style="font-family:verdana,sans-serif">While it does provide most of the equivalents of libm functions, the set is not complete and some of the functions differ from their libm counterparts.</div><div class="gmail_default" style="font-family:verdana,sans-serif">The differences are minor, so we should be able to deal with them by generating a few wrapper functions for the odd cases.</div><div class="gmail_default" style="font-family:verdana,sans-serif">Here's what clang does to provide math functions using libdevice:</div><div class="gmail_default" style="font-family:verdana,sans-serif"><a href="https://github.com/llvm/llvm-project/blob/main/clang/lib/Headers/__clang_cuda_math.h">https://github.com/llvm/llvm-project/blob/main/clang/lib/Headers/__clang_cuda_math.h</a><br></div><div class="gmail_default" style="font-family:verdana,sans-serif"><span style="font-family:Arial,Helvetica,sans-serif"><br></span></div><div class="gmail_default" style="">The most concerning aspect of libdevice is that we don't know when we'll no longer be able to use the libdevice bitcode. My understanding is that IR does not guarantee binary stability and at some point we may just be unable to use it. 
Ideally we need our own libm for GPUs.</div><div class="gmail_default" style="font-family:verdana,sans-serif"><span style="font-family:Arial,Helvetica,sans-serif"><br></span></div><div class="gmail_default" style="font-family:verdana,sans-serif"><span style="font-family:Arial,Helvetica,sans-serif">--Artem</span></div><div class="gmail_default" style="font-family:verdana,sans-serif"><span style="font-family:Arial,Helvetica,sans-serif"> </span><br></div></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

~ Johannes<br>

<br>

<br>

P.S. If the rewrite capability (aka libcall lowering) is generic we could<br>

      use the scheme for many other things as well.<br>

<br>

<br>

[0] <a href="https://llvm.org/docs/NVPTXUsage.html#linking-with-libdevice" rel="noreferrer" target="_blank">https://llvm.org/docs/NVPTXUsage.html#linking-with-libdevice</a><br>

<br>

<br>

><br>

> --Artem<br>

><br>

><br>

>> ~ Johannes<br>

>><br>

>><br>

>>> One thing that may work within the existing compilation model is to<br>

>>> pre-compile the standard library into PTX and then textually embed<br>

>> relevant<br>

>>> functions into the generated PTX, thus pushing the 'linking' phase past<br>

>> the<br>

>>> end of LLVM's compilation and make it look closer to the standard<br>

>>> compile/link process. This way we'd only enable libcall lowering in<br>

>> NVPTX,<br>

>>> assuming that the library functions will be magically available out<br>

>> there.<br>

>>> Injection of PTX could be done with an external script outside of LLVM<br>

>> and<br>

>>> it could be incorporated into clang driver. Bonus points for the fact<br>

>> that<br>

>>> this scheme is compatible with -fgpu-rdc out of the box -- assemble the<br>

>> PTX<br>

>>> with `ptxas -rdc` and then actually link with the library, instead of<br>

>>> injecting its PTX before invoking ptxas.<br>

>>><br>

>>> --Artem<br>

>>><br>

>>> Trying to figure out a good way to have the cake and eat it too.<br>

>>>> ~ Johannes<br>

>>>><br>

>>>><br>

>>>> On 3/10/21 2:49 PM, William Moses wrote:<br>

>>>>> Since clang (and arguably any other frontend that uses) should link in<br>

>>>>> libdevice, could we lower these intrinsics to the libdevice code?<br>

>>> The linking happens *before* LLVM gets to work on IR.<br>

>>> As I said, it's a workaround, not the solution. It's possible for LLVM to<br>

>>> still attempt lowering something in the IR into a libcall and we would<br>

>> not<br>

>>> be able to deal with that. It happens to work well enough in practice.<br>

>>><br>

>>> Do you have an example where you see the problem with -ffast-math?<br>

>>><br>

>>><br>

>>><br>

>>>>> For example, consider compiling the simple device function below:<br>

>>>>><br>

>>>>> ```<br>

>>>>> // /mnt/sabrent/wmoses/llvm13/build/bin/clang <a href="http://tmp.cu" rel="noreferrer" target="_blank">tmp.cu</a> -S -emit-llvm<br>

>>>>>     --cuda-path=/usr/local/cuda-11.0 -L/usr/local/cuda-11.0/lib64<br>

>>>>> --cuda-gpu-arch=sm_37<br>

>>>>> __device__ double f(double x) {<br>

>>>>>        return cos(x);<br>

>>>>> }<br>

>>>>> ```<br>

>>>>><br>

>>>>> The LLVM module for it is as follows:<br>

>>>>><br>

>>>>> ```<br>

>>>>> ...<br>

>>>>> define dso_local double @_Z1fd(double %x) #0 {<br>

>>>>> entry:<br>

>>>>>      %__a.addr.i = alloca double, align 8<br>

>>>>>      %x.addr = alloca double, align 8<br>

>>>>>      store double %x, double* %x.addr, align 8<br>

>>>>>      %0 = load double, double* %x.addr, align 8<br>

>>>>>      store double %0, double* %__a.addr.i, align 8<br>

>>>>>      %1 = load double, double* %__a.addr.i, align 8<br>

>>>>>      %call.i = call contract double @__nv_cos(double %1) #7<br>

>>>>>      ret double %call.i<br>

>>>>> }<br>

>>>>><br>

>>>>> define internal double @__nv_cos(double %a) #1 {<br>

>>>>>      %q.i = alloca i32, align 4<br>

>>>>> ```<br>

>>>>><br>

>>>>> Obviously we would need to do something to ensure these functions don't<br>

>>>> get<br>

>>>>> deleted prior to their use in lowering from intrinsic to libdevice.<br>

>>>>> ...<br>

>>>>><br>

>>>>><br>

>>>>> On Wed, Mar 10, 2021 at 3:39 PM Artem Belevich <<a href="mailto:tra@google.com" target="_blank">tra@google.com</a>> wrote:<br>

>>>>><br>

>>>>>> On Wed, Mar 10, 2021 at 11:41 AM Johannes Doerfert <<br>

>>>>>> <a href="mailto:johannesdoerfert@gmail.com" target="_blank">johannesdoerfert@gmail.com</a>> wrote:<br>

>>>>>><br>

>>>>>>> Artem, Justin,<br>

>>>>>>><br>

>>>>>>> I am running into a problem and I'm curious if I'm missing something<br>

>> or<br>

>>>>>>> if the support is simply missing.<br>

>>>>>>> Am I correct to assume the NVPTX backend does not deal with<br>

>> `llvm.sin`<br>

>>>>>>> and friends?<br>

>>>>>>><br>

>>>>>> Correct. It can't deal with anything that may need to lower to a<br>

>>>> standard<br>

>>>>>> library call.<br>

>>>>>><br>

>>>>>>> This is what I see, with some variations:<br>

>> <a href="https://godbolt.org/z/PxsEWs" rel="noreferrer" target="_blank">https://godbolt.org/z/PxsEWs</a><br>

>>>>>>> If this is missing in the backend, is there a plan to get this<br>

>> working,<br>

>>>>>>> I'd really like to have the<br>

>>>>>>> intrinsics in the middle end rather than __nv_cos, not to mention<br>

>> that<br>

>>>>>>> -ffast-math does emit intrinsics<br>

>>>>>>> and crashes.<br>

>>>>>>><br>

>>>>>> It all boils down to the fact that PTX does not have the standard<br>

>>>>>> libc/libm which LLVM could lower the calls to, nor does it have a<br>

>>>> 'linking'<br>

>>>>>> phase where we could link such a library in, if we had it.<br>

>>>>>><br>

>>>>>> Libdevice bitcode does provide the implementations for some of the<br>

>>>>>> functions (though with a __nv_ prefix) and clang links it in in order<br>

>> to<br>

>>>>>> avoid generating IR that LLVM can't handle, but that's a workaround<br>

>> that<br>

>>>>>> does not help LLVM itself.<br>

>>>>>><br>

>>>>>> --Artem<br>

>>>>>><br>

>>>>>><br>

>>>>>><br>

>>>>>>> ~ Johannes<br>

>>>>>>><br>

>>>>>>><br>

>>>>>>> --<br>

>>>>>>> ───────────────────<br>

>>>>>>> ∽ Johannes (he/his)<br>

>>>>>>><br>

>>>>>>><br>

>>>>>> --<br>

>>>>>> --Artem Belevich<br>

>>>>>><br>

><br>

</blockquote></div><br clear="all"><div><br></div>-- <br><div dir="ltr" class="gmail_signature"><div dir="ltr">--Artem Belevich</div></div></div>