<div dir="ltr">We could also consider doing something slightly broader.<br><br>For example, we could annotate the llvm.cos call or declaration with metadata or an attribute that points to the actual __nv_cos function. A subsequent lowering pass would then replace the uses of any intrinsic carrying that annotation with the function it names.<br><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Mar 10, 2021 at 7:57 PM Johannes Doerfert <<a href="mailto:johannesdoerfert@gmail.com">johannesdoerfert@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><br>

On 3/10/21 6:22 PM, Artem Belevich wrote:<br>

> On Wed, Mar 10, 2021 at 3:44 PM Johannes Doerfert <<br>

> <a href="mailto:johannesdoerfert@gmail.com" target="_blank">johannesdoerfert@gmail.com</a>> wrote:<br>

><br>

>> On 3/10/21 4:38 PM, Artem Belevich wrote:<br>

>>> On Wed, Mar 10, 2021 at 1:55 PM Johannes Doerfert <<br>

>>> <a href="mailto:johannesdoerfert@gmail.com" target="_blank">johannesdoerfert@gmail.com</a>> wrote:<br>

>>><br>

>>>> On 3/10/21 3:25 PM, Artem Belevich wrote:<br>

>>>>> On Wed, Mar 10, 2021 at 12:57 PM Johannes Doerfert <<br>

>>>>> <a href="mailto:johannesdoerfert@gmail.com" target="_blank">johannesdoerfert@gmail.com</a>> wrote:<br>

>>>>><br>

>>>>>> Right. We could keep the definition of __nv_cos and friends<br>

>>>>>> around. Right now, -ffast-math might just crash on the user,<br>

>>>>>> which is arguably a bad thing. I can also see us benefiting<br>

>>>>>> in various other ways from llvm.cos uses instead of __nv_cos<br>

>>>>>> (assuming precision is according to the user requirements but<br>

>>>>>> that is always a condition).<br>

>>>>>><br>

>>>>>> It could be as simple as introducing __nv_cos into<br>

>>>>>> "llvm.used" and a backend matching/rewrite pass.<br>

>>>>>><br>

>>>>>> If the backend knew the libdevice location it could even pick<br>

>>>>>> the definitions from there. Maybe we could link libdevice late<br>

>>>>>> instead of eagerly?<br>

>>>>>><br>

>>>>> It's possible, but it would require plumbing in CUDA SDK awareness into<br>

>>>>> LLVM. While clang driver can deal with that, LLVM currently can't. The<br>

>>>>> bitcode library path would have to be provided by the user.<br>

>>>> The PTX backend could arguably be CUDA SDK aware. IMHO, it would<br>

>>>> even be fine if the middle end does the remapping to get inlining<br>

>>>> and folding benefits also after __nv_cos is used. See below.<br>

>>>><br>

>>>><br>

>>>>> The standard library as bitcode raises some questions.<br>

>>>> Which standard library? CUDA's libdevice is a bitcode library, right?<br>

>>>><br>

>>> It's whatever LLVM will need to lower libcalls to. libdevice bitcode is the<br>

>>> closest approximation of that we have at the moment.<br>

>>><br>

>>><br>

>>>>> * When do we want to do the linking? If we do it at the beginning, then the<br>

>>>>> question is how to make sure unused functions are not eliminated before we<br>

>>>>> may need them, as we don't know a priori what's going to be needed. We also<br>

>>>>> do want the unused functions to be gone after we're done. Linking it in<br>

>>>>> early would allow optimizing the code better at the expense of having to<br>

>>>>> optimize a lot of code we'll throw away. Linking it in late has less<br>

>>>>> overhead, but leaves the linked-in bitcode unoptimized, though it's<br>

>>>>> probably in the ballpark of what would happen with a real library call.<br>

>>>>> I.e. no inlining, etc.<br>

>>>>><br>

>>>>> * It incorporates linking into LLVM, which is not LLVM's job. Arguably, the<br>

>>>>> line should be drawn at the lowering to libcalls as it's done for other<br>

>>>>> back-ends. However, we're also constrained by the need to have the<br>

>>>>> linking done before we generate PTX, which prevents doing it after LLVM is<br>

>>>>> done generating an object file.<br>

>>>> I'm confused. Clang links in libdevice.bc early.<br>

>>> Yes. Because that's where it has to happen if we want to keep LLVM unaware<br>

>>> of CUDA SDK.<br>

>>> It does not have to be the case if/when LLVM can do the linking itself.<br>

>>><br>

>>><br>

>>>> If we make sure<br>

>>>> `__nv_cos` is not deleted early, we can at any point "lower" `llvm.cos`<br>

>>>> to `__nv_cos` which is available. After the lowering we can remove<br>

>>>> the artificial uses of `__nv_XXX` functions that kept the<br>

>>>> definitions around, so that unused ones drop out of the final result.<br>

>>>><br>

>>> This is the 'link early' approach, I should've been explicit that it's<br>

>>> 'link early *everything*' as opposed to linking only what's needed at the<br>

>>> beginning.<br>

>>> It would work at the expense of having to process/optimize 500KB worth of<br>

>>> bitcode for every compilation, whether it needs it or not.<br>

>>><br>

>>><br>

>>>> We get the benefit of having `llvm.cos` for some of the pipeline,<br>

>>>> we know it does not have all the bad effects while `__nv_cos` is defined<br>

>>>> with inline assembly. We also get the benefit of inlining `__nv_cos`<br>

>>>> and folding the implementation based on the arguments. Finally,<br>

>>>> this should work with the existing pipeline, the linking is the same<br>

>>>> as before, all we do is to keep the definitions alive longer and<br>

>>>> lower `llvm.cos` to `__nv_cos` in a middle end pass.<br>

>>>><br>

>>> Again, I agree that it is doable.<br>

>>><br>

>>><br>

>>><br>

>>>> This might be similar to the PTX solution you describe below but I feel<br>

>>>> we get the inline benefit from this without actually changing the pipeline<br>

>>>> at all.<br>

>>>><br>

>>> So, to summarize:<br>

>>> * link the library as bitcode early, add artificial placeholders for<br>

>>> everything, compile, remove placeholders and DCE unused stuff away.<br>

>>>     Pros:<br>

>>>        - we're already doing most of it before clang hands off IR to<br>

>>> LLVM, so it just pushes it a bit lower in the compilation.<br>

>>>     Cons:<br>

>>>        - runtime cost of optimizing libdevice bitcode,<br>

>>>        - libdevice may be required for all NVPTX compilations?<br>

>>><br>

>>> * link the library as bitcode late.<br>

>>>      Pros:<br>

>>>        - lower runtime cost than link-early approach.<br>

>>>      Cons:<br>

>>>        - We'll need to make sure that NVVMReflect pass processes the library.<br>

>>>        - fewer optimizations on the library functions. Some of the code gets<br>

>>> DCE'ed away after NVVMReflect and the rest could be optimized better.<br>

>>>        - libdevice may be required for all NVPTX compilations?<br>

>>> * 'link' with the library as PTX appended as text to LLVM's output and let<br>

>>> ptxas do the 'linking'<br>

>>>     Pros:  LLVM remains agnostic of CUDA SDK installation details. All it<br>

>>> does is allow lowering libcalls and leave their resolution to the<br>

>>> external tools.<br>

>>>     Cons: Need to have the PTX library somewhere and need to integrate the<br>

>>> 'linking' into the compilation process somehow.<br>

>>><br>

>>> None is particularly good. If the runtime overhead of link-early is<br>

>>> acceptable, then it may be a winner here, by a very small margin.<br>

>>> link-as-PTX may be better conceptually as it keeps linking and compilation<br>

>>> separate.<br>

>>><br>

>>> As for the practical steps, here's what we need:<br>

>>> - allow libcall lowering in NVPTX, possibly guarded by a flag. This is<br>

>>> needed for all of the approaches above.<br>

>>> - teach LLVM how to link in bitcode (and, possibly, control early/late mode)<br>

>>> - teach clang driver to delegate libdevice linking to LLVM.<br>

>>><br>

>>> This will allow us to experiment with all three approaches and see what<br>

>>> works best.<br>

>> I think if we embed knowledge about the nv_XXX functions we can<br>

>> even get away without the cons you listed for early linking above.<br>

>><br>

> WDYM by `embed knowledge about the nv_XXX functions`? By linking those<br>

> functions in? Or do you mean that we should just declare them<br>

> before/instead of linking libdevice in?<br>

I mean by providing the "libcall lowering" pass, i.e., the knowledge<br>

that llvm.cos maps to __nv_cos.<br>
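
As a rough illustration of what that knowledge amounts to, here is a toy<br>

Python model (not LLVM code). The table is only an illustrative excerpt, and<br>

`lower_libcalls` and the list-of-callee-names representation are made up for<br>

this sketch; a real pass would rewrite call instructions in the IR.<br>

<pre>
```python
# Toy model only: a "module" is reduced to a list of callee names.
# The table is a hypothetical excerpt, not the full libdevice surface.
INTRINSIC_TO_LIBDEVICE = {
    "llvm.cos.f64": "__nv_cos",
    "llvm.sin.f64": "__nv_sin",
    "llvm.cos.f32": "__nv_cosf",
    "llvm.sin.f32": "__nv_sinf",
}


def lower_libcalls(callees):
    """Return the callee names with every mapped intrinsic replaced.

    Names without an entry in the table (ordinary functions, direct
    __nv_XXX calls) are passed through unchanged.
    """
    return [INTRINSIC_TO_LIBDEVICE.get(name, name) for name in callees]
```
</pre>

Unmapped names pass through untouched, so direct __nv_XXX calls and<br>

llvm.XXX calls can still be mixed freely before the rewrite.<br>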

<br>

><br>

><br>

>> For early link I'm assuming an order similar to [0] but I also discuss<br>

>> the case where we don't link libdevice early for a TU.<br>

>><br>

> That link just describes the steps needed to use libdevice. It does not<br>

> deal with how/where it fits in the LLVM pipeline.<br>

> The gist is that NVVMReflect replaces some conditionals with constants.<br>

> libdevice uses that as a poor man's IR preprocessor, conditionally enabling<br>

> different implementations and relying on DCE and constant folding to remove<br>

> unused parts and eliminate the now useless branches.<br>

> While running NVVMReflect alone will make libdevice code valid and usable, it<br>

> would still benefit from further optimizations. I do not know to what<br>

> degree, though.<br>

><br>

><br>

>> Link early:<br>

>> 1) clang emits module.bc and links in libdevice.bc but with the<br>

>>      `optnone`, `noinline`, and "used" attribute for functions in<br>

>>      libdevice. ("used" is not an attribute but could as well be.)<br>

>>      At this stage module.bc might call __nv_XXX or llvm.XXX freely<br>

>>      as defined by -ffast-math and friends.<br>

>><br>

> That could work. Just carrying extra IR around would probably be OK.<br>

> We may want to do NVVMReflect as soon as we have it linked in and, maybe,<br>

> allow optimizing the functions that are explicitly used already.<br>

<br>

Right. NVVMReflect can be run twice and with `alwaysinline`<br>

on the call sites of __nv_XXX functions we will actually<br>

inline and optimize them while the definitions are just "dragged<br>

along" in case we need them later.<br>

<br>

<br>

>> 2) Run some optimizations in the middle end, maybe till the end of<br>

>>      the inliner loop, unsure.<br>

>> 3) Run a libcall lowering pass and another NVVMReflect pass (or the<br>

>>      only instance thereof). We effectively remove all llvm.XXX calls<br>

>>      in favor of __nv_XXX now. Note that we haven't spent (much) time<br>

>>      on the libdevice code as it is optnone and most passes are good<br>

>>      at skipping those. To me, it's unclear if the used parts should<br>

>>      not be optimized before we inline them anyway to avoid redoing<br>

>>      the optimizations over and over (per call site). That needs<br>

>>      measuring I guess. Also note that we can still retain the current<br>

>>      behavior for direct calls to __nv_XXX if we mark the call sites<br>

>>      as `alwaysinline`, or at least the behavior is almost like the<br>

>>      current one.<br>

>> 4) Run an always inliner pass on the __nv_XXX calls because it is<br>

>>      something we would do right now. Alternatively, remove `optnone`<br>

>>      and `noinline` from the __nv_XXX calls.<br>

>> 5) Continue with the pipeline as before.<br>

>><br>

>><br>

> SGTM.<br>

><br>

><br>

>> As mentioned above, `optnone` avoids spending time on the libdevice<br>

>> until we "activate" it. At that point (globals) DCE can be scheduled<br>

>> to remove all unused parts right away. I don't think this is (much)<br>

>> more expensive than linking libdevice early right now.<br>

>><br>
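
The link-early steps above can be modeled roughly as follows. This is a toy<br>

Python sketch under simplifying assumptions: a module is a dict from each<br>

defined function to the list of names it calls, `lower_and_prune` is a<br>

hypothetical helper, and the reachability walk stands in for (globals) DCE<br>

once the artificial uses have been dropped.<br>

<pre>
```python
def lower_and_prune(functions, roots, mapping):
    """Lower intrinsic calls, then drop definitions nobody reaches.

    functions: dict of defined function name to list of callee names.
    roots:     names that must survive (the user's own entry points).
    mapping:   intrinsic name to library function name.
    """
    # Step 3: rewrite intrinsic calls to their library counterparts.
    lowered = {
        name: [mapping.get(callee, callee) for callee in callees]
        for name, callees in functions.items()
    }
    # Afterwards: the artificial uses are gone, so keep only definitions
    # reachable from the roots (a crude stand-in for global DCE).
    live = set()
    worklist = list(roots)
    while worklist:
        name = worklist.pop()
        if name in live or name not in lowered:
            continue  # already visited, or only a declaration
        live.add(name)
        worklist.extend(lowered[name])
    return {name: lowered[name] for name in lowered if name in live}
```
</pre>

If the module carries no __nv_XXX definitions at all, the same function<br>

still rewrites the calls and simply has nothing to prune.<br>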

>> Link late, aka. translation units without libdevice:<br>

>> 1) clang emits module.bc but does not link in libdevice.bc, it will be<br>

>>      made available later. We still can mix __nv_XXX and llvm.XXX calls<br>

>>      freely as above.<br>

>> 2) Same as above.<br>

>> 3) Same as above.<br>

>> 4) Same as above but effectively a no-op, no __nv_XXX definitions are<br>

>>      available.<br>

>> 5) Same as above.<br>

>><br>

>><br>

>> I might misunderstand something about the current pipeline but from [0]<br>

>> and the experiments I ran locally it looks like the above should cover all<br>

>> the cases. WDYT?<br>

>><br>

>><br>

> The `optnone` trick may indeed remove much of the practical differences<br>

> between the early/late approaches.<br>

> In principle it should work.<br>

><br>

> Next question is -- is libdevice sufficient to satisfy LLVM's assumptions<br>

> about the standard library.<br>

> While it does provide most of the equivalents of libm functions, the set is<br>

> not complete and some of the functions differ from their libm counterparts.<br>

> The differences are minor, so we should be able to deal with it by<br>

> generating a few wrapper functions for the odd cases.<br>

> Here's what clang does to provide math functions using libdevice:<br>

> <a href="https://github.com/llvm/llvm-project/blob/main/clang/lib/Headers/__clang_cuda_math.h" rel="noreferrer" target="_blank">https://github.com/llvm/llvm-project/blob/main/clang/lib/Headers/__clang_cuda_math.h</a><br>

<br>

Right now, clang will generate any LLVM intrinsic and we crash, so anything<br>

else is probably a step in the right direction. Eventually, we should "lower"<br>

all intrinsics that the NVPTX backend can't handle or at least emit a nice<br>

error message. Preferably, clang would know what we can't deal with and not<br>

generate intrinsic calls for those in the first place.<br>
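
A toy Python sketch of the "nice error message" fallback, again not LLVM<br>

code. `check_lowerable`, `UnsupportedIntrinsicError`, and the split into a<br>

library mapping and a set of natively handled intrinsics are assumptions<br>

made for illustration only.<br>

<pre>
```python
class UnsupportedIntrinsicError(Exception):
    """Raised when an intrinsic has neither a native nor a library lowering."""


def check_lowerable(callees, library_mapping, native):
    """Raise a readable error instead of crashing later in the backend.

    callees:         names called in the module.
    library_mapping: intrinsic name to library function name.
    native:          intrinsics the backend is assumed to handle itself.
    """
    unsupported = sorted(
        name
        for name in set(callees)
        if name.startswith("llvm.")
        and name not in library_mapping
        and name not in native
    )
    if unsupported:
        raise UnsupportedIntrinsicError(
            "NVPTX cannot lower: " + ", ".join(unsupported)
        )
```
</pre>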

<br>

<br>

><br>

> The most concerning aspect of libdevice is that we don't know when we'll no<br>

> longer be able to use the libdevice bitcode. My understanding is that IR<br>

> does not guarantee binary stability and at some point we may just be unable<br>

> to use it. Ideally we need our own libm for GPUs.<br>

<br>

For OpenMP I did my best to avoid writing libm (code) for GPUs by<br>

piggybacking on CUDA and libc++ implementations. I hope it will stay that way.<br>

That said, if the need arises we might really have to port libc++ to the<br>

GPUs.<br>

<br>

Back to the problem with libdevice. I agree that the solution of NVIDIA<br>

to ship a .bc library is suboptimal but with the existing, or an extended,<br>

auto-upgrader we might be able to make that work reasonably well for the<br>

foreseeable future. That problem is orthogonal to what we are discussing<br>

above, I think.<br>

<br>

~ Johannes<br>

<br>

<br>

><br>

> --Artem<br>

><br>

><br>

>> ~ Johannes<br>

>><br>

>><br>

>> P.S. If the rewrite capability (aka libcall lowering) is generic we could<br>

>>        use the scheme for many other things as well.<br>

>><br>

>><br>

>> [0] <a href="https://llvm.org/docs/NVPTXUsage.html#linking-with-libdevice" rel="noreferrer" target="_blank">https://llvm.org/docs/NVPTXUsage.html#linking-with-libdevice</a><br>

>><br>

>><br>

>>> --Artem<br>

>>><br>

>>><br>

>>>> ~ Johannes<br>

>>>><br>

>>>><br>

>>>>> One thing that may work within the existing compilation model is to<br>

>>>>> pre-compile the standard library into PTX and then textually embed relevant<br>

>>>>> functions into the generated PTX, thus pushing the 'linking' phase past the<br>

>>>>> end of LLVM's compilation and making it look closer to the standard<br>

>>>>> compile/link process. This way we'd only enable libcall lowering in NVPTX,<br>

>>>>> assuming that the library functions will be magically available out there.<br>

>>>>> Injection of PTX could be done with an external script outside of LLVM and<br>

>>>>> it could be incorporated into clang driver. Bonus points for the fact that<br>

>>>>> this scheme is compatible with -fgpu-rdc out of the box -- assemble the PTX<br>

>>>>> with `ptxas -rdc` and then actually link with the library, instead of<br>

>>>>> injecting its PTX before invoking ptxas.<br>

>>>>><br>

>>>>> --Artem<br>

>>>>><br>

>>>>> Trying to figure out a good way to have the cake and eat it too.<br>

>>>>>> ~ Johannes<br>

>>>>>><br>

>>>>>><br>

>>>>>> On 3/10/21 2:49 PM, William Moses wrote:<br>

>>>>>>> Since clang (and arguably any other frontend that uses) should link<br>

>> in<br>

>>>>>>> libdevice, could we lower these intrinsics to the libdevice code?<br>

>>>>> The linking happens *before* LLVM gets to work on IR.<br>

>>>>> As I said, it's a workaround, not the solution. It's possible for LLVM to<br>

>>>>> still attempt lowering something in the IR into a libcall and we would not<br>

>>>>> be able to deal with that. It happens to work well enough in practice.<br>

>>>>><br>

>>>>> Do you have an example where you see the problem with -ffast-math?<br>

>>>>><br>

>>>>><br>

>>>>><br>

>>>>>>> For example, consider compiling the simple device function below:<br>

>>>>>>><br>

>>>>>>> ```<br>

>>>>>>> // /mnt/sabrent/wmoses/llvm13/build/bin/clang tmp.cu -S -emit-llvm<br>

>>>>>>>      --cuda-path=/usr/local/cuda-11.0 -L/usr/local/cuda-11.0/lib64<br>

>>>>>>> --cuda-gpu-arch=sm_37<br>

>>>>>>> __device__ double f(double x) {<br>

>>>>>>>         return cos(x);<br>

>>>>>>> }<br>

>>>>>>> ```<br>

>>>>>>><br>

>>>>>>> The LLVM module for it is as follows:<br>

>>>>>>><br>

>>>>>>> ```<br>

>>>>>>> ...<br>

>>>>>>> define dso_local double @_Z1fd(double %x) #0 {<br>

>>>>>>> entry:<br>

>>>>>>>       %__a.addr.i = alloca double, align 8<br>

>>>>>>>       %x.addr = alloca double, align 8<br>

>>>>>>>       store double %x, double* %x.addr, align 8<br>

>>>>>>>       %0 = load double, double* %x.addr, align 8<br>

>>>>>>>       store double %0, double* %__a.addr.i, align 8<br>

>>>>>>>       %1 = load double, double* %__a.addr.i, align 8<br>

>>>>>>>       %call.i = call contract double @__nv_cos(double %1) #7<br>

>>>>>>>       ret double %call.i<br>

>>>>>>> }<br>

>>>>>>><br>

>>>>>>> define internal double @__nv_cos(double %a) #1 {<br>

>>>>>>>       %q.i = alloca i32, align 4<br>

>>>>>>> ```<br>

>>>>>>><br>

>>>>>>> Obviously we would need to do something to ensure these functions don't get<br>

>>>>>>> deleted prior to their use in lowering from intrinsic to libdevice.<br>

>>>>>>> ...<br>

>>>>>>><br>

>>>>>>><br>

>>>>>>> On Wed, Mar 10, 2021 at 3:39 PM Artem Belevich <<a href="mailto:tra@google.com" target="_blank">tra@google.com</a>> wrote:<br>

>>>>>>>> On Wed, Mar 10, 2021 at 11:41 AM Johannes Doerfert <<br>

>>>>>>>> <a href="mailto:johannesdoerfert@gmail.com" target="_blank">johannesdoerfert@gmail.com</a>> wrote:<br>

>>>>>>>><br>

>>>>>>>>> Artem, Justin,<br>

>>>>>>>>><br>

>>>>>>>>> I am running into a problem and I'm curious if I'm missing something or<br>

>>>>>>>>> if the support is simply missing.<br>

>>>>>>>>> Am I correct to assume the NVPTX backend does not deal with `llvm.sin`<br>

>>>>>>>>> and friends?<br>

>>>>>>>>><br>

>>>>>>>> Correct. It can't deal with anything that may need to lower to a standard<br>

>>>>>>>> library call.<br>

>>>>>>>><br>

>>>>>>>>> This is what I see, with some variations: <a href="https://godbolt.org/z/PxsEWs" rel="noreferrer" target="_blank">https://godbolt.org/z/PxsEWs</a><br>

>>>>>>>>> If this is missing in the backend, is there a plan to get this working?<br>

>>>>>>>>> I'd really like to have the<br>

>>>>>>>>> intrinsics in the middle end rather than __nv_cos, not to mention that<br>

>>>>>>>>> -ffast-math does emit intrinsics<br>

>>>>>>>>> and crashes.<br>

>>>>>>>>><br>

>>>>>>>> It all boils down to the fact that PTX does not have the standard<br>

>>>>>>>> libc/libm which LLVM could lower the calls to, nor does it have a 'linking'<br>

>>>>>>>> phase where we could link such a library in, if we had it.<br>

>>>>>>>><br>

>>>>>>>> Libdevice bitcode does provide the implementations for some of the<br>

>>>>>>>> functions (though with a __nv_ prefix) and clang links it in in order to<br>

>>>>>>>> avoid generating IR that LLVM can't handle, but that's a workaround that<br>

>>>>>>>> does not help LLVM itself.<br>

>>>>>>>><br>

>>>>>>>> --Artem<br>

>>>>>>>><br>

>>>>>>>><br>

>>>>>>>><br>

>>>>>>>>> ~ Johannes<br>

>>>>>>>>><br>

>>>>>>>>><br>

>>>>>>>>> --<br>

>>>>>>>>> ───────────────────<br>

>>>>>>>>> ∽ Johannes (he/his)<br>

>>>>>>>>><br>

>>>>>>>>><br>

>>>>>>>> --<br>

>>>>>>>> --Artem Belevich<br>

>>>>>>>><br>

><br>

</blockquote></div>