<div dir="ltr"><div class="gmail_default" style="font-family:verdana,sans-serif"></div><div dir="ltr"><div class="gmail_default" style="font-family:verdana,sans-serif"><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Mar 10, 2021 at 1:55 PM Johannes Doerfert <<a href="mailto:johannesdoerfert@gmail.com">johannesdoerfert@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><br>

On 3/10/21 3:25 PM, Artem Belevich wrote:<br>

> On Wed, Mar 10, 2021 at 12:57 PM Johannes Doerfert <<br>

> <a href="mailto:johannesdoerfert@gmail.com" target="_blank">johannesdoerfert@gmail.com</a>> wrote:<br>

><br>

>> Right. We could keep the definition of __nv_cos and friends<br>

>> around. Right now, -ffast-math might just crash on the user,<br>

>> which is arguably a bad thing. I can also see us benefiting<br>

>> in various other ways from llvm.cos uses instead of __nv_cos<br>

>> (assuming precision is according to the user requirements but<br>

>> that is always a condition).<br>

>><br>

>> It could be as simple as introducing __nv_cos into<br>

>> "llvm.used" and a backend matching/rewrite pass.<br>

>><br>

>> If the backend knew the libdevice location it could even pick<br>

>> the definitions from there. Maybe we could link libdevice late<br>

>> instead of eager?<br>

>><br>

> It's possible, but it would require plumbing in CUDA SDK awareness into<br>

> LLVM. While clang driver can deal with that, LLVM currently can't. The<br>

> bitcode library path would have to be provided by the user.<br>

<br>

The PTX backend could arguably be CUDA SDK aware, IMHO, it would<br>

even be fine if the middle-end does the remapping to get inlining<br>

and folding benefits also after __nv_cos is used. See below.<br>

<br>

<br>

> The standard library as bitcode raises some questions.<br>

<br>

Which standard library? CUDAs libdevice is a bitcode library, right?<br></blockquote><div><br></div><div><div class="gmail_default" style="font-family:verdana,sans-serif">It's whatever LLVM will need to lower libcalls to. libdevice bitcode is the closest approximation of that we have at the moment.</div></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<br>

> * When do we want to do the linking? If we do it at the beginning, then the<br>

> question is how to make sure unused functions are not eliminated before we<br>

> may need them, as we don't know apriori what's going to be needed. We also<br>

> do want the unused functions to be gone after we're done. Linking it in<br>

> early would allow optimizing the code better at the expense of having to<br>

> optimize a lot of code we'll throw away. Linking it in late has less<br>

> overhead, but leaves the linked in bitcode unoptimized, though it's<br>

> probably in the ballpark of what would happen with a real library call.<br>

> I.e. no inlining, etc.<br>

><br>

> * It incorporates linking into LLVM, which is not LLVM's job. Arguably, the<br>

> line should be drawn at the lowering to libcalls as it's done for other<br>

> back-ends. However, we're also constrained to by the need to have the<br>

> linking done before we generate PTX which prevents doing it after LLVM is<br>

> done generating an object file.<br>

<br>

I'm confused. Clang links in libdevice.bc early.</blockquote><div> </div><div><div class="gmail_default" style="font-family:verdana,sans-serif">Yes. Because that's where it has to happen if we want to keep LLVM unaware of CUDA SDK.</div></div><div><div class="gmail_default" style="font-family:verdana,sans-serif">It does not have to be the case if/when LLVM can do the linking itself.</div></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">If we make sure<br>

`__nv_cos` is not deleted early, we can at any point "lower" `llvm.cos`<br>

to `__nv_cos` which is available. After the lowering we can remove<br>

the artificial uses of `__nv_XXX` functions that we used to keep the<br>

definitions around in order to remove them from the final result.<br></blockquote><div><br></div><div><div class="gmail_default" style="font-family:verdana,sans-serif">This is the 'link early' approach, I should've been explicit that it's 'link early *everything*' as opposed to linking only what's needed at the beginning.</div><div class="gmail_default" style="font-family:verdana,sans-serif">It would work at the expense of having to process/optimize 500KB worth of bitcode for every compilation, whether it needs it or not.</div></div><div> <br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

We get the benefit of having `llvm.cos` for some of the pipeline,<br>

we know it does not have all the bad effects while `__nv_cos` is defined<br>

with inline assembly. We also get the benefit of inlining `__nv_cos`<br>

and folding the implementation based on the arguments. Finally,<br>

this should work with the existing pipeline, the linking is the same<br>

as before, all we do is to keep the definitions alive longer and<br>

lower `llvm.cos` to `__nv_cos` in a middle end pass.<br></blockquote><div><br></div><div><div class="gmail_default" style="font-family:verdana,sans-serif">Again, I agree that it is doable. </div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<br>

This might be similar to the PTX solution you describe below but I feel<br>

we get the inline benefit from this without actually changing the pipeline<br>

at all.<br></blockquote><div><br></div><div><div class="gmail_default" style="font-family:verdana,sans-serif">So, to summarize:</div><div class="gmail_default" style="font-family:verdana,sans-serif">* link the library as bitcode early, add artificial placeholders for everything, compile, remove placeholders and DCE unused stuff away.</div><div class="gmail_default" style="font-family:verdana,sans-serif">  Pros: </div><div class="gmail_default" style="font-family:verdana,sans-serif">     - we're already doing most of it before clang hands hands off IR to LLVM, so it just pushes it a bit lower in the compilation.</div><div class="gmail_default" style="font-family:verdana,sans-serif">  Cons: </div><div class="gmail_default" style="font-family:verdana,sans-serif">     - runtime cost of optimizing libdevice bitcode, </div><div class="gmail_default" style="font-family:verdana,sans-serif">     - libdevice may be required for all NVPTX compilations? </div><div class="gmail_default" style="font-family:verdana,sans-serif"><br></div><div class="gmail_default" style="font-family:verdana,sans-serif">* link the library as bitcode late.</div><div class="gmail_default" style="font-family:verdana,sans-serif">   Pros: </div><div class="gmail_default" style="font-family:verdana,sans-serif">     - lower runtime cost than link-early approach.<br></div><div class="gmail_default" style="font-family:verdana,sans-serif">   Cons:</div><div class="gmail_default" style="font-family:verdana,sans-serif">     - We'll need to make sure that NVVMReflect pass processes the library.</div><div class="gmail_default" style="font-family:verdana,sans-serif">     - less optimizations on the library functions. Some of the code gets DCE'ed away after NVVMReflect and the rest could be optimized better.<br></div><div class="gmail_default" style="font-family:verdana,sans-serif">     - libdevice may be required for all NVPTX compilations? </div><div class="gmail_default" style="font-family:verdana,sans-serif"></div><div class="gmail_default" style="font-family:verdana,sans-serif"></div><div class="gmail_default" style="font-family:verdana,sans-serif">* 'link' with the library as PTX appended as text to LLVM's output and let ptxas do the 'linking'</div><div class="gmail_default" style="font-family:verdana,sans-serif">  Pros:  LLVM remains agnostic of CUDA SDK installation details. All it does is allows lowering libcalls and leaves their resolution to the external tools.</div><div class="gmail_default" style="font-family:verdana,sans-serif">  Cons: Need to have the PTX library somewhere and need to integrate the 'linking' into the compilation process somehow.</div></div><div><br></div><div><div class="gmail_default" style="font-family:verdana,sans-serif">Neither is particularly good. If the runtime overhead of link-early is acceptable, then it may be a winner here, by a very small margin.</div><div class="gmail_default" style="font-family:verdana,sans-serif">link-as-PTX may be better conceptually as it keeps linking and compilation separate.</div></div><div><br></div><div><div class="gmail_default" style="font-family:verdana,sans-serif">As for the practical steps, here's what we need:</div><div class="gmail_default" style="font-family:verdana,sans-serif">- allow libcall lowering in NVPTX, possibly guarded by a flag. This is needed for all of the approaches above.</div><div class="gmail_default" style="font-family:verdana,sans-serif">- teach LLVM how to link in bitcode (and, possibly, control early/late mode)</div><div class="gmail_default" style="font-family:verdana,sans-serif">- teach clang driver to delegate libdevice linking to LLVM.</div><div class="gmail_default" style="font-family:verdana,sans-serif"><br></div><div class="gmail_default" style="font-family:verdana,sans-serif">This will allow us to experiment with all three approaches and see what works best.</div><div class="gmail_default" style="font-family:verdana,sans-serif"><br></div><div class="gmail_default" style="font-family:verdana,sans-serif">--Artem</div><div class="gmail_default" style="font-family:verdana,sans-serif"><br></div></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<br>

~ Johannes<br>

<br>

<br>

><br>

> One thing that may work within the existing compilation model is to<br>

> pre-compile the standard library into PTX and then textually embed relevant<br>

> functions into the generated PTX, thus pushing the 'linking' phase past the<br>

> end of LLVM's compilation and make it look closer to the standard<br>

> compile/link process. This way we'd only enable libcall lowering in NVPTX,<br>

> assuming that the library functions will be magically available out there.<br>

> Injection of PTX could be done with an external script outside of LLVM and<br>

> it could be incorporated into clang driver. Bonus points for the fact that<br>

> this scheme is compatible with -fgpu-rdc out of the box -- assemble the PTX<br>

> with `ptxas -rdc` and then actually link with the library, instead of<br>

> injecting its PTX before invoking ptxas.<br>

><br>

> --Artem<br>

><br>

> Trying to figure out a good way to have the cake and eat it too.<br>

>> ~ Johannes<br>

>><br>

>><br>

>> On 3/10/21 2:49 PM, William Moses wrote:<br>

>>> Since clang (and arguably any other frontend that uses) should link in<br>

>>> libdevice, could we lower these intrinsics to the libdevice code?<br>

> The linking happens *before* LLVM gets to work on IR.<br>

> As I said, it's a workaround, not the solution. It's possible for LLVM to<br>

> still attempt lowering something in the IR into a libcall and we would not<br>

> be able to deal with that. It happens to work well enough in practice.<br>

><br>

> Do you have an example where you see the problem with -ffast-math?<br>

><br>

><br>

><br>

>>> For example, consider compiling the simple device function below:<br>

>>><br>

>>> ```<br>

>>> // /mnt/sabrent/wmoses/llvm13/build/bin/clang <a href="http://tmp.cu" rel="noreferrer" target="_blank">tmp.cu</a> -S -emit-llvm<br>

>>>    --cuda-path=/usr/local/cuda-11.0 -L/usr/local/cuda-11.0/lib64<br>

>>> --cuda-gpu-arch=sm_37<br>

>>> __device__ double f(double x) {<br>

>>>       return cos(x);<br>

>>> }<br>

>>> ```<br>

>>><br>

>>> The LLVM module for it is as follows:<br>

>>><br>

>>> ```<br>

>>> ...<br>

>>> define dso_local double @_Z1fd(double %x) #0 {<br>

>>> entry:<br>

>>>     %__a.addr.i = alloca double, align 8<br>

>>>     %x.addr = alloca double, align 8<br>

>>>     store double %x, double* %x.addr, align 8<br>

>>>     %0 = load double, double* %x.addr, align 8<br>

>>>     store double %0, double* %__a.addr.i, align 8<br>

>>>     %1 = load double, double* %__a.addr.i, align 8<br>

>>>     %call.i = call contract double @__nv_cos(double %1) #7<br>

>>>     ret double %call.i<br>

>>> }<br>

>>><br>

>>> define internal double @__nv_cos(double %a) #1 {<br>

>>>     %q.i = alloca i32, align 4<br>

>>> ```<br>

>>><br>

>>> Obviously we would need to do something to ensure these functions don't<br>

>> get<br>

>>> deleted prior to their use in lowering from intrinsic to libdevice.<br>

>>> ...<br>

>>><br>

>>><br>

>>> On Wed, Mar 10, 2021 at 3:39 PM Artem Belevich <<a href="mailto:tra@google.com" target="_blank">tra@google.com</a>> wrote:<br>

>>><br>

>>>> On Wed, Mar 10, 2021 at 11:41 AM Johannes Doerfert <<br>

>>>> <a href="mailto:johannesdoerfert@gmail.com" target="_blank">johannesdoerfert@gmail.com</a>> wrote:<br>

>>>><br>

>>>>> Artem, Justin,<br>

>>>>><br>

>>>>> I am running into a problem and I'm curious if I'm missing something or<br>

>>>>> if the support is simply missing.<br>

>>>>> Am I correct to assume the NVPTX backend does not deal with `llvm.sin`<br>

>>>>> and friends?<br>

>>>>><br>

>>>> Correct. It can't deal with anything that may need to lower to a<br>

>> standard<br>

>>>> library call.<br>

>>>><br>

>>>>> This is what I see, with some variations: <a href="https://godbolt.org/z/PxsEWs" rel="noreferrer" target="_blank">https://godbolt.org/z/PxsEWs</a><br>

>>>>><br>

>>>>> If this is missing in the backend, is there a plan to get this working,<br>

>>>>> I'd really like to have the<br>

>>>>> intrinsics in the middle end rather than __nv_cos, not to mention that<br>

>>>>> -ffast-math does emit intrinsics<br>

>>>>> and crashes.<br>

>>>>><br>

>>>> It all boils down to the fact that PTX does not have the standard<br>

>>>> libc/libm which LLVM could lower the calls to, nor does it have a<br>

>> 'linking'<br>

>>>> phase where we could link such a library in, if we had it.<br>

>>>><br>

>>>> Libdevice bitcode does provide the implementations for some of the<br>

>>>> functions (though with a __nv_ prefix) and clang links it in in order to<br>

>>>> avoid generating IR that LLVM can't handle, but that's a workaround that<br>

>>>> does not help LLVM itself.<br>

>>>><br>

>>>> --Artem<br>

>>>><br>

>>>><br>

>>>><br>

>>>>> ~ Johannes<br>

>>>>><br>

>>>>><br>

>>>>> --<br>

>>>>> ───────────────────<br>

>>>>> ∽ Johannes (he/his)<br>

>>>>><br>

>>>>><br>

>>>> --<br>

>>>> --Artem Belevich<br>

>>>><br>

><br>

</blockquote></div><br clear="all"><div><br></div>-- <br><div dir="ltr" class="gmail_signature"><div dir="ltr">--Artem Belevich</div></div></div>