[llvm-dev] NVPTX codegen for llvm.sin (and friends)

Johannes Doerfert via llvm-dev llvm-dev at lists.llvm.org
Thu Mar 11 16:10:56 PST 2021


On 3/11/21 1:37 PM, Artem Belevich wrote:
> On Thu, Mar 11, 2021 at 10:54 AM Johannes Doerfert <
> johannesdoerfert at gmail.com> wrote:
>
>> I certainly agree we should try to avoid a hard-coded mapping
>> in C++.
>>
>> I could see something like:
>>
>> ```
>> __attribute__((implementation("llvm.cos")))
>> double __nv_cos(...) { ... }
>>
>> ```
>>
>> and a pass that transforms all calls to a function with an
>> "implementation" to calls to that implementation. Maybe
>> later we attach a score/priority ;)
>>
>>
> I'm not sure how that would work.
> Where would you place that `__attribute__((implementation))` ? We do not
> have the definitions for `__nv_*` as they come from NVIDIA-provided
> bitcode. We could add the attribute to the declaration in
> `__clang_cuda_libdevice_declares.h`.
> How does LLVM handle the differences in function attributes between
> function declaration and definition? Will there be trouble when we link in
> the actual __nv_cos from the libdevice that would not have that attribute?
>
> Another potential gotcha is that for the functions that can't be directly
> mapped 1:1 to `__nv_*` counterparts, we'd still need to provide the
> implementation ourselves. We will not know whether the implementation will
> be used until after the substitution pass, so we'll need to make sure it's
> not DCE'd until then. It appears to be the same issue (though on a smaller
> scale) as with linking in libdevice directly.
>
> Let's take a step back and figure out what are the issues we want to solve.
>
> The top-level goal is to provide implementations for LLVM intrinsics. For
> now let's stick with libm-related ones.
> What we have is the libdevice bitcode which uses different function names
> and provides a subset of the functionality we need.
> What we miss is
>    - something to connect LLVM's libcalls to the GPU-side implementation,
>    - additional code to provide implementations for the functions that are
> missing or different in libdevice.
>
> Considering that we want this to work in LLVM, the additional code would
> have to be a bitcode and it would have to exist in addition to libdevice.
> Our options for the mapping between LLVM intrinsics and the implementation
> are
> * intrinsic -> __nv_* equivalent mapping pass
>     This would still need additional bitcode for the missing/different
> functions.
> * lower libcalls to the standard libm APIs, implement libm -> __nv_*
> mapping in our own bitcode.
>
> Considering that additional bitcode is needed in both cases, I believe that
> the second approach makes more sense.

I really hope to avoid any additional bitcode; there are too many
drawbacks and basically no benefits, IMHO.


> LLVM does not need to know or care about what's provided by libdevice, and
> we'd have more flexibility, compared to what we could do in the mapping
> pass. It also makes it easy to substitute a different implementation, if we
> have or need one.

I agree that LLVM (core) should not know about __nv_*, which is why I
suggested the `__attribute__((implements("...")))` approach. My
preferred solution is still to annotate our declarations of __nv_* and
point to the LLVM intrinsic (by name) from there. If we have a missing
mapping, we point to an intrinsic from a definition that lives in the
Clang headers next to the __nv_* declarations.
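
For illustration, the annotated declarations could look roughly like the
sketch below. The attribute name and the intrinsic spellings are
placeholders; nothing like this exists in Clang today, it's just meant to
show where the mapping would live:

```
// Hypothetical annotation in __clang_cuda_libdevice_declares.h. The
// declaration carries the mapping; the definition is linked in later
// from libdevice.bc.
__attribute__((implements("llvm.cos.f64")))
__device__ double __nv_cos(double __a);

// For something libdevice does not cover 1:1, a definition living next
// to the declarations could carry the attribute instead (illustrative
// name and intrinsic only).
__attribute__((implements("llvm.example.f64")))
__device__ double __gpu_example_impl(double __a) {
  return __nv_cos(__a); // forward to whatever libdevice does provide
}
```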

This does not yet work because -mlink-builtin-bitcode (which I assume
triggers the llvm-link logic) will drop the attributes of a declaration
if a definition is found. I think that should not be the case anyway;
instead, the union of the attributes should be kept.

The benefit I see for the above is that the mapping is tied to the
declarations and doesn't live in a tablegen file far away. It works well
even if we can't map 1:1, and we could even restrict the "used" attribute
to anything that has an "implements" attribute. So:

```
__nv_A() { ... } // called, inlined and optimized as before, DCE'ed after.

__nv_B() { ... } // not called, DCE'ed.

__attribute__((implements("llvm.C")))
__nv_C() { ... } // calls are inlined and optimized as before, not DCE'ed
                  // though because of the attribute. Replaces llvm.C as
                  // callee in the special pass.
```

So "implements" gives you a way to statically replace a function declaration
or definition with another one. I could see it being used to provide other
intrinsics to platforms with backends that don't support them.
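
A minimal sketch of such a rewrite pass, assuming the mapping is carried
as a string function attribute named "implements" whose value is the
intrinsic name (attribute spelling and pass name are illustrative, not an
existing LLVM API):

```
// Sketch of a module pass that redirects calls from an intrinsic
// (e.g. llvm.cos.f64) to the function that declares itself as the
// implementation via a string attribute 'implements="llvm.cos.f64"'.
#include "llvm/IR/Function.h"
#include "llvm/IR/Module.h"
#include "llvm/IR/PassManager.h"

using namespace llvm;

struct ImplementsRewritePass : PassInfoMixin<ImplementsRewritePass> {
  PreservedAnalyses run(Module &M, ModuleAnalysisManager &) {
    bool Changed = false;
    for (Function &Impl : M) {
      if (!Impl.hasFnAttribute("implements"))
        continue;
      StringRef IntrName =
          Impl.getFnAttribute("implements").getValueAsString();
      // If the intrinsic is referenced in this module, point all of its
      // uses at the implementation instead. Assumes matching signatures;
      // a real pass would check that and handle mismatches.
      if (Function *Intr = M.getFunction(IntrName)) {
        Intr->replaceAllUsesWith(&Impl);
        Changed = true;
      }
      // Once rewritten, the artificial "keep alive" marker on Impl can
      // be dropped so regular DCE cleans up anything left unused.
    }
    return Changed ? PreservedAnalyses::none() : PreservedAnalyses::all();
  }
};
```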

Does that make some sense?

~ Johannes


>
> WDYT?
>
> --Artem
>
>
>>
>> On 3/10/21 8:44 PM, William Moses wrote:
>>> We could also consider doing something slightly broader.
>>>
>>> For example, we could define a special attribute on top of the llvm.cos
>>> call/declaration etc. with metadata or an attribute that points to the
>>> actual __nv_cos function. Then in a subsequent lowering pass the
>>> corresponding intrinsic with the relevant attribute has its uses replaced
>>> by the actual function.
>>>
>>>
>>> On Wed, Mar 10, 2021 at 7:57 PM Johannes Doerfert <
>>> johannesdoerfert at gmail.com> wrote:
>>>
>>>> On 3/10/21 6:22 PM, Artem Belevich wrote:
>>>>> On Wed, Mar 10, 2021 at 3:44 PM Johannes Doerfert <
>>>>> johannesdoerfert at gmail.com> wrote:
>>>>>
>>>>>> On 3/10/21 4:38 PM, Artem Belevich wrote:
>>>>>>> On Wed, Mar 10, 2021 at 1:55 PM Johannes Doerfert <
>>>>>>> johannesdoerfert at gmail.com> wrote:
>>>>>>>
>>>>>>>> On 3/10/21 3:25 PM, Artem Belevich wrote:
>>>>>>>>> On Wed, Mar 10, 2021 at 12:57 PM Johannes Doerfert <
>>>>>>>>> johannesdoerfert at gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Right. We could keep the definition of __nv_cos and friends
>>>>>>>>>> around. Right now, -ffast-math might just crash on the user,
>>>>>>>>>> which is arguably a bad thing. I can also see us benefiting
>>>>>>>>>> in various other ways from llvm.cos uses instead of __nv_cos
>>>>>>>>>> (assuming precision is according to the user requirements but
>>>>>>>>>> that is always a condition).
>>>>>>>>>>
>>>>>>>>>> It could be as simple as introducing __nv_cos into
>>>>>>>>>> "llvm.used" and a backend matching/rewrite pass.
>>>>>>>>>>
>>>>>>>>>> If the backend knew the libdevice location it could even pick
>>>>>>>>>> the definitions from there. Maybe we could link libdevice late
>>>>>>>>>> instead of eager?
>>>>>>>>>>
>>>>>>>>> It's possible, but it would require plumbing in CUDA SDK awareness
>>>> into
>>>>>>>>> LLVM. While clang driver can deal with that, LLVM currently can't.
>>>> The
>>>>>>>>> bitcode library path would have to be provided by the user.
>>>>>>>> The PTX backend could arguably be CUDA SDK aware, IMHO, it would
>>>>>>>> even be fine if the middle-end does the remapping to get inlining
>>>>>>>> and folding benefits also after __nv_cos is used. See below.
>>>>>>>>
>>>>>>>>
>>>>>>>>> The standard library as bitcode raises some questions.
>>>>>>>> Which standard library? CUDAs libdevice is a bitcode library, right?
>>>>>>>>
>>>>>>> It's whatever LLVM will need to lower libcalls to. libdevice bitcode
>> is
>>>>>> the
>>>>>>> closest approximation of that we have at the moment.
>>>>>>>
>>>>>>>
>>>>>>>>> * When do we want to do the linking? If we do it at the beginning,
>>>> then
>>>>>>>> the
>>>>>>>>> question is how to make sure unused functions are not eliminated
>>>> before
>>>>>>>> we
>>>>>>>>> may need them, as we don't know a priori what's going to be needed.
>> We
>>>>>>>> also
>>>>>>>>> do want the unused functions to be gone after we're done. Linking
>> it
>>>> in
>>>>>>>>> early would allow optimizing the code better at the expense of
>> having
>>>>>> to
>>>>>>>>> optimize a lot of code we'll throw away. Linking it in late has
>> less
>>>>>>>>> overhead, but leaves the linked in bitcode unoptimized, though it's
>>>>>>>>> probably in the ballpark of what would happen with a real library
>>>> call.
>>>>>>>>> I.e. no inlining, etc.
>>>>>>>>>
>>>>>>>>> * It incorporates linking into LLVM, which is not LLVM's job.
>>>> Arguably,
>>>>>>>> the
>>>>>>>>> line should be drawn at the lowering to libcalls as it's done for
>>>> other
>>>>>>>>> back-ends. However, we're also constrained by the need to have
>> the
>>>>>>>>> linking done before we generate PTX which prevents doing it after
>>>> LLVM
>>>>>> is
>>>>>>>>> done generating an object file.
>>>>>>>> I'm confused. Clang links in libdevice.bc early.
>>>>>>> Yes. Because that's where it has to happen if we want to keep LLVM
>>>>>> unaware
>>>>>>> of CUDA SDK.
>>>>>>> It does not have to be the case if/when LLVM can do the linking
>> itself.
>>>>>>>
>>>>>>>> If we make sure
>>>>>>>> `__nv_cos` is not deleted early, we can at any point "lower"
>>>> `llvm.cos`
>>>>>>>> to `__nv_cos` which is available. After the lowering we can remove
>>>>>>>> the artificial uses of `__nv_XXX` functions that we used to keep the
>>>>>>>> definitions around in order to remove them from the final result.
>>>>>>>>
>>>>>>> This is the 'link early' approach, I should've been explicit that
>> it's
>>>>>>> 'link early *everything*' as opposed to linking only what's needed at
>>>> the
>>>>>>> beginning.
>>>>>>> It would work at the expense of having to process/optimize 500KB
>> worth
>>>> of
>>>>>>> bitcode for every compilation, whether it needs it or not.
>>>>>>>
>>>>>>>
>>>>>>>> We get the benefit of having `llvm.cos` for some of the pipeline,
>>>>>>>> we know it does not have all the bad effects while `__nv_cos` is
>>>> defined
>>>>>>>> with inline assembly. We also get the benefit of inlining `__nv_cos`
>>>>>>>> and folding the implementation based on the arguments. Finally,
>>>>>>>> this should work with the existing pipeline, the linking is the same
>>>>>>>> as before, all we do is to keep the definitions alive longer and
>>>>>>>> lower `llvm.cos` to `__nv_cos` in a middle end pass.
>>>>>>>>
>>>>>>> Again, I agree that it is doable.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> This might be similar to the PTX solution you describe below but I
>>>> feel
>>>>>>>> we get the inline benefit from this without actually changing the
>>>>>> pipeline
>>>>>>>> at all.
>>>>>>>>
>>>>>>> So, to summarize:
>>>>>>> * link the library as bitcode early, add artificial placeholders for
>>>>>>> everything, compile, remove placeholders and DCE unused stuff away.
>>>>>>>       Pros:
>>>>>>>          - we're already doing most of it before clang hands off
>>>> IR to
>>>>>>> LLVM, so it just pushes it a bit lower in the compilation.
>>>>>>>       Cons:
>>>>>>>          - runtime cost of optimizing libdevice bitcode,
>>>>>>>          - libdevice may be required for all NVPTX compilations?
>>>>>>>
>>>>>>> * link the library as bitcode late.
>>>>>>>        Pros:
>>>>>>>          - lower runtime cost than link-early approach.
>>>>>>>        Cons:
>>>>>>>          - We'll need to make sure that NVVMReflect pass processes the
>>>>>> library.
>>>>>>>          - less optimizations on the library functions. Some of the
>> code
>>>>>> gets
>>>>>>> DCE'ed away after NVVMReflect and the rest could be optimized better.
>>>>>>>          - libdevice may be required for all NVPTX compilations?
>>>>>>> * 'link' with the library as PTX appended as text to LLVM's output
>> and
>>>>>> let
>>>>>>> ptxas do the 'linking'
>>>>>>>       Pros:  LLVM remains agnostic of CUDA SDK installation details.
>> All
>>>> it
>>>>>>> does is allow lowering libcalls and leaves their resolution to the
>>>>>>> external tools.
>>>>>>>       Cons: Need to have the PTX library somewhere and need to
>> integrate
>>>> the
>>>>>>> 'linking' into the compilation process somehow.
>>>>>>>
>>>>>>> Neither is particularly good. If the runtime overhead of link-early
>> is
>>>>>>> acceptable, then it may be a winner here, by a very small margin.
>>>>>>> link-as-PTX may be better conceptually as it keeps linking and
>>>>>> compilation
>>>>>>> separate.
>>>>>>>
>>>>>>> As for the practical steps, here's what we need:
>>>>>>> - allow libcall lowering in NVPTX, possibly guarded by a flag. This
>> is
>>>>>>> needed for all of the approaches above.
>>>>>>> - teach LLVM how to link in bitcode (and, possibly, control
>> early/late
>>>>>> mode)
>>>>>>> - teach clang driver to delegate libdevice linking to LLVM.
>>>>>>>
>>>>>>> This will allow us to experiment with all three approaches and see
>> what
>>>>>>> works best.
>>>>>> I think if we embed knowledge about the nv_XXX functions we can
>>>>>> even get away without the cons you listed for early linking above.
>>>>>>
>>>>> WDYM by `embed knowledge about the nv_XXX functions`? By linking those
>>>>> functions in? Or do you mean that we should just declare them
>>>>> before/instead of linking libdevice in?
>>>> I mean by providing the "libcall lowering" pass. So the knowledge
>>>> that llvm.cos maps to __nv_cos.
>>>>
>>>>>> For early link I'm assuming an order similar to [0] but I also discuss
>>>>>> the case where we don't link libdevice early for a TU.
>>>>>>
>>>>> That link just describes the steps needed to use libdevice. It does not
>>>>> deal with how/where it fits in the LLVM pipeline.
>>>>> The gist is that NVVMreflect replaces some conditionals with constants.
>>>>> libdevice uses that as a poor man's IR preprocessor, conditionally
>>>> enabling
>>>>> different implementations and relying on DCE and constant folding to
>>>> remove
>>>>> unused parts and eliminate the now useless branches.
>>>>> While running NVVMReflect alone will make libdevice code valid and usable, it
>>>>> would still benefit from further optimizations. I do not know to what
>>>>> degree, though.
>>>>>
>>>>>
>>>>>> Link early:
>>>>>> 1) clang emits module.bc and links in libdevice.bc but with the
>>>>>>        `optnone`, `noinline`, and "used" attribute for functions in
>>>>>>        libdevice. ("used" is not an attribute but could as well be.)
>>>>>>        At this stage module.bc might call __nv_XXX or llvm.XXX freely
>>>>>>        as defined by -ffast-math and friends.
>>>>>>
>>>>> That could work. Just carrying extra IR around would probably be OK.
>>>>> We may want to do NVVMReflect as soon as we have it linked in and,
>> maybe,
>>>>> allow optimizing the functions that are explicitly used already.
>>>> Right. NVVMReflect can be run twice and with `alwaysinline`
>>>> on the call sites of __nv_XXX functions we will actually
>>>> inline and optimize them while the definitions are just "dragged
>>>> along" in case we need them later.
>>>>
>>>>
>>>>>> 2) Run some optimizations in the middle end, maybe till the end of
>>>>>>        the inliner loop, unsure.
>>>>>> 3) Run a libcall lowering pass and another NVVMReflect pass (or the
>>>>>>        only instance thereof). We effectively remove all llvm.XXX calls
>>>>>        in favor of __nv_XXX now. Note that we haven't spent (much) time
>>>>>>        on the libdevice code as it is optnone and most passes are good
>>>>>>        at skipping those. To me, it's unclear if the used parts should
>>>>>>        not be optimized before we inline them anyway to avoid redoing
>>>>>>        the optimizations over and over (per call site). That needs
>>>>>>        measuring I guess. Also note that we can still retain the
>> current
>>>>>>        behavior for direct calls to __nv_XXX if we mark the call sites
>>>>>>        as `alwaysinline`, or at least the behavior is almost like the
>>>>>>        current one is.
>>>>>> 4) Run an always inliner pass on the __nv_XXX calls because it is
>>>>>>        something we would do right now. Alternatively, remove `optnone`
>>>>>>        and `noinline` from the __nv_XXX calls.
>>>>>> 5) Continue with the pipeline as before.
>>>>>>
>>>>>>
>>>>> SGTM.
>>>>>
>>>>>
>>>>>> As mentioned above, `optnone` avoids spending time on the libdevice
>>>>>> until we "activate" it. At that point (globals) DCE can be scheduled
>>>>>> to remove all unused parts right away. I don't think this is (much)
>>>>>> more expensive than linking libdevice early right now.
>>>>>>
>>>>>> Link late, aka. translation units without libdevice:
>>>>>> 1) clang emits module.bc but does not link in libdevice.bc, it will be
>>>>>>        made available later. We still can mix __nv_XXX and llvm.XXX
>> calls
>>>>>>        freely as above.
>>>>>> 2) Same as above.
>>>>>> 3) Same as above.
>>>>>> 4) Same as above but effectively a no-op, no __nv_XXX definitions are
>>>>>>        available.
>>>>>> 5) Same as above.
>>>>>>
>>>>>>
>>>>>> I might misunderstand something about the current pipeline but from
>> [0]
>>>>>> and the experiments I run locally it looks like the above should cover
>>>> all
>>>>>> the cases. WDYT?
>>>>>>
>>>>>>
>>>>> The `optnone` trick may indeed remove much of the practical differences
>>>>> between the early/late approaches.
>>>>> In principle it should work.
>>>>>
>>>>> Next question is -- is libdevice sufficient to satisfy LLVM's
>> assumptions
>>>>> about the standard library.
>>>>> While it does provide most of the equivalents of libm functions, the
>> set
>>>> is
>>>>> not complete and some of the functions differ from their libm
>>>> counterparts.
>>>>> The differences are minor, so we should be able to deal with it by
>>>>> generating a few wrapper functions for the odd cases.
>>>>> Here's what clang does to provide math functions using libdevice:
>>>>>
>> https://github.com/llvm/llvm-project/blob/main/clang/lib/Headers/__clang_cuda_math.h
>>>> Right now, clang will generate any llvm intrinsic and we crash, so
>> anything
>>>> else is probably a step in the right direction. Eventually, we should
>>>> "lower"
>>>> all intrinsics that the NVPTX backend can't handle or at least emit a
>> nice
>>>> error message. Preferably, clang would know what we can't deal with and
>> not
>>>> generate intrinsic calls for those in the first place.
>>>>
>>>>
>>>>> The most concerning aspect of libdevice is that we don't know when
>> we'll
>>>> no
>>>>> longer be able to use the libdevice bitcode? My understanding is that
>> IR
>>>>> does not guarantee binary stability and at some point we may just be
>>>> unable
>>>>> to use it. Ideally we need our own libm for GPUs.
>>>> For OpenMP I did my best to avoid writing libm (code) for GPUs by
>>>> piggybacking on CUDA and libc++ implementations; I hope it will stay that
>> way.
>>>> That said, if the need arises we might really have to port libc++ to the
>>>> GPUs.
>>>>
>>>> Back to the problem with libdevice. I agree that the solution of NVIDIA
>>>> to ship a .bc library is suboptimal but with the existing, or an
>> extended,
>>>> auto-upgrader we might be able to make that work reasonably well for the
>>>> foreseeable future. That problem is orthogonal to what we are discussing
>>>> above, I think.
>>>>
>>>> ~ Johannes
>>>>
>>>>
>>>>> --Artem
>>>>>
>>>>>
>>>>>> ~ Johannes
>>>>>>
>>>>>>
>>>>>> P.S. If the rewrite capability (aka libcall lowering) is generic we
>>>> could
>>>>>>          use the scheme for many other things as well.
>>>>>>
>>>>>>
>>>>>> [0] https://llvm.org/docs/NVPTXUsage.html#linking-with-libdevice
>>>>>>
>>>>>>
>>>>>>> --Artem
>>>>>>>
>>>>>>>
>>>>>>>> ~ Johannes
>>>>>>>>
>>>>>>>>
>>>>>>>>> One thing that may work within the existing compilation model is to
>>>>>>>>> pre-compile the standard library into PTX and then textually embed
>>>>>>>> relevant
>>>>>>>>> functions into the generated PTX, thus pushing the 'linking' phase
>>>> past
>>>>>>>> the
>>>>>>>>> end of LLVM's compilation and make it look closer to the standard
>>>>>>>>> compile/link process. This way we'd only enable libcall lowering in
>>>>>>>> NVPTX,
>>>>>>>>> assuming that the library functions will be magically available out
>>>>>>>> there.
>>>>>>>>> Injection of PTX could be done with an external script outside of
>>>> LLVM
>>>>>>>> and
>>>>>>>>> it could be incorporated into clang driver. Bonus points for the
>> fact
>>>>>>>> that
>>>>>>>>> this scheme is compatible with -fgpu-rdc out of the box -- assemble
>>>> the
>>>>>>>> PTX
>>>>>>>>> with `ptxas -rdc` and then actually link with the library, instead
>> of
>>>>>>>>> injecting its PTX before invoking ptxas.
>>>>>>>>>
>>>>>>>>> --Artem
>>>>>>>>>
>>>>>>>>> Trying to figure out a good way to have the cake and eat it too.
>>>>>>>>>> ~ Johannes
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 3/10/21 2:49 PM, William Moses wrote:
>>>>>>>>>>> Since clang (and arguably any other frontend that uses it) should
>> link
>>>>>> in
>>>>>>>>>>> libdevice, could we lower these intrinsics to the libdevice code?
>>>>>>>>> The linking happens *before* LLVM gets to work on IR.
>>>>>>>>> As I said, it's a workaround, not the solution. It's possible for
>>>> LLVM
>>>>>> to
>>>>>>>>> still attempt lowering something in the IR into a libcall and we
>>>> would
>>>>>>>> not
>>>>>>>>> be able to deal with that. It happens to work well enough in
>>>> practice.
>>>>>>>>> Do you have an example where you see the problem with -ffast-math?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>> For example, consider compiling the simple device function below:
>>>>>>>>>>>
>>>>>>>>>>> ```
>>>>>>>>>>> // /mnt/sabrent/wmoses/llvm13/build/bin/clang tmp.cu -S
>> -emit-llvm
>>>>>>>>>>>        --cuda-path=/usr/local/cuda-11.0
>> -L/usr/local/cuda-11.0/lib64
>>>>>>>>>>> --cuda-gpu-arch=sm_37
>>>>>>>>>>> __device__ double f(double x) {
>>>>>>>>>>>           return cos(x);
>>>>>>>>>>> }
>>>>>>>>>>> ```
>>>>>>>>>>>
>>>>>>>>>>> The LLVM module for it is as follows:
>>>>>>>>>>>
>>>>>>>>>>> ```
>>>>>>>>>>> ...
>>>>>>>>>>> define dso_local double @_Z1fd(double %x) #0 {
>>>>>>>>>>> entry:
>>>>>>>>>>>         %__a.addr.i = alloca double, align 8
>>>>>>>>>>>         %x.addr = alloca double, align 8
>>>>>>>>>>>         store double %x, double* %x.addr, align 8
>>>>>>>>>>>         %0 = load double, double* %x.addr, align 8
>>>>>>>>>>>         store double %0, double* %__a.addr.i, align 8
>>>>>>>>>>>         %1 = load double, double* %__a.addr.i, align 8
>>>>>>>>>>>         %call.i = call contract double @__nv_cos(double %1) #7
>>>>>>>>>>>         ret double %call.i
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> define internal double @__nv_cos(double %a) #1 {
>>>>>>>>>>>         %q.i = alloca i32, align 4
>>>>>>>>>>> ```
>>>>>>>>>>>
>>>>>>>>>>> Obviously we would need to do something to ensure these functions
>>>>>> don't
>>>>>>>>>> get
>>>>>>>>>>> deleted prior to their use in lowering from intrinsic to
>> libdevice.
>>>>>>>>>>> ...
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Mar 10, 2021 at 3:39 PM Artem Belevich <tra at google.com>
>>>>>> wrote:
>>>>>>>>>>>> On Wed, Mar 10, 2021 at 11:41 AM Johannes Doerfert <
>>>>>>>>>>>> johannesdoerfert at gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Artem, Justin,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I am running into a problem and I'm curious if I'm missing
>>>>>> something
>>>>>>>> or
>>>>>>>>>>>>> if the support is simply missing.
>>>>>>>>>>>>> Am I correct to assume the NVPTX backend does not deal with
>>>>>>>> `llvm.sin`
>>>>>>>>>>>>> and friends?
>>>>>>>>>>>>>
>>>>>>>>>>>> Correct. It can't deal with anything that may need to lower to a
>>>>>>>>>> standard
>>>>>>>>>>>> library call.
>>>>>>>>>>>>
>>>>>>>>>>>>> This is what I see, with some variations:
>>>>>>>> https://godbolt.org/z/PxsEWs
>>>>>>>>>>>>> If this is missing in the backend, is there a plan to get this
>>>>>>>> working,
>>>>>>>>>>>>> I'd really like to have the
>>>>>>>>>>>>> intrinsics in the middle end rather than __nv_cos, not to
>> mention
>>>>>>>> that
>>>>>>>>>>>>> -ffast-math does emit intrinsics
>>>>>>>>>>>>> and crashes.
>>>>>>>>>>>>>
>>>>>>>>>>>> It all boils down to the fact that PTX does not have the
>> standard
>>>>>>>>>>>> libc/libm which LLVM could lower the calls to, nor does it have
>> a
>>>>>>>>>> 'linking'
>>>>>>>>>>>> phase where we could link such a library in, if we had it.
>>>>>>>>>>>>
>>>>>>>>>>>> Libdevice bitcode does provide the implementations for some of
>> the
>>>>>>>>>>>> functions (though with a __nv_ prefix) and clang links it in in
>>>>>> order
>>>>>>>> to
>>>>>>>>>>>> avoid generating IR that LLVM can't handle, but that's a
>>>> workaround
>>>>>>>> that
>>>>>>>>>>>> does not help LLVM itself.
>>>>>>>>>>>>
>>>>>>>>>>>> --Artem
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> ~ Johannes
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> ───────────────────
>>>>>>>>>>>>> ∽ Johannes (he/his)
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> --Artem Belevich
>>>>>>>>>>>>
>

