[llvm-dev] NVPTX codegen for llvm.sin (and friends)

Johannes Doerfert via llvm-dev llvm-dev at lists.llvm.org
Fri Mar 12 08:48:49 PST 2021


I prototyped the LLVM-Core parts last night:

https://reviews.llvm.org/D98516

If this is something we support I'll write an RFC, also
for the missing clang parts.

~ Johannes

[EOM]
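
For reference, a minimal sketch of the rewrite pass being discussed below (the
"implements" attribute spelling, pass name, and details are illustrative and
not necessarily what D98516 implements):

```cpp
// Sketch only: redirect calls to an intrinsic onto a function carrying a
// string attribute implements="<intrinsic name>". Names are illustrative.
#include "llvm/ADT/DenseMap.h"
#include "llvm/ADT/StringRef.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/Module.h"
#include "llvm/IR/PassManager.h"

using namespace llvm;

struct LowerIntrinsicsToImplementationsPass
    : PassInfoMixin<LowerIntrinsicsToImplementationsPass> {
  PreservedAnalyses run(Module &M, ModuleAnalysisManager &) {
    // Collect "implements" annotations, e.g. __nv_cos -> "llvm.cos.f64".
    DenseMap<StringRef, Function *> Impls;
    for (Function &F : M)
      if (F.hasFnAttribute("implements"))
        Impls[F.getFnAttribute("implements").getValueAsString()] = &F;

    bool Changed = false;
    for (Function &F : M) {
      if (!F.isIntrinsic())
        continue;
      auto It = Impls.find(F.getName());
      // Only rewrite if the signatures match; otherwise a wrapper is needed.
      if (It == Impls.end() ||
          It->second->getFunctionType() != F.getFunctionType())
        continue;
      F.replaceAllUsesWith(It->second);
      Changed = true;
    }
    return Changed ? PreservedAnalyses::none() : PreservedAnalyses::all();
  }
};
```

With something like this, calls to `llvm.cos.f64` resolve to `__nv_cos` without
LLVM core hard-coding any `__nv_*` names.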


On 3/11/21 7:59 PM, Artem Belevich wrote:
> On Thu, Mar 11, 2021 at 4:10 PM Johannes Doerfert <
> johannesdoerfert at gmail.com> wrote:
>
>> On 3/11/21 1:37 PM, Artem Belevich wrote:
>>> On Thu, Mar 11, 2021 at 10:54 AM Johannes Doerfert <
>>> johannesdoerfert at gmail.com> wrote:
>>>
>>>> I certainly agree we should try to avoid a hard-coded mapping
>>>> in C++.
>>>>
>>>> I could see something like:
>>>>
>>>> ```
>>>> __attribute__((implementation("llvm.cos")))
>>>> double __nv_cos(...) { ... }
>>>>
>>>> ```
>>>>
>>>> and a pass that transforms all calls to a function with an
>>>> "implementation" to calls to that implementation. Maybe
>>>> later we attach a score/priority ;)
>>>>
>>>>
>>> I'm not sure how that would work.
>>> Where would you place that `__attribute__((implementation))` ? We do not
>>> have the definitions for `__nv_*` as they come from NVIDIA-provided
>>> bitcode. We could add the attribute to the declaration in
>>> `__clang_cuda_libdevice_declares.h`.
>>> How does LLVM handle the differences in function attributes between
>>> function declaration and definition? Will there be trouble when we link in
>>> the actual __nv_cos from the libdevice that would not have that attribute?
>>> Another potential gotcha is that for the functions that can't be directly
>>> mapped 1:1 to `__nv_*` counterparts, we'd still need to provide the
>>> implementation ourselves. We will not know whether the implementation will
>>> be used until after the substitution pass, so we'll need to make sure it's
>>> not DCE'd until then. It appears to be the same issue (though on a smaller
>>> scale) as with linking in libdevice directly.
>>>
>>> Let's take a step back and figure out what are the issues we want to solve.
>>> The top-level goal is to provide implementation for LLVM intrinsics. For
>>> now let's stick with libm-related ones.
>>> What we have is the libdevice bitcode which uses different function names
>>> and provides a subset of the functionality we need.
>>> What we miss is
>>>     - something to connect LLVM's libcalls to the GPU-side implementation,
>>>     - additional code to provide implementations for the functions that
>>>       are missing or different in libdevice.
>>>
>>> Considering that we want this to work in LLVM, the additional code would
>>> have to be a bitcode and it would have to exist in addition to libdevice.
>>> Our options for the mapping between LLVM intrinsics and the implementation
>>> are
>>> * intrinsic -> __nv_* equivalent mapping pass
>>>      This would still need additional bitcode for the missing/different
>>> functions.
>>> * lower libcalls to the standard libm APIs, implement libm -> __nv_*
>>> mapping in our own bitcode.
>>>
>>> Considering that additional bitcode is needed in both cases, I believe that
>>> the second approach makes more sense.
>> I really hope to avoid any additional bitcode, there are too many
>> drawbacks and basically no benefits, IMHO.
>>
> Could you elaborate on the drawbacks?
>
> The fact is that we already depend on the external bitcode (libdevice in
> this case), though right now we're trying to keep that to clang only. The
> current approach is not sound in principle and is rather brittle in
> practice. Nor is clang the only source of IR for LLVM to compile, so it
> leaves LLVM-only users without a good solution. There are
> already a handful of JIT compilers that each do their own gluing of
> libdevice into the IR they want to compile for NVPTX. I think we do have a
> very good reason to deal with that in LLVM itself.
>
> While I agree that additional bitcode is a hassle, I think it would be a
> net positive change for LLVM usability for NVPTX users.
> The external bitcode would not be required for those who do not need
> libdevice now, so the change should not be disruptive.
>
>>> LLVM does not need to know or care about what's provided by libdevice, and
>>> we'd have more flexibility, compared to what we could do in the mapping
>>> pass. It also makes it easy to substitute a different implementation, if we
>>> have or need one.
>> I agree that LLVM (core) should not know about __nv_*, that's why I
>> suggested
>> the `__attribute__((implements("...")))` approach. My preferred solution
>> is still to annotate our declarations of __nv_* and point to the
>> llvm.intrinsics (name) from there. If we have a missing mapping, we
>> point to an
>> intrinsic from a definition that lives in the Clang headers next to the
>> __nv_*  declarations.
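>>
>> As a sketch, the annotated declarations in the header could look like this
>> ("implements" is the proposed spelling, not an existing clang attribute):
>>
>> ```
>> // __clang_cuda_libdevice_declares.h (sketch; attribute name is a proposal)
>> extern "C" {
>> __attribute__((implements("llvm.cos.f64")))
>> __device__ double __nv_cos(double __a);
>>
>> __attribute__((implements("llvm.sin.f64")))
>> __device__ double __nv_sin(double __a);
>> }
>> ```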
>>
> We may have slightly different end goals in mind.
> I was thinking of making the solution work for LLVM. I.e. users would be
> free to use llvm.sin with NVPTX back-end with a few documented steps needed
> to make it work (basically "pass additional
> -link-libm-bitcode=path/to/bitcode_libm.bc").
>
> Your scenario above suggests that the goal is to allow clang to generate
> both llvm intrinsics and the glue which would then be used by LLVM to make
> it work for clang, but not in general. It's an improvement compared to what
> we have now, but I still think we should try a more general solution.
>
>
>> This does not yet work because -mlink-builtin-bitcode (which I assume
>> triggers the llvm-link logic) will drop the attributes of a declaration
>> if a definition is found. I think that should not be the case anyway;
>> instead, the union of the attributes should be kept.
>>
>> The benefit I see for the above is that the mapping is tied to the
>> declarations and doesn't live in a tablegen file far away. It works well
>> even if we can't map 1:1, and we could even restrict the "used" attribute
>> to anything that has an "implements" attribute.
>
> I do not think we need tablegen for anything here. I was thinking of just
> compiling a real math library (or a wrapper on top of libdevice) from C/C++
> sources.
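> As a rough sketch of what I mean (function names and list are illustrative,
> and the build setup is an assumption):
>
> ```
> // bitcode_libm.cpp -- thin wrapper libm on top of libdevice, compiled to
> // bitcode for the NVPTX target.
> extern "C" double __nv_cos(double);
> extern "C" double __nv_sin(double);
> extern "C" float __nv_cosf(float);
>
> extern "C" double cos(double x) { return __nv_cos(x); }
> extern "C" double sin(double x) { return __nv_sin(x); }
> extern "C" float cosf(float x) { return __nv_cosf(x); }
> ```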
>
> Our approaches are not mutually exclusive. If there's strong opposition to
> providing a bitcode libm for NVPTX, implementing it somewhere closer to
> clang would still be an improvement, even if it's not as general as I'd
> like. It should still be possible to allow LLVM to lower libcalls in NVPTX
> to the standard libm API, enabled with a flag, and just let the end users
> who are interested (e.g. JITs) provide their own implementation.
>
> --Artem
>
>
>
>> So:
>>
>> ```
>> __nv_A() { ... } // called, inlined and optimized as before, DCE'ed after.
>>
>> __nv_B() { ... } // not called, DCE'ed.
>>
>> __attribute__((implements("llvm.C")))
>> __nv_C() { ... } // calls are inlined and optimized as before, not DCE'ed
>>                    // though because of the attribute. Replaces llvm.C as
>>                    // callee in the special pass.
>> ```
>>
>> So "implements" gives you a way to statically replace a function
>> declaration
>> or definition with another one. I could see it being used to provide other
>> intrinsics to platforms with backends that don't support them.
>>
>> Does that make some sense?
>>
>> ~ Johannes
>>
>>
>>> WDYT?
>>>
>>> --Artem
>>>
>>>
>>>> On 3/10/21 8:44 PM, William Moses wrote:
>>>>> We could also consider doing something slightly broader.
>>>>>
>>>>> For example we could define a special attribute on top of the llvm.cos
>>>>> call/declaration etc with metadata or an attribute that points to the
>>>>> actual __nv_cos function. Then in a subsequent lowering pass the
>>>>> corresponding intrinsic with the relevant attribute has its uses replaced
>>>>> by the actual function.
>>>>>
>>>>>
>>>>> On Wed, Mar 10, 2021 at 7:57 PM Johannes Doerfert <
>>>>> johannesdoerfert at gmail.com> wrote:
>>>>>
>>>>>> On 3/10/21 6:22 PM, Artem Belevich wrote:
>>>>>>> On Wed, Mar 10, 2021 at 3:44 PM Johannes Doerfert <
>>>>>>> johannesdoerfert at gmail.com> wrote:
>>>>>>>
>>>>>>>> On 3/10/21 4:38 PM, Artem Belevich wrote:
>>>>>>>>> On Wed, Mar 10, 2021 at 1:55 PM Johannes Doerfert <
>>>>>>>>> johannesdoerfert at gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> On 3/10/21 3:25 PM, Artem Belevich wrote:
>>>>>>>>>>> On Wed, Mar 10, 2021 at 12:57 PM Johannes Doerfert <
>>>>>>>>>>> johannesdoerfert at gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Right. We could keep the definition of __nv_cos and friends
>>>>>>>>>>>> around. Right now, -ffast-math might just crash on the user,
>>>>>>>>>>>> which is arguably a bad thing. I can also see us benefiting
>>>>>>>>>>>> in various other ways from llvm.cos uses instead of __nv_cos
>>>>>>>>>>>> (assuming precision is according to the user requirements but
>>>>>>>>>>>> that is always a condition).
>>>>>>>>>>>>
>>>>>>>>>>>> It could be as simple as introducing __nv_cos into
>>>>>>>>>>>> "llvm.used" and a backend matching/rewrite pass.
>>>>>>>>>>>>
>>>>>>>>>>>> If the backend knew the libdevice location it could even pick
>>>>>>>>>>>> the definitions from there. Maybe we could link libdevice late
>>>>>>>>>>>> instead of eager?
>>>>>>>>>>>>
>>>>>>>>>>> It's possible, but it would require plumbing CUDA SDK awareness into
>>>>>>>>>>> LLVM. While the clang driver can deal with that, LLVM currently
>>>>>>>>>>> can't. The bitcode library path would have to be provided by the user.
>>>>>>>>>> The PTX backend could arguably be CUDA SDK aware, IMHO, it would
>>>>>>>>>> even be fine if the middle-end does the remapping to get inlining
>>>>>>>>>> and folding benefits also after __nv_cos is used. See below.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> The standard library as bitcode raises some questions.
>>>>>>>>>> Which standard library? CUDA's libdevice is a bitcode library, right?
>>>>>>>>> It's whatever LLVM will need to lower libcalls to. libdevice bitcode is
>>>>>>>>> the closest approximation of that we have at the moment.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>> * When do we want to do the linking? If we do it at the beginning,
>>>>>>>>>>> then the question is how to make sure unused functions are not
>>>>>>>>>>> eliminated before we may need them, as we don't know a priori what's
>>>>>>>>>>> going to be needed. We also do want the unused functions to be gone
>>>>>>>>>>> after we're done. Linking it in early would allow optimizing the code
>>>>>>>>>>> better at the expense of having to optimize a lot of code we'll throw
>>>>>>>>>>> away. Linking it in late has less overhead, but leaves the linked-in
>>>>>>>>>>> bitcode unoptimized, though it's probably in the ballpark of what
>>>>>>>>>>> would happen with a real library call. I.e. no inlining, etc.
>>>>>>>>>>>
>>>>>>>>>>> * It incorporates linking into LLVM, which is not LLVM's job.
>>>>>>>>>>> Arguably, the line should be drawn at the lowering to libcalls as it's
>>>>>>>>>>> done for other back-ends. However, we're also constrained by the need
>>>>>>>>>>> to have the linking done before we generate PTX, which prevents doing
>>>>>>>>>>> it after LLVM is done generating an object file.
>>>>>>>>>> I'm confused. Clang links in libdevice.bc early.
>>>>>>>>> Yes. Because that's where it has to happen if we want to keep LLVM
>>>>>>>>> unaware of CUDA SDK.
>>>>>>>>> It does not have to be the case if/when LLVM can do the linking itself.
>>>>>>>>>> If we make sure `__nv_cos` is not deleted early, we can at any point
>>>>>>>>>> "lower" `llvm.cos` to `__nv_cos` which is available. After the lowering
>>>>>>>>>> we can remove the artificial uses of `__nv_XXX` functions that we used
>>>>>>>>>> to keep the definitions around in order to remove them from the final
>>>>>>>>>> result.
>>>>>>>>>>
>>>>>>>>> This is the 'link early' approach, I should've been explicit that it's
>>>>>>>>> 'link early *everything*' as opposed to linking only what's needed at
>>>>>>>>> the beginning.
>>>>>>>>> It would work at the expense of having to process/optimize 500KB worth
>>>>>>>>> of bitcode for every compilation, whether it needs it or not.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> We get the benefit of having `llvm.cos` for some of the pipeline, and
>>>>>>>>>> we know it does not have all the bad effects while `__nv_cos` is
>>>>>>>>>> defined with inline assembly. We also get the benefit of inlining
>>>>>>>>>> `__nv_cos` and folding the implementation based on the arguments.
>>>>>>>>>> Finally, this should work with the existing pipeline: the linking is
>>>>>>>>>> the same as before, all we do is keep the definitions alive longer and
>>>>>>>>>> lower `llvm.cos` to `__nv_cos` in a middle-end pass.
>>>>>>>>>>
>>>>>>>>> Again, I agree that it is doable.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> This might be similar to the PTX solution you describe below but I
>>>>>>>>>> feel we get the inline benefit from this without actually changing the
>>>>>>>>>> pipeline at all.
>>>>>>>>>>
>>>>>>>>> So, to summarize:
>>>>>>>>> * link the library as bitcode early, add artificial placeholders for
>>>>>>>>> everything, compile, remove placeholders and DCE unused stuff away.
>>>>>>>>>        Pros:
>>>>>>>>>           - we're already doing most of it before clang hands off IR to
>>>>>>>>> LLVM, so it just pushes it a bit lower in the compilation.
>>>>>>>>>        Cons:
>>>>>>>>>           - runtime cost of optimizing libdevice bitcode,
>>>>>>>>>           - libdevice may be required for all NVPTX compilations?
>>>>>>>>>
>>>>>>>>> * link the library as bitcode late.
>>>>>>>>>         Pros:
>>>>>>>>>           - lower runtime cost than link-early approach.
>>>>>>>>>         Cons:
>>>>>>>>>           - We'll need to make sure that the NVVMReflect pass processes
>>>>>>>>> the library.
>>>>>>>>>           - fewer optimizations on the library functions. Some of the
>>>>>>>>> code gets DCE'ed away after NVVMReflect and the rest could be optimized
>>>>>>>>> better.
>>>>>>>>>           - libdevice may be required for all NVPTX compilations?
>>>>>>>>> * 'link' with the library as PTX appended as text to LLVM's output and
>>>>>>>>> let ptxas do the 'linking'
>>>>>>>>>        Pros: LLVM remains agnostic of CUDA SDK installation details. All
>>>>>>>>> it does is allow lowering libcalls and leave their resolution to the
>>>>>>>>> external tools.
>>>>>>>>>        Cons: Need to have the PTX library somewhere and need to
>>>>>>>>> integrate the 'linking' into the compilation process somehow.
>>>>>>>>>
>>>>>>>>> Neither is particularly good. If the runtime overhead of link-early is
>>>>>>>>> acceptable, then it may be a winner here, by a very small margin.
>>>>>>>>> link-as-PTX may be better conceptually as it keeps linking and
>>>>>>>>> compilation separate.
>>>>>>>>>
>>>>>>>>> As for the practical steps, here's what we need:
>>>>>>>>> - allow libcall lowering in NVPTX, possibly guarded by a flag. This is
>>>>>>>>> needed for all of the approaches above.
>>>>>>>>> - teach LLVM how to link in bitcode (and, possibly, control early/late
>>>>>>>>> mode)
>>>>>>>>> - teach clang driver to delegate libdevice linking to LLVM.
>>>>>>>>>
>>>>>>>>> This will allow us to experiment with all three approaches and see what
>>>>>>>>> works best.
>>>>>>>> I think if we embed knowledge about the nv_XXX functions we can
>>>>>>>> even get away without the cons you listed for early linking above.
>>>>>>>>
>>>>>>> WDYM by `embed knowledge about the nv_XXX functions`? By linking those
>>>>>>> functions in? Or do you mean that we should just declare them
>>>>>>> before/instead of linking libdevice in?
>>>>>> I mean by providing the "libcall lowering" pass. So the knowledge
>>>>>> that llvm.cos maps to __nv_cos.
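>>>>>> As a sketch, that knowledge could be as small as a table like this
>>>>>> (entries illustrative, not exhaustive):
>>>>>>
>>>>>> ```
>>>>>> // Intrinsic -> libdevice mapping a libcall-lowering pass could carry.
>>>>>> static const struct {
>>>>>>   const char *IntrinsicName;
>>>>>>   const char *LibdeviceName;
>>>>>> } NVPTXLibcallMap[] = {
>>>>>>     {"llvm.cos.f64", "__nv_cos"}, {"llvm.cos.f32", "__nv_cosf"},
>>>>>>     {"llvm.sin.f64", "__nv_sin"}, {"llvm.sin.f32", "__nv_sinf"},
>>>>>>     {"llvm.exp.f64", "__nv_exp"}, {"llvm.exp.f32", "__nv_expf"},
>>>>>> };
>>>>>> ```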
>>>>>>
>>>>>>>> For early link I'm assuming an order similar to [0] but I also discuss
>>>>>>>> the case where we don't link libdevice early for a TU.
>>>>>>>
>>>>>>> That link just describes the steps needed to use libdevice. It does not
>>>>>>> deal with how/where it fits in the LLVM pipeline.
>>>>>>> The gist is that NVVMReflect replaces some conditionals with constants.
>>>>>>> libdevice uses that as a poor man's IR preprocessor, conditionally
>>>>>>> enabling different implementations and relying on DCE and constant
>>>>>>> folding to remove unused parts and eliminate the now useless branches.
>>>>>>> While running NVVMReflect alone will make libdevice code valid and
>>>>>>> usable, it would still benefit from further optimizations. I do not know
>>>>>>> to what degree, though.
>>>>>>>
>>>>>>>
>>>>>>>> Link early:
>>>>>>>> 1) clang emits module.bc and links in libdevice.bc but with the
>>>>>>>>         `optnone`, `noinline`, and "used" attribute for functions in
>>>>>>>>         libdevice. ("used" is not an attribute but could as well be.)
>>>>>>>>         At this stage module.bc might call __nv_XXX or llvm.XXX freely
>>>>>>>>         as defined by -ffast-math and friends.
>>>>>>>>
>>>>>>> That could work. Just carrying extra IR around would probably be OK.
>>>>>>> We may want to do NVVMReflect as soon as we have it linked in and,
>>>>>>> maybe, allow optimizing the functions that are explicitly used already.
>>>>>> Right. NVVMReflect can be run twice and with `alwaysinline`
>>>>>> on the call sites of __nv_XXX functions we will actually
>>>>>> inline and optimize them while the definitions are just "dragged
>>>>>> along" in case we need them later.
>>>>>>
>>>>>>
>>>>>>>> 2) Run some optimizations in the middle end, maybe till the end of
>>>>>>>>         the inliner loop, unsure.
>>>>>>>> 3) Run a libcall lowering pass and another NVVMReflect pass (or the
>>>>>>>>         only instance thereof). We effectively remove all llvm.XXX calls
>>>>>>>>         in favor of __nv_XXX now. Note that we haven't spent (much) time
>>>>>>>>         on the libdevice code as it is optnone and most passes are good
>>>>>>>>         at skipping those. To me, it's unclear if the used parts should
>>>>>>>>         not be optimized before we inline them anyway to avoid redoing
>>>>>>>>         the optimizations over and over (per call site). That needs
>>>>>>>>         measuring I guess. Also note that we can still retain the current
>>>>>>>>         behavior for direct calls to __nv_XXX if we mark the call sites
>>>>>>>>         as `alwaysinline`, or at least the behavior is almost like the
>>>>>>>>         current one is.
>>>>>>>> 4) Run an always inliner pass on the __nv_XXX calls because it is
>>>>>>>>         something we would do right now. Alternatively, remove `optnone`
>>>>>>>>         and `noinline` from the __nv_XXX calls.
>>>>>>>> 5) Continue with the pipeline as before.
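>>>>>>>>
>>>>>>>> As a sketch, step 1 could boil down to something like this on the LLVM
>>>>>>>> API side (helper name made up):
>>>>>>>>
>>>>>>>> ```
>>>>>>>> #include "llvm/IR/Module.h"
>>>>>>>> #include "llvm/Transforms/Utils/ModuleUtils.h"
>>>>>>>> using namespace llvm;
>>>>>>>>
>>>>>>>> // After linking libdevice.bc, keep its definitions around untouched
>>>>>>>> // until the libcall-lowering pass has run.
>>>>>>>> static void prepareLibdeviceDefinitions(Module &M) {
>>>>>>>>   for (Function &F : M) {
>>>>>>>>     if (F.isDeclaration() || !F.getName().startswith("__nv_"))
>>>>>>>>       continue;
>>>>>>>>     F.addFnAttr(Attribute::NoInline);     // required alongside optnone
>>>>>>>>     F.addFnAttr(Attribute::OptimizeNone); // skip it in the middle end
>>>>>>>>     appendToUsed(M, {&F});                // survive global DCE for now
>>>>>>>>   }
>>>>>>>> }
>>>>>>>> ```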
>>>>>>>>
>>>>>>>>
>>>>>>> SGTM.
>>>>>>>
>>>>>>>
>>>>>>>> As mentioned above, `optnone` avoids spending time on the libdevice
>>>>>>>> until we "activate" it. At that point (globals) DCE can be scheduled
>>>>>>>> to remove all unused parts right away. I don't think this is (much)
>>>>>>>> more expensive than linking libdevice early right now.
>>>>>>>>
>>>>>>>> Link late, aka. translation units without libdevice:
>>>>>>>> 1) clang emits module.bc but does not link in libdevice.bc, it will be
>>>>>>>>         made available later. We still can mix __nv_XXX and llvm.XXX
>>>>>>>>         calls freely as above.
>>>>>>>> 2) Same as above.
>>>>>>>> 3) Same as above.
>>>>>>>> 4) Same as above but effectively a no-op, no __nv_XXX definitions are
>>>>>>>>         available.
>>>>>>>> 5) Same as above.
>>>>>>>>
>>>>>>>>
>>>>>>>> I might misunderstand something about the current pipeline but from [0]
>>>>>>>> and the experiments I ran locally it looks like the above should cover
>>>>>>>> all the cases. WDYT?
>>>>>>>>
>>>>>>>>
>>>>>>> The `optnone` trick may indeed remove many of the practical differences
>>>>>>> between the early/late approaches.
>>>>>>> In principle it should work.
>>>>>>>
>>>>>>> Next question is -- is libdevice sufficient to satisfy LLVM's
>>>>>>> assumptions about the standard library?
>>>>>>> While it does provide most of the equivalents of libm functions, the set
>>>>>>> is not complete and some of the functions differ from their libm
>>>>>>> counterparts.
>>>>>>> The differences are minor, so we should be able to deal with it by
>>>>>>> generating a few wrapper functions for the odd cases.
>>>>>>> Here's what clang does to provide math functions using libdevice:
>>>>>>> https://github.com/llvm/llvm-project/blob/main/clang/lib/Headers/__clang_cuda_math.h
>>>>>> Right now, clang will generate any llvm intrinsic and we crash, so
>>>>>> anything else is probably a step in the right direction. Eventually, we
>>>>>> should "lower" all intrinsics that the NVPTX backend can't handle or at
>>>>>> least emit a nice error message. Preferably, clang would know what we
>>>>>> can't deal with and not generate intrinsic calls for those in the first
>>>>>> place.
>>>>>>
>>>>>>
>>>>>>> The most concerning aspect of libdevice is that we don't know when we'll
>>>>>>> no longer be able to use the libdevice bitcode. My understanding is that
>>>>>>> IR does not guarantee binary stability and at some point we may just be
>>>>>>> unable to use it. Ideally we need our own libm for GPUs.
>>>>>> For OpenMP I did my best to avoid writing libm (code) for GPUs by
>>>>>> piggybacking on CUDA and libc++ implementations; I hope it will stay that
>>>>>> way. That said, if the need arises we might really have to port libc++ to
>>>>>> the GPUs.
>>>>>>
>>>>>> Back to the problem with libdevice. I agree that NVIDIA's solution of
>>>>>> shipping a .bc library is suboptimal, but with the existing, or an
>>>>>> extended, auto-upgrader we might be able to make that work reasonably
>>>>>> well for the foreseeable future. That problem is orthogonal to what we
>>>>>> are discussing above, I think.
>>>>>>
>>>>>> ~ Johannes
>>>>>>
>>>>>>
>>>>>>> --Artem
>>>>>>>
>>>>>>>
>>>>>>>> ~ Johannes
>>>>>>>>
>>>>>>>>
>>>>>>>> P.S. If the rewrite capability (aka libcall lowering) is generic we
>>>>>>>>           could use the scheme for many other things as well.
>>>>>>>>
>>>>>>>>
>>>>>>>> [0] https://llvm.org/docs/NVPTXUsage.html#linking-with-libdevice
>>>>>>>>
>>>>>>>>
>>>>>>>>> --Artem
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> ~ Johannes
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> One thing that may work within the existing compilation model is to
>>>>>>>>>>> pre-compile the standard library into PTX and then textually embed
>>>>>>>>>>> relevant functions into the generated PTX, thus pushing the 'linking'
>>>>>>>>>>> phase past the end of LLVM's compilation and making it look closer to
>>>>>>>>>>> the standard compile/link process. This way we'd only enable libcall
>>>>>>>>>>> lowering in NVPTX, assuming that the library functions will be
>>>>>>>>>>> magically available out there.
>>>>>>>>>>> Injection of PTX could be done with an external script outside of
>>>>>>>>>>> LLVM and it could be incorporated into the clang driver. Bonus points
>>>>>>>>>>> for the fact that this scheme is compatible with -fgpu-rdc out of the
>>>>>>>>>>> box -- assemble the PTX with `ptxas -rdc` and then actually link with
>>>>>>>>>>> the library, instead of injecting its PTX before invoking ptxas.
>>>>>>>>>>>
>>>>>>>>>>> --Artem
>>>>>>>>>>>
>>>>>>>>>>> Trying to figure out a good way to have the cake and eat it too.
>>>>>>>>>>>> ~ Johannes
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 3/10/21 2:49 PM, William Moses wrote:
>>>>>>>>>>>>> Since clang (and arguably any other frontend that uses) should link
>>>>>>>>>>>>> in libdevice, could we lower these intrinsics to the libdevice code?
>>>>>>>>>>> The linking happens *before* LLVM gets to work on IR.
>>>>>>>>>>> As I said, it's a workaround, not the solution. It's possible for LLVM
>>>>>>>>>>> to still attempt lowering something in the IR into a libcall and we
>>>>>>>>>>> would not be able to deal with that. It happens to work well enough in
>>>>>>>>>>> practice.
>>>>>>>>>>> Do you have an example where you see the problem with -ffast-math?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>> For example, consider compiling the simple device function below:
>>>>>>>>>>>>> ```
>>>>>>>>>>>>> // /mnt/sabrent/wmoses/llvm13/build/bin/clang tmp.cu -S -emit-llvm
>>>>>>>>>>>>> //   --cuda-path=/usr/local/cuda-11.0 -L/usr/local/cuda-11.0/lib64
>>>>>>>>>>>>> //   --cuda-gpu-arch=sm_37
>>>>>>>>>>>>> __device__ double f(double x) {
>>>>>>>>>>>>>            return cos(x);
>>>>>>>>>>>>> }
>>>>>>>>>>>>> ```
>>>>>>>>>>>>>
>>>>>>>>>>>>> The LLVM module for it is as follows:
>>>>>>>>>>>>>
>>>>>>>>>>>>> ```
>>>>>>>>>>>>> ...
>>>>>>>>>>>>> define dso_local double @_Z1fd(double %x) #0 {
>>>>>>>>>>>>> entry:
>>>>>>>>>>>>>          %__a.addr.i = alloca double, align 8
>>>>>>>>>>>>>          %x.addr = alloca double, align 8
>>>>>>>>>>>>>          store double %x, double* %x.addr, align 8
>>>>>>>>>>>>>          %0 = load double, double* %x.addr, align 8
>>>>>>>>>>>>>          store double %0, double* %__a.addr.i, align 8
>>>>>>>>>>>>>          %1 = load double, double* %__a.addr.i, align 8
>>>>>>>>>>>>>          %call.i = call contract double @__nv_cos(double %1) #7
>>>>>>>>>>>>>          ret double %call.i
>>>>>>>>>>>>> }
>>>>>>>>>>>>>
>>>>>>>>>>>>> define internal double @__nv_cos(double %a) #1 {
>>>>>>>>>>>>>          %q.i = alloca i32, align 4
>>>>>>>>>>>>> ```
>>>>>>>>>>>>>
>>>>>>>>>>>>> Obviously we would need to do something to ensure these functions
>>>>>>>>>>>>> don't get deleted prior to their use in lowering from intrinsic to
>>>>>>>>>>>>> libdevice.
>>>>>>>>>>>>> ...
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Mar 10, 2021 at 3:39 PM Artem Belevich <tra at google.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> On Wed, Mar 10, 2021 at 11:41 AM Johannes Doerfert <
>>>>>>>>>>>>>> johannesdoerfert at gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Artem, Justin,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I am running into a problem and I'm curious if I'm missing
>>>>>>>>>>>>>>> something or if the support is simply missing.
>>>>>>>>>>>>>>> Am I correct to assume the NVPTX backend does not deal with
>>>>>>>>>>>>>>> `llvm.sin` and friends?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Correct. It can't deal with anything that may need to lower to a
>>>>>>>>>>>>>> standard library call.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This is what I see, with some variations:
>>>>>>>>>> https://godbolt.org/z/PxsEWs
>>>>>>>>>>>>>>> If this is missing in the backend, is there a plan to get this
>>>>>>>>>>>>>>> working? I'd really like to have the intrinsics in the middle end
>>>>>>>>>>>>>>> rather than __nv_cos, not to mention that -ffast-math does emit
>>>>>>>>>>>>>>> intrinsics and crashes.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> It all boils down to the fact that PTX does not have the standard
>>>>>>>>>>>>>> libc/libm which LLVM could lower the calls to, nor does it have a
>>>>>>>>>>>>>> 'linking' phase where we could link such a library in, if we had it.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Libdevice bitcode does provide the implementations for some of the
>>>>>>>>>>>>>> functions (though with a __nv_ prefix) and clang links it in in
>>>>>>>>>>>>>> order to avoid generating IR that LLVM can't handle, but that's a
>>>>>>>>>>>>>> workaround that does not help LLVM itself.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --Artem
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> ~ Johannes
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> ───────────────────
>>>>>>>>>>>>>>> ∽ Johannes (he/his)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> --Artem Belevich
>>>>>>>>>>>>>>
>

