[llvm-dev] NVPTX codegen for llvm.sin (and friends)

Wed Apr 28 15:56:32 PDT 2021

Hi all,

Reviving this thread as Johannes and I recently had some time to take a
look and do some additional design work. We'd love any thoughts on the
following proposal.

*Background:*

Standard math (and potentially other) functions do not exist for AMD and
NVIDIA GPUs, so LLVM-based tools must instead call architecture-specific
functions that perform similar computations.

For example in clang/lib/Headers/__clang_hip_math.h:

__DEVICE__
double sqrt(double __x) { return __ocml_sqrt_f64(__x); }

and clang/lib/Headers/__clang_cuda_math.h:

__DEVICE__ double sqrt(double __a) { return __nv_sqrt(__a); }

In the case of CUDA, the definitions of these functions are found by
immediately linking against a corresponding CUDA libdevice.bc.

This design presents several problems:

1) It is illegal to use LLVM math intrinsics in GPU code as these
intrinsics do not have definitions.

While in theory we could define the lowering of these intrinsics to be a
table which looks up the correct __nv_sqrt, this would require the
definition of all such functions to remain or otherwise be available. As
it's undesirable for the LLVM backend to be aware of CUDA paths, etc, this
means that the original definitions brought in by merging libdevice.bc must
be maintained. Currently these are deleted if they are unused (as libdevice
has them marked as internal).

2) GPU math functions aren't able to be optimized, unlike standard math
functions.

Since LLVM has no idea what these foreign functions are, they cannot be
optimized. This is problematic in two ways. First, these functions do not
have all the relevant attributes one might expect (inaccessiblememonly,
willreturn, etc). Second, they cannot benefit from instcombine-style
optimizations that recognize math intrinsics. For example, a call to sin(0)
from source code will remain a call to __ocml_sin_f32(0) [if on AMD]
rather than being replaced with 0.
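
To make the second point concrete, here is a toy model in C++ (not LLVM
code) of a folder that only understands the names it has semantics for.
The ConstCall struct and tryFold function are hypothetical names invented
for this sketch; the only assumption carried over from above is that
llvm.sin.* is recognized while a vendor name such as __ocml_sin_f32 is
opaque.

```cpp
#include <cmath>
#include <string>

// Toy stand-in for a call instruction with a single constant operand.
struct ConstCall {
    std::string callee;
    double arg;
};

// Hypothetical folder: succeeds only for callee names whose semantics it
// knows, loosely mirroring how intrinsic-aware folding works.
// On success, writes the folded value to `result` and returns true.
bool tryFold(const ConstCall &call, double &result) {
    if (call.callee == "llvm.sin.f32") {
        result = std::sin(call.arg);
        return true;
    }
    if (call.callee == "llvm.cos.f32") {
        result = std::cos(call.arg);
        return true;
    }
    // Opaque vendor function (e.g. __ocml_sin_f32): nothing is known
    // about it, so the call has to stay in place.
    return false;
}
```

In this model tryFold succeeds for {"llvm.sin.f32", 0.0} but not for
{"__ocml_sin_f32", 0.0}, even though both compute the same value.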

These two design issues make it difficult for tools (frontends, target
offloading, the Enzyme AD tool, etc) to generate GPU code, and difficult
for LLVM to optimize that code effectively.

*Design Constraints:*

To remedy the problems described above we need a design that meets the
following:
* Does not require modifying libdevice.bc or other code shipped by a
vendor-specific installation
* Allows llvm math intrinsics to be lowered to device-specific code
* Keeps definitions of code used to implement intrinsics until after all
potential relevant intrinsics (including those created by LLVM passes) have
been lowered.

*Initial Design:*

To remedy this we propose a refined version of the implements mechanism
described earlier in this thread. Specifically, consider the example below:

define internal double @my_cos_fast(double %d) {
  ...
}

declare double @my_cos(double %d)

define double @foo(double %d, float %f) {
  %c1 = tail call fast double @llvm.cos.f64(double %d)
  %c2 = tail call fast double @cos(double %d)
  ret double %c2
}

declare double @cos(double) !metadata !1
declare double @llvm.cos.f64(double) !metadata !0

!0 = !{!"implemented_by", double(double)* @my_cos}
!1 = !{!"implemented_by", double(double)* @my_cos_fast}

Here, each function that we may want to provide an implementation for (in
this case cos and llvm.cos.f64) has a metadata tag "implemented_by"
followed by the function which it will be replaced with. The type signature
of the original function and its implementation must match.

The implemented_by metadata will be defined to ensure both the replacement
and the replacee will be kept around (for example to ensure that LLVM
passes that generate a call to llvm.cos will still have a definition).
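
As a rough illustration of the keep-alive rule, here is a toy C++ model
(not LLVM code) of a dead-global sweep that treats implemented_by as a
use. The Global struct, its keepAlways field, and liveGlobals are
hypothetical names for this sketch; a real change would live in GlobalOpt
and related passes.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Toy global value. keepAlways models something a sweep may never drop
// (e.g. an externally visible definition). implementedBy is empty when
// the global carries no implemented_by metadata.
struct Global {
    bool keepAlways;
    std::vector<std::string> callees;
    std::string implementedBy;
};

// Returns the names that survive the sweep. Anything carrying
// implemented_by is a root, and its target is reachable through it.
std::set<std::string> liveGlobals(const std::map<std::string, Global> &mod) {
    std::set<std::string> live;
    std::vector<std::string> work;
    for (const auto &g : mod)
        if (g.second.keepAlways || !g.second.implementedBy.empty())
            work.push_back(g.first);
    while (!work.empty()) {
        std::string name = work.back();
        work.pop_back();
        auto it = mod.find(name);
        // Skip unknown names and names that were already visited.
        if (it == mod.end() || !live.insert(name).second)
            continue;
        for (const std::string &c : it->second.callees)
            work.push_back(c);
        if (!it->second.implementedBy.empty())
            work.push_back(it->second.implementedBy);
    }
    return live;
}
```

With this rule, an unused llvm.cos.f64 declaration tagged with my_cos
keeps both itself and my_cos alive, while an unrelated unused internal
helper is still dropped.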

After all passes that could generate such intrinsics and instruction
simplifications have run, a new LLVM optimization pass replaces uses of
each annotated function with its implementation.
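
The replacement step can be sketched with a toy C++ model (again, not
LLVM code). Func, Call, Module, and lowerImplementedBy are hypothetical
names for this sketch, and type signatures are modeled as plain strings;
an actual pass would operate on llvm::Function and call instructions.

```cpp
#include <map>
#include <string>
#include <vector>

// Toy stand-ins for IR entities.
struct Func {
    std::string type;           // e.g. "double(double)"
    std::string implementedBy;  // empty when there is no metadata
};

struct Call {
    std::string callee;
};

struct Module {
    std::map<std::string, Func> funcs;
    std::vector<Call> calls;
};

// Rewrites every call whose callee carries implemented_by so that it
// targets the implementation, then drops the metadata from all
// functions. Returns the number of rewritten calls.
int lowerImplementedBy(Module &m) {
    int rewritten = 0;
    for (Call &c : m.calls) {
        auto it = m.funcs.find(c.callee);
        if (it == m.funcs.end() || it->second.implementedBy.empty())
            continue;
        auto impl = m.funcs.find(it->second.implementedBy);
        // The proposal requires matching signatures; this sketch simply
        // leaves a call untouched when they do not match.
        if (impl == m.funcs.end() || impl->second.type != it->second.type)
            continue;
        c.callee = impl->first;
        ++rewritten;
    }
    for (auto &entry : m.funcs)
        entry.second.implementedBy.clear();
    return rewritten;
}
```

Applied to the example above, calls to llvm.cos.f64 and cos would end up
calling my_cos and my_cos_fast respectively, and existing direct calls to
my_cos are left alone.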

*Proposed Patches:*

1) Allow metadata on declaration [not just definition]

2) Tell GlobalOpt and other passes not to delete globals using/used in
implemented_by

3) Write an implemented_by pass that scans all functions, replaces calls,
and removes the metadata

4) Add Clang attributes to expose implements and use in nvptx/amd headers

Cheers,
Billy

On Fri, Mar 12, 2021 at 6:00 PM Artem Belevich via llvm-dev <
llvm-dev at lists.llvm.org> wrote:

>
>
> On Fri, Mar 12, 2021 at 2:39 PM James Y Knight <jyknight at google.com>
> wrote:
>
>>
>>
>> On Fri, Mar 12, 2021 at 1:51 PM Artem Belevich via llvm-dev <
>> llvm-dev at lists.llvm.org> wrote:
>>
>>> Also, bitcode is platform specific. I can imagine building a bitcode
>>>> file during the
>>>> build but shipping one means you have to know ABI and datalayout or
>>>> hope they are the same everywhere.
>>>>
>>>
>>> Agreed. We will likely need multiple variants. We will compile
>>> specifically for NVPTX or AMDGPU and we will know specific ABI and the data
>>> layout for them regardless of the host we're building on.
>>>
>>> It appears to me that the difference vs what we have now is that we'll
>>> need to have the libm sources somewhere, the process to build them for
>>> particular GPUs (that may need to be done out of the tree as it may need
>>> CUDA/HIP SDKs) and having to incorporate such libraries into llvm
>>> distribution.
>>>
>>> OK. I'll agree that that may be a bit too much for now.
>>>
>>
>> It sounded before like you were saying the library should effectively
>> be function aliases for standard libm names, to call __nv_ names. Isn't it
>> utterly trivial to generate such a bitcode file as part of the toolchain
>> build, without requiring any external SDKs?
>>
>
> That's true for most, but not all functions provided by libdevice. We'd
> still need something that's a bit more involved.
>
> --Artem
>
>
>
>
>
> --
> --Artem Belevich