[cfe-dev] Heterogeneous target attributes overloading in Clang CUDA (__CUDA_ARCH__ considered harmful)

Fri Oct 26 18:04:23 PDT 2018

Today, CUDA C++ has a macro that can be used to distinguish which architecture
(either the host architecture, a specific device architecture, or any device
architecture) code is currently being compiled for.

When CUDA code is compiled for the host, __CUDA_ARCH__ is not defined. When it
is compiled for the device, it is defined to a value that indicates the SM architecture.

At face value, this seems like a useful way to customize how heterogeneous code
is implemented on a particular architecture:

  __host__ __device__
  uint32_t iadd3(uint32_t x, uint32_t y, uint32_t z) {
  #if __CUDA_ARCH__ >= 200
    asm ("vadd.u32.u32.u32.add %0, %1, %2, %3;" : "=r"(x) : "r"(x), "r"(y), "r"(z));
  #else
    x = x + y + z;
  #endif
    return x;
  }

However, __CUDA_ARCH__ is only well suited to a split compilation CUDA compiler,
like NVCC, which uses a separate host compiler (GCC, Clang, MSVC, etc) and device
compiler, preprocessing and compiling your code once for each target architecture
(once for the host, and one time for each target device architecture).

__CUDA_ARCH__ has some caveats, however. The NVCC compiler has to see all kernel
function declarations (e.g. __global__ functions) during both host and device
compilation, to generate the host side launch stubs and the actual device side
kernel code. Otherwise, NVCC may not compile the device side kernel code, either
because it believes it is unused or because it is never instantiated (in the case
of a template kernel function). This, regretably, will not fail at compile time,
but instead fails at runtime when you attempt to launch the (non-existant) kernel.

Consider the following code. It unconditionally calls `parallel::reduce_n_impl`
on the host, which instantiates some (unseen) template kernel functions during
host compilation. However, in device code, if THRUST_HAS_CUDART is false,
`parallel::reduce_n_impl` is never instantiated and the actual device code for
the kernel functions are never compiled.

  #if !defined(__CUDA_ARCH__) || (__CUDA_ARCH__>= 350 && defined(__CUDACC_RDC__))
     // We're either not compiling as device code, or we are compiling as device
     // code and we can launch kernels from device code (SM 3.5 and higher +
     // relocatable device code is required for the device side runtime which is
     // needed to do device side launches).
  #  define THRUST_HAS_CUDART 1
  #else
  #  define THRUST_HAS_CUDART 0
  #endif

  namespace thrust {

  #pragma nv_exec_check_disable
  template <typename Derived,
            typename InputIt,
            typename Size,
            typename T,
            typename BinaryOp>
  __host__ __device__
  T reduce_n(execution_policy<Derived>& policy,
             InputIt                    first,
             Size                       num_items,
             T                          init,
             BinaryOp                   binary_op)
  {
    // Broken version:
    #if THRUST_HAS_CUDART
      return system::cuda::reduce_n_impl(policy, first, num_items, init, binary_op);
    #else
      // We are running on the device and there is no device side runtime, so we
      // can't launch a kernel to do the reduction in parallel. Instead, we just
      // do a sequential reduction in the calling thread.
      return system::sequential::reduce_n_impl(first, num_items, init, binary_op);
    #endif
  }

  } // namespace thrust

Instead, we end up using the rather odd pattern of adding a (non-constexpr) if
statement whose condition is known at compile time. This ensures the kernel function
is instantiated during device compilation, even though it is not actually used.
Fortunately, while NVCC can as-if optimize away the if statement, it cannot treat
the instantiation as unused.

  #pragma nv_exec_check_disable
  template <typename Derived,
            typename InputIt,
            typename Size,
            typename T,
            typename BinaryOp>
  __host__ __device__
  T reduce_n(execution_policy<Derived>& policy,
             InputIt                    first,
             Size                       num_items,
             T                          init,
             BinaryOp                   binary_op)
  {
    if (THRUST_HAS_CUDART)
      return parallel::reduce_n_impl(policy, first, num_items, init, binary_op);

    #if !THRUST_HAS_CUDART
      // We are running on the device and there is no device side runtime, so we
      // can't launch a kernel to do the reduction in parallel. Instead, we just
      // do a sequential reduction in the calling thread.
      return sequential::reduce_n_impl(first, num_items, init, binary_op);
    #endif
  }

For more background, see:

https://github.com/NVlabs/cub/issues/30
https://stackoverflow.com/questions/51248770/cuda-arch-flag-with-thrust-execution-policy

For a merged parse CUDA compiler, like Clang CUDA, __CUDA_ARCH__ is a poor fit,
because as a textual macro it can be used to completely change the code that
the compiler consumes during host and device compilation, essentially forcing
separate preprocessing and parsing.

Clang CUDA offers one alternative today, __host__ / __device__ overloading,
which is better suited to a merged parse model:

  __device__
  uint32_t iadd3(uint32_t x, uint32_t y, uint32_t z) {
    asm ("vadd.u32.u32.u32.add %0, %1, %2, %3;" : "=r"(x) : "r"(x), "r"(y), "r"(z));
    return x;
  }

  __host__
  uint32_t iadd3(uint32_t x, uint32_t y, uint32_t z) {
    return x + y + z;
  }

However, this approach does not allow us to customize code for specific device
architectures. Note that the above code will not compile on SM 1.0 devices, as
the inline assembly contains instructions unavailable on those platforms.

Tuning for specific device architectures is critical for high performance CUDA
libraries, like Thrust. We need to be able to select different algorithms and
use architecture specific facilities to get speed of light performance.

Fortunately, there is some useful prior art. Clang (and GCC) has a related feature,
__attribute__((target("..."))), which can be used to define a function "overloaded"
on the architecture it is compiled for. One common use case for this feature is
implementing functions that utilize micro-architecture specific CPU SIMD
instructions:

  using double4 = double __attribute__((__vector_size__(32)));

  __attribute__((target("sse")))
  double4 fma(double4d x, double4 y, double4 z);

  __attribute__((target("avx")))
  double4 fma(double4d x, double4 y, double4 z);

  __attribute__((target("default")))
  double4 fma(double4d x, double4 y, double4 z); // "Fallback" implementation.

This attribute can also be used to target specific architectures:

  __attribute__((target("arch=atom")))
  void foo(); // Will be called on 'atom' processors.

  __attribute__((target("default")))
  void foo(); // Will be called on any other processors.

This could easily be extended for heterogeneous compilation:

  __attribute__((target("host:arch=skylake")))
  void foo();

  __attribute__((target("arch=atom")))
  void foo(); // Implicitly "host:arch=atom".

  __attribute__((target("host:default")))
  void foo();

  __attribute__((target("device:arch=sm_20")))
  void foo();

  __attribute__((target("device:arch=sm_60")))
  void foo();

  __attribute__((target("device:default")))
  void foo();

Or, perhaps more concisely, we could introduce this shorthand:

  __host__("arch=skylake")
  void foo();

  __host__
  void foo(); // Implicitly "host:default".

  __device__("arch=sm_20")
  void foo();

  __device__("arch=sm_60")
  void foo();

  __device__ // Implicitly "device:default".
  void foo();

Another place that we use _CUDA_ARCH__ today in Thrust and CUB is in
metaprogramming code that selects the correct "strategies" that should be
used to implement a particular algorithm:

  enum arch {
    host,
    sm_30, sm_32, sm_35, // Kepler
    sm_50, sm_52, sm_53, // Maxwell
    sm_60, sm_61, sm_62, // Pascal
    sm_70,               // Volta
    sm_72, sm_75         // Turing
  };

  __host__ __device__
  constexpr arch select_arch()
  {
    switch (__CUDA_ARCH__)
    {
      // ...
    };
  }

  template <class T, arch Arch = select_arch()>
  struct radix_sort_tuning;

  template <class T>
  struct radix_sort_tuning<T, sm_35>
  {
    constexpr size_t INPUT_SIZE = sizeof(T);

    constexpr size_t NOMINAL_4B_ITEMS_PER_THREAD = 11;
    constexpr size_t ITEMS_PER_THREAD
      = std::min(NOMIMAL_4B_ITEMS_PER_THREAD,
          std::max(1, (NOMIMAL_4B_ITEMS_PER_THREAD * 4 / INPUT_SIZE)));

    constexpr size_t BLOCK_THREADS = 256;
    constexpr auto BLOCK_LOAD_STRATEGY = BLOCK_LOAD_WARP_TRANSPOSE;
    constexpr auto CACHE_LOAD_STRATEGY = LOAD_LDG;
    constexpr auto BLOCK_STORE_STRATEGY = BLOCK_STORE_WARP_TRANSPOSE;
  };

  template <typename T>
  struct radix_sort_tuning<T, sm_50> { /* ... */ };

  // ...

With heterogeneous target attributes, we could implement select_arch like
so:

  __host__
  constexpr arch select_arch() { return host; }

  __device__("arch=sm_30")
  constexpr arch select_arch() { return sm_30; }

  __device__("arch=sm_35")
  constexpr arch select_arch() { return sm_35; }

  // ...

You could also potentially use this with if constexpr:

  void foo()
  {
    // Moral equivalent of #if __CUDA_ARCH__
    if constexpr (host != select_arch())
      // ...
    else
      // ...
  }

This feature would also make it much easier to port some of the more tricky parts
of libc++ to GPUs, such as iostreams and concurrency primitives.

It would be awesome if we could take __host__ / __device__ overloading a step
further and make it a full fledged replacement for __CUDA_ARCH__. It would provide
a possible future migration path away from __CUDA_ARCH__, which would enable us to
move to true merged parsing for heterogeneous C++: preprocess once, parse once,
perform platform-agnostic optimizations once, code gen multiple times.

So, questions:

- Can target attributes go on constexpr functions today?
- Does anyone have suggestions for how this approach could be improved? Alternatives?
- Is there interest in this in Clang CUDA?

------------------------------------------------------
Bryce Adelstein Lelbach aka wash
ISO C++ LEWGI Chair
CppCon and C++Now Program Chair
Thrust Maintainer, HPX Developer
CUDA Convert and Reformed AVX Junkie

Ask "Dumb" Questions
------------------------------------------------------

-----------------------------------------------------------------------------------
This email message is for the sole use of the intended recipient(s) and may contain
confidential information.  Any unauthorized review, use, disclosure or distribution
is prohibited.  If you are not the intended recipient, please contact the sender by
reply email and destroy all copies of the original message.
-----------------------------------------------------------------------------------
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20181027/80abd2ce/attachment.html>