[PATCH] D77954: [CUDA][HIP] Fix host/device based overload resolution

Yaxun Liu via Phabricator via cfe-commits cfe-commits at lists.llvm.org
Fri Apr 24 05:22:26 PDT 2020


yaxunl marked 6 inline comments as done.
yaxunl added inline comments.


================
Comment at: clang/lib/Sema/SemaOverload.cpp:9749
+  if (isBetterMultiversionCandidate(Cand1, Cand2))
+    return true;
+
----------------
rjmccall wrote:
> tra wrote:
> > rjmccall wrote:
> > > erichkeane wrote:
> > > > yaxunl wrote:
> > > > > echristo wrote:
> > > > > > rjmccall wrote:
> > > > > > > yaxunl wrote:
> > > > > > > > rjmccall wrote:
> > > > > > > > > If we move anything below this check, it needs to figure out a tri-state so that it can return false if `Cand2` is a better candidate than `Cand1`.  Now, that only matters if multiversion functions are supported under CUDA, but if you're relying on them not being supported, that should at least be commented on.
> > > > > > > > Multiversion host functions are orthogonal to CUDA and therefore should be supported. Multiversioning of device, host device, and global functions is not supported. However, this change does not make things worse, and it should continue to work if they are supported.
> > > > > > > > 
> > > > > > > > Host/device-based overload resolution is mostly for determining the viability of a function. If two functions are both viable, other factors should take precedence. This general rule has been applied to cases other than multiversion; I think it should also apply to multiversion.
> > > > > > > > 
> > > > > > > > I will make isBetterMultiversionCandidate return a three-state result.
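> > > > > > > > 
> > > > > > > > To illustrate the tri-state idea, here is a minimal standalone sketch (the type and function names are made up for illustration, not the actual patch):
> > > > > > > > 
> > > > > > > > ```
> > > > > > > > enum class Ordering { Worse, Equal, Better };
> > > > > > > > 
> > > > > > > > // Stand-in for clang's OverloadCandidate; only what the sketch needs.
> > > > > > > > struct Candidate { int MultiversionRank; };
> > > > > > > > 
> > > > > > > > // Tri-state comparison: the caller can bail out in either direction,
> > > > > > > > // not just when the first candidate wins.
> > > > > > > > Ordering compareMultiversion(const Candidate &C1, const Candidate &C2) {
> > > > > > > >   if (C1.MultiversionRank > C2.MultiversionRank) return Ordering::Better;
> > > > > > > >   if (C1.MultiversionRank < C2.MultiversionRank) return Ordering::Worse;
> > > > > > > >   return Ordering::Equal;
> > > > > > > > }
> > > > > > > > 
> > > > > > > > bool isBetterCandidate(const Candidate &C1, const Candidate &C2) {
> > > > > > > >   switch (compareMultiversion(C1, C2)) {
> > > > > > > >   case Ordering::Better: return true;
> > > > > > > >   case Ordering::Worse:  return false;
> > > > > > > >   case Ordering::Equal:  break;  // fall through to later tie-breakers
> > > > > > > >   }
> > > > > > > >   // ... later tie-breakers (e.g. host/device preference) go here ...
> > > > > > > >   return false;
> > > > > > > > }
> > > > > > > > ```
> > > > > > > > 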
> > > > > > > > This general rule has been applied to cases other than multiversion; I think it should also apply to multiversion.
> > > > > > > 
> > > > > > > Well, but the multiversion people could say the same: that multiversioning is for picking an alternative among otherwise-identical functions, and HD and H functions are not otherwise-identical.
> > > > > > > 
> > > > > > > CC'ing @echristo for his thoughts on the right ordering here.
> > > > > > Adding @erichkeane here as well.
> > > > > > 
> > > > > > I think this makes sense, but I can see a reason to multiversion a function that will run on host and device. A version of some matrix multiply that takes advantage of 3 host architectures and one CUDA one? Am I missing something here?
> > > > > My understanding is that a multiversion function is for a specific CPU (or GPU). Let's say we want to have a function f for gfx900, gfx906, sandybridge, and ivybridge; shouldn't the declarations be more like
> > > > > 
> > > > > ```
> > > > > __host__ __attribute__((cpu_specific(sandybridge))) void f();
> > > > > __host__ __attribute__((cpu_specific(ivybridge))) void f();
> > > > > __device__ __attribute__((cpu_specific(gfx900))) void f();
> > > > > __device__ __attribute__((cpu_specific(gfx906))) void f();
> > > > > ```
> > > > > instead of all `__device__ __host__` functions?
> > > > IMO, it doesn't make sense for functions to be BOTH host and device; they'd have to be just one. Otherwise I'm not sure how the resolver behavior is supposed to work. The whole idea is that the definition is chosen at runtime.
> > > > 
> > > > Unless `__host__ __device__ void foo();` is TWO declaration chains (meaning two separate AST entries), it doesn't make sense to have multiversioning on it (and then, how it would be spelled is awkward/confusing to me).
> > > > 
> > > > In the above case, if those 4 declarations are not 2 separate root AST nodes, multiversioning won't work.
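> > > > 
> > > > For reference, the host-only shape I have in mind looks roughly like this (a sketch; I may be fudging the exact dispatcher spelling):
> > > > 
> > > > ```
> > > > // One dispatcher plus several variants of the same function; the
> > > > // resolver picks a definition at runtime based on the detected CPU.
> > > > __attribute__((cpu_dispatch(ivybridge, sandybridge, generic)))
> > > > void f(void) {}  // empty body: the compiler emits the resolver
> > > > 
> > > > __attribute__((cpu_specific(ivybridge)))
> > > > void f(void) { /* ivybridge-tuned code */ }
> > > > 
> > > > __attribute__((cpu_specific(sandybridge)))
> > > > void f(void) { /* sandybridge-tuned code */ }
> > > > 
> > > > __attribute__((cpu_specific(generic)))
> > > > void f(void) { /* fallback */ }
> > > > ```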
> > > There are certainly functions that ought to be usable from either host or device context — any inline function that just does ordinary language things should be in that category.  Also IIUC many declarations are *inferred* to be `__host__ __device__`, or can be mass-annotated with pragmas, and those reasons are probably the main ones this might matter — we might include a header in CUDA mode that declares a multi-versioned function, and we should handle it right.
> > > 
> > > My read of how CUDA programmers expect this to work is that they see the `__host__` / `__device__` attributes as primarily a mechanism for catching problems where you're using the wrong functions for the current configuration.  That is, while we allow overloading by `__host__`/`__device__`-ness, users expect those attributes to mostly be used as a filter for what's "really there" rather than really strictly segregating the namespace.  So I would say that CUDA programmers would probably expect the interaction with multiversioning to be:
> > > 
> > > - Programmers can put `__host__`, `__device__`, or both on a variant depending on where it was usable.
> > > - Dispatches should simply ignore any variants that aren't usable for the current configuration.
> > > 
> > > And specifically they would not expect e.g. a `__host__` dispatch function to only consider `__host__` variants — it should be able to dispatch to anything available, which is to say, it should also include `__host__ __device__` variants.  Similarly (and probably more usefully), a `__host__ __device__` dispatch function being compiled for the device should also consider pure `__device__` functions, and so on.
> > > 
> > > If we accept that, then I think it gives us a much better idea for how to resolve the priority of the overload rules.  The main impact of `isBetterMultiversionCandidate` is to try to ensure that we're looking at the `__attribute__((cpu_dispatch))` function instead of one of the `__attribute__((cpu_specific))` variants.  (It has no effect on `__attribute__((target))` multi-versioning, mostly because it doesn't need to: target-specific variants don't show up in lookup with `__attribute__((target))`.)  That rule should take precedence over the CUDA preference for exact matches, because e.g. if we're compiling this:
> > > 
> > > ```
> > > __host__ __device__ int magic(void) __attribute__((cpu_dispatch("...")));
> > > __host__ __device__ int magic(void) __attribute__((cpu_specific(generic)));
> > > __host__ int magic(void) __attribute__((cpu_specific(mmx)));
> > > __host__ int magic(void) __attribute__((cpu_specific(sse)));
> > > __device__ int magic(void) __attribute__((cpu_specific(some_device_feature)));
> > > __device__ int magic(void) __attribute__((cpu_specific(some_other_device_feature)));
> > > ```
> > > 
> > > then we don't want the compiler to prefer a CPU-specific variant over the dispatch function just because one of the variant was marked `__host__`.
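> > > 
> > > Concretely, with those declarations, a call site like this (hypothetical caller) is the case I'm worried about:
> > > 
> > > ```
> > > __host__ int caller() {
> > >   // If the CUDA exact-match preference were applied first, the __host__
> > >   // cpu_specific(mmx)/cpu_specific(sse) variants would win over the
> > >   // __host__ __device__ cpu_dispatch function here, which is backwards:
> > >   // the dispatch function is the one overload resolution should pick.
> > >   return magic();
> > > }
> > > ```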
> > It's a bit more complicated and a bit less straightforward than that. :-(  https://goo.gl/EXnymm
> > Handling of target attributes is where clang is very different from the NVCC, so no matter which mental model of "CUDA programmer" you pick, there's another one which will not match. 
> > 
> > In the existing code `__host__ __device__` is commonly used as a sledgehammer to work around NVCC's limitations. It does not allow attribute-based overloading, so the only way you can specialize a function for host/device is via something like this:
> > ```
> > __host__ __device__ void foo() {
> > #if __CUDA_ARCH__ > 0
> >  // GPU code
> > #else
> >  // CPU code.
> > #endif
> > }
> > ```
> > 
> > With clang you can write separate overloaded functions and we'll do our best to pick the one you meant to call. Alas, there are cases where it's ambiguous, and the result depends on the attributes of the caller as well as the callees. When something ends up being called from different contexts, interesting things start happening. With more functions becoming constexpr (those are implicitly HD), we'll be running into such impossible-to-do-the-right-thing situations more often. The only reliable way to avoid such ambiguity is to 'clone' HD functions into separate H & D functions and do overload resolution considering only same-side functions, which will, in effect, completely separate the host and device namespaces.
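> > 
> > For example, the attribute-overloaded spelling of the function above looks like this in clang (sketch):
> > 
> > ```
> > // Separate same-name overloads, distinguished only by host/device attributes.
> > __host__ void foo() {
> >   // CPU code.
> > }
> > __device__ void foo() {
> >   // GPU code.
> > }
> > 
> > // A __host__ __device__ (or constexpr) caller is compiled for both sides;
> > // in each compilation, overload resolution prefers the same-side foo().
> > __host__ __device__ void bar() { foo(); }
> > ```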
> > 
> > Run-time dispatch is also somewhat irrelevant to CUDA. Sort of. On one hand, kernel launch is already a form of runtime dispatch, only it's the CUDA runtime that does the dispatching, based on the GPU one attempts to run the kernel on. `__device__` functions are always compiled for the specific GPU variant. Also, GPU variants often have different instruction sets and can't be mixed together in the same object file at all, so there are no variants left to choose from once we're running the code: it's already compiled for precisely the GPU we're running on. Almost. Technically, GPUs in the same family do share the same instruction set, but I'm not sure runtime dispatch would buy us much there, as the hardware differences are relatively minor.
> > 
> > The only reliable way to avoid such ambiguity is to 'clone' HD functions into separate H & D functions and do overload resolution considering only same-side functions, which will, in effect, completely separate the host and device namespaces.
> 
> Okay.  Well, even if you completely split host and device functions, I think we'd still want to prefer dispatch functions over variant functions before preferring H over HD.
> 
> Although... I suppose we *do* want to consider H vs. HD before looking at the more arbitrary factors that `isBetterMultiversionCandidate` looks at, like the number of architectures in the dispatch.  Honestly, though, those just seem like bad rules that we should drop from the code.
> 
> > Run-time dispatch is also somewhat irrelevant to CUDA. Sort of. 
> 
> I understand that there's very little reason (or even ability) to use multiversioning in device code, but it can certainly happen in host code, right?  Still, I guess the easiest thing would just be to forbid multiversioned functions on the device.
I will change the precedence back so that multiversion takes priority over host/device.
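
To spell out the intended ordering in isBetterOverloadCandidate (a rough sketch from memory, not the exact diff; see the updated patch for the real code):

```
// Multiversion comparison (cpu_dispatch vs. cpu_specific) takes
// precedence again; with a tri-state result it could also report
// "Cand2 is better" instead of only "Cand1 is better".
if (isBetterMultiversionCandidate(Cand1, Cand2))
  return true;

// The CUDA host/device exact-match preference is applied only after
// that, as a later tie-breaker.
if (S.getLangOpts().CUDA && Cand1.Function && Cand2.Function) {
  FunctionDecl *Caller = dyn_cast<FunctionDecl>(S.CurContext);
  return S.IdentifyCUDAPreference(Caller, Cand1.Function) >
         S.IdentifyCUDAPreference(Caller, Cand2.Function);
}
```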


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D77954/new/

https://reviews.llvm.org/D77954




