[PATCH] D95560: [CUDA][HIP] Fix function scope static variable

Artem Belevich via Phabricator via cfe-commits cfe-commits at lists.llvm.org
Tue Feb 2 11:08:40 PST 2021


tra added inline comments.


================
Comment at: clang/lib/Sema/SemaDecl.cpp:7247-7250
+  // CUDA/HIP: Function-scope static variables in device or global functions
+  // have implicit device or constant attribute. Function-scope static variables
+  // in host device functions have implicit device or constant attribute in
+  // device compilation only.
----------------
yaxunl wrote:
> tra wrote:
> > This is somewhat confusing. I guess the issue is that we're conflating all the functionality implied by the `__device__` attribute and the `accessible on device` which is a subset of it. For the static vars in D functions you only need for it to be accessible on device, IMO. For HD functions, you do need the full `__device__` functionality, with host shadow and runtime registration.
> > 
> > While adding implicit `__device__` works for statics in the device-only functions, it's a bit of an overkill. It also gives us a somewhat different AST between host/device compilations.
> > 
> > Perhaps we can handle statics in device-only functions w/o adding implicit `__device__`. Can we check the parent of the variable instead when we check whether we're allowed to reference the variable? 
> Before we consider a function scope static variable without explicit device attribute, let's consider the difference between a static variable with explicit device attribute and a global device variable. They are both emitted in device compilation and have shadow variables in host compilation. The only difference is the linkage. A global device variable is supposed to be visible to other compilation units, whereas a static device variable is supposed to be visible to the same compilation unit only. A function scope static variable with device attribute has similar traits: It needs to be emitted in device compilation, and it needs a shadow variable in host compilation in case it needs to be accessed in host code. The only difference is that it is only visible inside the function.
> 
> Now let's consider a static var without a device attribute in a device function. From the sema and codegen point of view, it should be no different from a function scope static var with a device attribute. Adding an implicit device attribute would simplify its handling.
> 
> Now let's consider a static var without device attribute in a host device function. The following code is valid for both nvcc and cuda-clang:
> 
> ```
> int __device__ __host__ func(int x) {
>   static int a = 1;
>   return a + x;
> }
> ```
> This requires that the static variable be directly accessible in both device and host compilation. This requires that in device compilation, the static var behave like a static var with an explicit device attribute, whereas in host compilation, the static var behaves like a normal host static var. By adding an implicit device attribute, we can clearly distinguish these situations and reuse the sema and codegen logic of the device attribute.
> A function scope static variable with device attribute has similar traits: It needs to be emitted in device compilation, and it needs a shadow variable in host compilation in case it needs to be accessed in host code. 

This is the part I don't agree with. Can you give me an example of how a local variable in a `__device__` function can be accessed from host code?

One can't refer to local static vars from outside of the function, and even if the function returns the address, it will make no sense on the host side, because there's no reverse `device-address to host shadow` registration. I do not think we need a host shadow or registration for device-side local statics. What am I missing?
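To illustrate the point, a hypothetical sketch (not code from the patch; names are made up):

```
// Hypothetical sketch: a __device__ function can leak the address of its
// local static, but that address points into device memory. Without a
// reverse device-address-to-host-shadow mapping, the runtime has no way
// to turn it back into anything the host could use.
__device__ int *leak_counter_addr() {
  static int counter = 0;   // emitted in device memory only
  return &counter;          // a raw device address
}

__global__ void get_addr(int **out) {
  *out = leak_counter_addr();  // host would receive a device pointer it cannot dereference
}
```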

> Now let's consider a static var without a device attribute in a device function. From the sema and codegen point of view, it should be no different from a function scope static var with a device attribute. Adding an implicit device attribute would simplify its handling.

I agree that it makes things simpler. What I'm saying is that the simple solution comes with an overhead that's not needed. 

>
> ```
> int __device__ __host__ func(int x) {
>   static int a = 1;
>   return a + x;
> }
> ```
> 
> This requires that the static variable be directly accessible in both device and host compilation. 
> This requires that in device compilation, the static var behave like a static var with an explicit device attribute, 
> whereas in host compilation, the static var behaves like a normal host static var. 

I'm not sure I follow your reasoning. `directly accessible in both device and host compilation` would need an equivalent of the `__managed__` attribute. Regular `__device__` variables only allow the variable to have an address on the host side, which can then be translated into a device-side address by the runtime. The variable is only directly accessible from the device.
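For comparison, a minimal sketch of how a regular `__device__` variable is reached from each side (hypothetical example; `dev_var` and `bump` are made-up names):

```
__device__ int dev_var = 0;            // host gets a shadow + registration, not direct access

__global__ void bump() { ++dev_var; }  // direct access is device-only

int main() {
  bump<<<1, 1>>>();
  int host_copy = 0;
  // The host-side shadow only serves as a handle for the runtime;
  // reading the value requires an explicit copy from the device symbol.
  cudaMemcpyFromSymbol(&host_copy, dev_var, sizeof(host_copy));
  return 0;
}
```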

> By adding implicit device attribute, we can clearly distinguish these situations and reuse the sema and codegen logic of device attribute.

While this approach does remove that shadow+registration overhead, it does not give both host and device access to the same variable, and it creates more divergence between the host and device ASTs, which I'd prefer to avoid if possible.

To summarize, we appear to agree on what we want in the end -- a variable accessible on its respective side only, without the overhead of the shadow and registration. What we disagree on is how to implement it.
Your approach is to add the `__device__` attribute during device-side compilation only, which reuses the parts of the functionality that happen to do the right thing in each individual compilation, at the price of AST divergence.
I think that AST divergence should be avoided and that we should have a uniform way of handling local static vars on both sides.

Also, we'll need to figure out and document how static vars are expected to work in HD functions. Should they be implicitly `__managed__`? That would be the most intuitively sensible thing, but it's not going to work in CUDA, as we don't support `__managed__` yet.

We could explicitly say that the host and the device each have their own instance of the local static variable. That's sort of how it works in practice now, but it deviates from what a user would expect of a static var. It's probably a more natural fit for the CUDA/HIP programming model in general. E.g. consider that we may be running on more than one GPU. In order for a static var to work for all GPUs and the host, it would have to live on the host and then be memory-mapped on each device. I'm not sure if `__managed__` can handle that in principle for CUDA. The each-carries-its-own approach is more consistent -- that's how we treat global variables anyway.
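Under that reading, something like the following hypothetical example would give the host and the device an independent variable each (names are made up for illustration):

```
// Hypothetical example of the "each side carries its own instance" model:
// the host and device compilations each see their own `a`, so increments
// on one side are never observed on the other.
__host__ __device__ int bump() {
  static int a = 0;   // one independent instance per side under this model
  return ++a;
}

__global__ void device_bump(int *out) { *out = bump(); }  // touches the device instance

int main() {
  bump();   // touches the host instance; kernel launches don't affect it
  return 0;
}
```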





CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D95560/new/

https://reviews.llvm.org/D95560


