[PATCH] D73979: [HIP] Allow non-incomplete array type for extern shared var

Tue Feb 4 14:05:45 PST 2020

tra added a subscriber: rsmith.
tra added a comment.

In D73979#1857728 <https://reviews.llvm.org/D73979#1857728>, @yaxunl wrote:

> All extern shared vars are sharing the same address, however, they may be used as different types in different functions.
>
> For example,
>
>   __device__ int foo() {
>     extern __shared__ int a;
>     for (...) a+=...;
>     return a;
>   }
>  
>   __device__ double bar(int x) {
>     extern __shared__ double b[10];
>     for(...) b[x]+=...;
>     return b[0];
>   }
>  
>   __global__ void k() {
>     foo();
>     //...
>     bar();
>   }
>  
>

I do agree that it is possible to use `extern __shared__` given enough care. My point is that as a feature it is ill-designed, very easy to misuse and creates more problems than it's worth, especially in non-trivial code.
As an illustration, what if bar needs to call foo() and foo lives in a header file somewhere else? Whoever implements foo must make sure that nothing else in the transitive call chain uses `extern __shared__`. That's hard to guarantee in practice and it's very easy to introduce new dependencies without even being aware of them. I.e. via an intermediate function which is not aware that the caller and callee have this restriction.

There are no compiler checks to warn you about it, you you will only know about the problem when you encounter data corruption at runtime and in machine learning applications that may go unnoticed for a long time.

> In one function foo, users need to use the shared memory as an int. In another function, users need to use the shared memory as a double array. Users just need to make sure they request sufficient dynamic shared memory in triple chevron to be greater than the max dynamic shared memory usage.

Again, that requires complete knowledge of who uses this construct. Without compiler's help that's hard to guarantee outside of simple use cases. I can not imagine using it as is in something like thrust or eigen. In fact, thrust does provide `extern_shared_ptr` specifically to serve the same kind of API that I proposed.

> Users do not need to pass values in extern shared var between functions. They just treat it as an uninitialized variable.

I'm OK with that, but they *do* need to make it explicit that they are dealing with externally allocated memory and they do need to pass something to identify which chunk of that memory they operate upon. It may be the pointer, or it may be an offset to be used relative to an `extern __shared__` base.

> Forbidding different types for extern shared variable does not add any benefit, just forcing users to work around the limitation and resulting in less readable code.

That's where we disagree. I believe that in this case it would be a net benefit to be even more restrictive with `extern __shared__` than we are right now and get users to explicitly treat `extern __shared__` as externally allocated memory.

It does not take all that much code to make it work and it does result in more robust code. E.g. your example can be rewritten like this:

  // sprinkle static_casts as necessary.
  void *get_shmem(size_t offset){
    extern __shared__ char shmem[];
    return  &shmem[offset];
  }

  __device__ int foo(int *a) {
    for (...) *a+=...;
    return *a;
  }

  __device__ double bar(double *bx) {
    for(...) *bx+=...;
    return *bx;
  }

  __global__ void k() {
    foo(get_shmem(0));
    //...
    bar(get_shmem(0));
  }

Net benefits that I see:

- functions are composable now -- one can call foo from bar and vice versa and ensure they don't step on each other's toes.
- it's clear that they do operate on the same buffer, when called from `k` -- arguably that's the place where it matters.
- It's easy to change if you need them to work on different sub-buffers.
- foo/bar are not limited to working on shared memory only
- foo/bar can execute in diverged branches, if given non-overlapping buffers. The original example would potentially fail in interesting ways if `k` does something like this:

  __global__ void k() {
    if (threadIdx.x < 16)
       foo();
    else 
       bar();
  }

It's much easier not to open this can of works than clean it up afterwards when you grow more users that depend on it.
I don't think it should be enabled for CUDA, and don't think that it would be a good idea for HIP, either.
Perhaps we need a third opinion from someone with a broader perspective.

@rsmith -- do you have an opinion on what should be done with a tactically useful, but strategically unsound features in general and this CUDA-specific oddity specifically?

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D73979/new/

https://reviews.llvm.org/D73979