[PATCH] D91516: [AMDGPU] Support for device scope shared variables

Sun Nov 15 23:09:24 PST 2020

hsmhsm created this revision.
hsmhsm added reviewers: b-sumner, t-tye, rampitec, arsenm, yaxunl.
Herald added subscribers: llvm-commits, kerbowa, hiraditya, tpr, dstuttard, mgorny, nhaehnle, jvesely, kzhuravl.
Herald added a project: LLVM.
hsmhsm requested review of this revision.
Herald added subscribers: sstefan1, wdng.
Herald added a reviewer: jdoerfert.

[1]. What does this module pass is about?

This module pass is about an "indirect" method for supporting the "shared"
variables which are defined within "device" functions. Programmatically
speaking, a "hacked" way of supporting device scoped shared variables. Yes,
it is a hacked way, because, it has been become very complicated to support
device scoped shared variables due to below "main" reason:

A shared variable is a "block" scoped variable, but it's lifetime is same as
the "work-group" to which it belongs, which necessitates to keep track of
multiple copies of shared variables related to different work-groups from
different running kernels, which is too costly and too complex to implement
given the "scarcity" of the shared memory and "strange properties" of the
shared variables.

[2]. What does this pass do?

Though, programmer define deviced scoped shared variables within device
functions, this module pass,

  A. internally pushes all those deviced scoped shared variables within the
     associated kernel(s),
  B. appropriately inserts new arguments within in the call graph paths from
     kernel(s) to device function(s), and
  C. implements necessary program transformations.

[3]. How does the implementation of the pass look like in brief?

At a very high level, implementation of this pass can be described as below:

  A. For every kernel, traverse through it's call graph, and collect the
     direct shared variables which are defined within the kernel and
     indirect shared variables which are defined within the device functions
     which appear in the kernel's call graph.
  B. Create a single big shared memory layout within the kernel by combining
     all the direct and indirect shared variables which are collected above.
  C. Map each direct and indirect shared variable to it's "offset" in this
     big shared memory layout.
  D. Pass the "offsets" to (indirect) shared variables as new function
     arguments along the call graph paths from kernel to device functions
     within which the original shared variables are defined.
  E. Replace all the references to original shared variables by their offset
     counterparts.
  F. Finally, remove all the original shared variables.

[4]. What are the positive consequences of this pass on the applications?

Is one really exist?

[5]. What are the negative consequences of this pass on the applications?

  A. Since we add new arguments to the functions along the call graph paths
     from kernels to device functions, there is a posibility of increased
     register pressure, which may affect performance.
  B. Since we create single big shared memory layouts within kernels, we
     land-up duplicating share memory of indirect shared variables within
     kernels, however, it is still less costlier compare to the direct
     support.
  C. This implementation is indeed a very careful hack, and hence any bug in
     the implementation may have some adverse effect on running application.

[6]. What are some important current limitations of the pass?

  A. At present, shared variables of integer types, floating-point types,
     shared variable arrays of integer types and floating-point types are
     supported, other aggregate types like struct are not yet supprted.
  B. The implementation assumes that there are no recursive calls to device
     functions (both direct or indirect), and hence call graphs associated
     with kernels are acyclic.
  C. All TODOs need to be revisited sooner than later.

[7]. An Example.

Before Pass:

  __device__ void foo() {
     __shared__ char smc[10];
     __shared__ int smi[10];
     __shared__ float smf[10];
  }
  __global__ void kernel() {
    foo();
  }

After Pass:

  __device__ void foo(char *smc, int *smi, float *smf) {
  }
  __global__ void kernel() {
    __shared__ char sm[90];  assuming char occupies 1 byte, int occupies
                             4 bytes, and float occupies 4 bytes.
    foo((char*)sm, (int*)(sm + 10), (float*)(sm + 50));
  }

NOTE: This pass is disabled by default, and enabled with the AMDGPU back-end
      option `--amdgpu-enable-device-scope-shared-variable=true`.

Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D91516

Files:
  llvm/lib/Target/AMDGPU/AMDGPU.h
  llvm/lib/Target/AMDGPU/AMDGPUAlwaysInlinePass.cpp
  llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp
  llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
  llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.h
  llvm/lib/Target/AMDGPU/CMakeLists.txt
  llvm/test/CodeGen/AMDGPU/device-scope-lds-test-deep-function-calls.ll
  llvm/test/CodeGen/AMDGPU/device-scope-lds-test-lds-2d-array.ll
  llvm/test/CodeGen/AMDGPU/device-scope-lds-test-lds-with-different-data-types.ll
  llvm/test/CodeGen/AMDGPU/device-scope-lds-test-lds-within-function.ll
  llvm/test/CodeGen/AMDGPU/device-scope-lds-test-lds-within-kernel-and-function.ll
  llvm/test/CodeGen/AMDGPU/device-scope-lds-test-lds-within-kernel.ll
  llvm/test/CodeGen/AMDGPU/device-scope-lds-test-two-lds-arguments.ll

-------------- next part --------------
A non-text attachment was scrubbed...
Name: D91516.305415.patch
Type: text/x-patch
Size: 97669 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20201116/81e34995/attachment-0001.bin>