[all-commits] [llvm/llvm-project] edf5ca: [mlir][gpu] Support Cluster of Thread Blocks in `g...

Mon Nov 27 02:05:21 PST 2023

  Branch: refs/heads/main
  Home:   https://github.com/llvm/llvm-project
  Commit: edf5cae7391cdb097a090ea142dfa7ac6ac03555
      https://github.com/llvm/llvm-project/commit/edf5cae7391cdb097a090ea142dfa7ac6ac03555
  Author: Guray Ozen <guray.ozen at gmail.com>
  Date:   2023-11-27 (Mon, 27 Nov 2023)

  Changed paths:
    M mlir/include/mlir/Dialect/GPU/IR/GPUOps.td
    M mlir/lib/Conversion/GPUCommon/GPUToLLVMConversion.cpp
    M mlir/lib/Conversion/GPUToNVVM/LowerGpuOpsToNVVMOps.cpp
    M mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
    M mlir/lib/Dialect/GPU/IR/InferIntRangeInterfaceImpls.cpp
    M mlir/lib/ExecutionEngine/CudaRuntimeWrappers.cpp
    M mlir/lib/Target/LLVMIR/Dialect/GPU/SelectObjectAttr.cpp
    M mlir/test/Conversion/GPUCommon/lower-launch-func-to-gpu-runtime-calls.mlir
    M mlir/test/Dialect/GPU/invalid.mlir
    M mlir/test/Dialect/GPU/ops.mlir
    A mlir/test/Integration/GPU/CUDA/sm90/cga_cluster.mlir
    M mlir/test/Target/LLVMIR/gpu.mlir

  Log Message:
  -----------
  [mlir][gpu] Support Cluster of Thread Blocks in `gpu.launch_func` (#72871)

The NVIDIA Hopper architecture introduced the Cooperative Group Array
(CGA), a new level of parallelism that groups Cooperative Thread Arrays
(CTAs) into clusters. The CTAs of a cluster run concurrently and can
synchronize and communicate through shared memory.

This PR adds CGA support to the GPU dialect by extending
`gpu.launch_func` with an optional cluster size.

The GPU dialect remains architecture-agnostic, so we've added CGA
functionality as optional parameters. Extending `gpu.launch_func` lets
us reuse mechanisms the GPU dialect already has, such as outlining and
kernel launching, which makes it a practical and convenient choice.

An example of this implementation can be seen below:

```
gpu.launch_func @kernel_module::@kernel
                clusters in (%1, %0, %0) // <-- Optional
                blocks in (%0, %0, %0)
                threads in (%0, %0, %0)
```

The PR also introduces index and dimension ops specific to clusters and
lowers them to NVVM ops:

```
%cidX = gpu.cluster_id x
%cidY = gpu.cluster_id y
%cidZ = gpu.cluster_id z

%cdimX = gpu.cluster_dim x
%cdimY = gpu.cluster_dim y
%cdimZ = gpu.cluster_dim z
```
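
To make the cluster indexing concrete, here is a small host-side Python model of how a block's position in the grid splits into a cluster index and a position inside that cluster. It is only a sketch of the CUDA/PTX cluster model as I understand it (grid size given in blocks, each grid dimension a multiple of the cluster shape). The names `Dim3` and `cluster_coords` are hypothetical and not part of MLIR, and this message does not spell out which of these quantities each new op returns.

```python
from typing import NamedTuple, Tuple


class Dim3(NamedTuple):
    """Hypothetical 3-D index or extent (x, y, z)."""
    x: int
    y: int
    z: int


def cluster_coords(block_id: Dim3, grid_dim: Dim3,
                   cluster_shape: Dim3) -> Tuple[Dim3, Dim3, Dim3]:
    """Split a block index into (cluster index, block index within the
    cluster, number of clusters in the grid).

    Assumption: grid_dim counts blocks and must be a multiple of
    cluster_shape in every dimension.
    """
    for b, g, c in zip(block_id, grid_dim, cluster_shape):
        if c <= 0 or g <= 0 or g % c != 0:
            raise ValueError("grid_dim must be a positive multiple of cluster_shape")
        if not 0 <= b < g:
            raise ValueError("block_id is outside the grid")

    # Integer division selects the cluster; the remainder is the
    # block's position inside that cluster.
    cluster_id = Dim3(*(b // c for b, c in zip(block_id, cluster_shape)))
    block_in_cluster = Dim3(*(b % c for b, c in zip(block_id, cluster_shape)))
    clusters_in_grid = Dim3(*(g // c for g, c in zip(grid_dim, cluster_shape)))
    return cluster_id, block_in_cluster, clusters_in_grid
```

For instance, with a grid of 8x1x1 blocks and clusters of 2x1x1 blocks, block (5, 0, 0) falls in cluster (2, 0, 0) at position (1, 0, 0), and the grid holds 4x1x1 clusters.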

We will add cluster support to the `gpu.launch` op in an upcoming PR.

See [the
documentation](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#cluster-of-cooperative-thread-arrays)
provided by NVIDIA for details.