[all-commits] [llvm/llvm-project] 17649a: [MLIR][NVGPU] Introduce `nvgpu.mbarrier.group` for...

Guray Ozen via All-commits all-commits at lists.llvm.org
Fri Sep 22 08:09:57 PDT 2023


  Branch: refs/heads/main
  Home:   https://github.com/llvm/llvm-project
  Commit: 17649a7726d3ce1ddba2bbf3ef73af03ea204753
      https://github.com/llvm/llvm-project/commit/17649a7726d3ce1ddba2bbf3ef73af03ea204753
  Author: Guray Ozen <guray.ozen at gmail.com>
  Date:   2023-09-22 (Fri, 22 Sep 2023)

  Changed paths:
    M mlir/include/mlir/Conversion/NVGPUToNVVM/NVGPUToNVVM.h
    M mlir/include/mlir/Dialect/NVGPU/IR/NVGPU.td
    M mlir/lib/Conversion/NVGPUToNVVM/NVGPUToNVVM.cpp
    M mlir/lib/Dialect/NVGPU/TransformOps/NVGPUTransformOps.cpp
    M mlir/test/Conversion/NVGPUToNVVM/nvgpu-to-nvvm.mlir
    M mlir/test/Dialect/NVGPU/tmaload-transform.mlir

  Log Message:
  -----------
  [MLIR][NVGPU] Introduce `nvgpu.mbarrier.group` for multiple mbarrier use  (#65951)

A common practice is to create multiple `mbarrier` objects; see the
example below. This is particularly valuable in scenarios like software
pipelining for GEMM, where we need to create multiple barriers and then
arrive on and wait for them dynamically in a loop.

This PR evolves the `nvgpu.mbarrier.barrier` type into
`nvgpu.mbarrier.group`. All `mbarrier`-related Ops now use this type,
and are consequently capable of managing multiple barriers seamlessly.
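
To illustrate the change, here is a rough before/after sketch; the exact
type spellings are abbreviated, and the old single-barrier form is shown
only for contrast:

```
// Before (sketch): one type instance describes a single mbarrier object.
%barrier  = nvgpu.mbarrier.create -> !nvgpu.mbarrier.barrier<3>

// After: the same create produces a group that can hold several barriers,
// addressed by an index as shown in the example below.
%barriers = nvgpu.mbarrier.create -> !nvgpu.mbarrier.group<3, num_barriers = 4>
```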

Keeping `num_barriers` static (here `num_barriers = 4`) lets us place the
mbarrier object(s) in static shared memory. The value could be made
dynamic, but that would require dynamic shared memory and would
complicate the codegen.

```
%barriers = nvgpu.mbarrier.create -> !nvgpu.mbarrier.group<3, num_barriers = 4>
nvgpu.mbarrier.init %barriers[%c0], %num_threads : !nvgpu.mbarrier.group<3, num_barriers = 4>
nvgpu.mbarrier.init %barriers[%c1], %num_threads : !nvgpu.mbarrier.group<3, num_barriers = 4>
nvgpu.mbarrier.init %barriers[%c2], %num_threads : !nvgpu.mbarrier.group<3, num_barriers = 4>
nvgpu.mbarrier.init %barriers[%c3], %num_threads : !nvgpu.mbarrier.group<3, num_barriers = 4>
...
scf.for %i = %c0 to %n step %c1 {
  nvgpu.mbarrier.try_wait %barriers[ (i % 4) ] ... 

  // ... Do work once mbarrier is ready 

  nvgpu.mbarrier.arrive.expect_tx %barriers[ ((i + 3) % 4) ] ...
}
```
This results in the following mbarrier usage pattern:
```
expect_tx[0]
expect_tx[1]
expect_tx[2]
Loop:
 try_wait mbarrier[0], expect_tx[3]
 try_wait mbarrier[1], expect_tx[0]
 try_wait mbarrier[2], expect_tx[1]
 try_wait mbarrier[3], expect_tx[2]
...
```
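
For illustration, the wrap-around slot indices used in the loop above
could be materialized with standard `arith` ops before indexing into the
group. This is only a sketch: `%i` is assumed to be the loop induction
variable of `index` type, the constant SSA names are hypothetical, and
the mbarrier op operands are abbreviated as in the pseudocode above.

```
%c3 = arith.constant 3 : index
%c4 = arith.constant 4 : index
// Slot waited on in iteration i:      i % 4
%wait_slot = arith.remui %i, %c4 : index
// Slot armed for a later iteration:   (i + 3) % 4
%sum     = arith.addi %i, %c3 : index
%tx_slot = arith.remui %sum, %c4 : index
// %barriers[%wait_slot] / %barriers[%tx_slot] then select the mbarrier used
// by nvgpu.mbarrier.try_wait and nvgpu.mbarrier.arrive.expect_tx respectively.
```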
