[all-commits] [llvm/llvm-project] 52db7e: [mlir][nvgpu] Improve `WarpgroupAccumulator` type ...

Guray Ozen via All-commits all-commits at lists.llvm.org
Tue Oct 17 02:47:00 PDT 2023


  Branch: refs/heads/main
  Home:   https://github.com/llvm/llvm-project
  Commit: 52db7e27458f774fa0c6c6a864ce197fa071a230
      https://github.com/llvm/llvm-project/commit/52db7e27458f774fa0c6c6a864ce197fa071a230
  Author: Guray Ozen <guray.ozen at gmail.com>
  Date:   2023-10-17 (Tue, 17 Oct 2023)

  Changed paths:
    M mlir/include/mlir/Dialect/NVGPU/IR/NVGPU.td
    M mlir/include/mlir/Dialect/NVGPU/IR/NVGPUDialect.h
    M mlir/lib/Conversion/NVGPUToNVVM/NVGPUToNVVM.cpp
    M mlir/lib/Dialect/NVGPU/IR/NVGPUDialect.cpp
    M mlir/lib/Dialect/NVGPU/TransformOps/NVGPUTransformOps.cpp
    M mlir/test/Conversion/NVGPUToNVVM/nvgpu-to-nvvm.mlir
    M mlir/test/Dialect/NVGPU/invalid.mlir

  Log Message:
  -----------
  [mlir][nvgpu] Improve `WarpgroupAccumulator` type to simplify IR  (#68728)

`WarpgroupAccumulator` (or `!nvgpu.warpgroup.accumulator`) is a type
that holds the accumulator matrix used by warp-group level matrix
multiplication. A dedicated type is handy here because the matrix is
distributed among the threads of the warp-group. However, the current
transformations require creating and using multiple
`WarpgroupAccumulator` values when the GEMM shape is larger than the
shape supported by a single `wgmma.mma_async` instruction (for example,
a 128x128 f32 GEMM needs two 64x128 fragments). This makes the IR look
dense.

This PR improves the transformations around the `WarpgroupAccumulator`
type in every nvgpu op that uses it, so a single value can carry the
whole accumulator.

**Example: Current GEMM in NVGPU-IR**
```
// Init
%m1, %m2 = nvgpu.warpgroup.mma.init.accumulator ->  
                    !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>,
                    !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>

// GEMM
%r1, %r2 = nvgpu.warpgroup.mma %descA, %descB, %m1, %m2 {transposeB}: 
      !nvgpu.warpgroup.descriptor<tensor = memref<128x64xf16, 3>>, 
      !nvgpu.warpgroup.descriptor<tensor = memref<64x128xf16, 3>>, 
      !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>,
      !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>> 
      -> 
      !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>,
      !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>  


// Epilogue 
nvgpu.warpgroup.mma.store [%r1, %r2] to %sharedMemoryBuffer
  : !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>, 
    !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>
    into memref<128x128xf32,3>
```

**Example: The same GEMM in NVGPU-IR after this PR**
```
// Init
%m = nvgpu.warpgroup.mma.init.accumulator ->  
           !nvgpu.warpgroup.accumulator<fragmented = vector<128x128xf32>>

// GEMM
%r1 = nvgpu.warpgroup.mma %descA, %descB, %m {transposeB}: 
      !nvgpu.warpgroup.descriptor<tensor = memref<128x64xf16, 3>>, 
      !nvgpu.warpgroup.descriptor<tensor = memref<64x128xf16, 3>>, 
      !nvgpu.warpgroup.accumulator<fragmented = vector<128x128xf32>> 
      -> 
      !nvgpu.warpgroup.accumulator<fragmented = vector<128x128xf32>>  

// Epilogue 
nvgpu.warpgroup.mma.store %r1 to %sharedMemoryBuffer
  : !nvgpu.warpgroup.accumulator<fragmented = vector<128x128xf32>>
    into memref<128x128xf32,3>
    into memref<128x128xf32,3>
```
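For intuition, here is a minimal, standalone C++ sketch of the split that the
lowering now performs internally instead of exposing in the IR. The tile-shape
constants are illustrative assumptions taken from the 64x128 fragments shown
above, not values read from NVGPUToNVVM.cpp:

```
// Hedged sketch (not the actual conversion pattern): enumerate the
// wgmma-sized fragments that one large 128x128 accumulator decomposes into.
#include <cstdint>
#include <cstdio>

int main() {
  constexpr int64_t kWgmmaM = 64;   // assumed max rows per wgmma.mma_async tile
  constexpr int64_t kWgmmaN = 128;  // assumed max cols per wgmma.mma_async tile
  constexpr int64_t gemmM = 128, gemmN = 128;  // GEMM shape from the example

  // Each iteration corresponds to one 64x128 fragment that the old IR had to
  // spell out as a separate !nvgpu.warpgroup.accumulator value.
  for (int64_t m = 0; m < gemmM; m += kWgmmaM)
    for (int64_t n = 0; n < gemmN; n += kWgmmaN)
      std::printf("fragment: rows [%lld, %lld), cols [%lld, %lld)\n",
                  static_cast<long long>(m),
                  static_cast<long long>(m + kWgmmaM),
                  static_cast<long long>(n),
                  static_cast<long long>(n + kWgmmaN));
  return 0;
}
```

For the 128x128 example this prints two fragments, matching the two
`!nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>` values the
old IR had to carry explicitly.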



