[all-commits] [llvm/llvm-project] 52db7e: [mlir][nvgpu] Improve `WarpgroupAccumulator` type ...
Guray Ozen via All-commits
all-commits at lists.llvm.org
Tue Oct 17 02:47:00 PDT 2023
Branch: refs/heads/main
Home: https://github.com/llvm/llvm-project
Commit: 52db7e27458f774fa0c6c6a864ce197fa071a230
https://github.com/llvm/llvm-project/commit/52db7e27458f774fa0c6c6a864ce197fa071a230
Author: Guray Ozen <guray.ozen at gmail.com>
Date: 2023-10-17 (Tue, 17 Oct 2023)
Changed paths:
M mlir/include/mlir/Dialect/NVGPU/IR/NVGPU.td
M mlir/include/mlir/Dialect/NVGPU/IR/NVGPUDialect.h
M mlir/lib/Conversion/NVGPUToNVVM/NVGPUToNVVM.cpp
M mlir/lib/Dialect/NVGPU/IR/NVGPUDialect.cpp
M mlir/lib/Dialect/NVGPU/TransformOps/NVGPUTransformOps.cpp
M mlir/test/Conversion/NVGPUToNVVM/nvgpu-to-nvvm.mlir
M mlir/test/Dialect/NVGPU/invalid.mlir
Log Message:
-----------
[mlir][nvgpu] Improve `WarpgroupAccumulator` type to simplify IR (#68728)
`WarpgroupAccumulator` (or `!nvgpu.warpgroup.accumulator`) is a type
that holds the accumulator matrix used by warp-group-level matrix
multiplication. A dedicated type is handy because the matrix is
distributed among the threads of the warp group. However, the current
transformations require creating and using multiple
`WarpgroupAccumulator` values whenever the GEMM shape is larger than the
shape supported by a single `wgmma.mma_async` instruction, which makes
the IR dense.
This PR improves the handling of the `WarpgroupAccumulator` type in
every nvgpu op that uses it, so that a single value can represent the
whole accumulator.
**Example: GEMM in NVGPU IR before this PR**
```
// Init
%m1, %m2 = nvgpu.warpgroup.mma.init.accumulator ->
!nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>,
!nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>
// GEMM
%r1, %r2 = nvgpu.warpgroup.mma %descA, %descB, %m1, %m2 {transposeB}:
!nvgpu.warpgroup.descriptor<tensor = memref<128x64xf16, 3>>,
!nvgpu.warpgroup.descriptor<tensor = memref<64x128xf16, 3>>,
!nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>,
!nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>
->
!nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>,
!nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>
// Epilogue
nvgpu.warpgroup.mma.store [%r1, %r2] to %sharedMemoryBuffer
: !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>,
!nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>
into memref<128x128xf32,3>
```
**Example: Simplified IR after this PR**
```
// Init
%m = nvgpu.warpgroup.mma.init.accumulator ->
!nvgpu.warpgroup.accumulator<fragmented = vector<128x128xf32>>
// GEMM
%r1 = nvgpu.warpgroup.mma %descA, %descB, %m {transposeB}:
!nvgpu.warpgroup.descriptor<tensor = memref<128x64xf16, 3>>,
!nvgpu.warpgroup.descriptor<tensor = memref<64x128xf16, 3>>,
!nvgpu.warpgroup.accumulator<fragmented = vector<128x128xf32>>
->
!nvgpu.warpgroup.accumulator<fragmented = vector<128x128xf32>>
// Epilogue
nvgpu.warpgroup.mma.store %r1 to %sharedMemoryBuffer
: !nvgpu.warpgroup.accumulator<fragmented = vector<128x128xf32>>
into memref<128x128xf32,3>
```
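The single-accumulator form also composes more naturally with loop-carried GEMM mainloops. Below is a minimal sketch (not part of this commit) of a hypothetical K-loop written against the simplified ops; the `scf.for` structure, the loop bounds `%c0`/`%cK`/`%c64`, and the per-tile descriptors `%descA` and `%descB` are assumptions shown only to illustrate that one accumulator value, rather than a pair, is threaded through the loop.
```
// Hypothetical sketch, not from this commit: a K-loop GEMM mainloop.
// Initialize a single accumulator for the whole 128x128 result tile.
%init = nvgpu.warpgroup.mma.init.accumulator ->
  !nvgpu.warpgroup.accumulator<fragmented = vector<128x128xf32>>
// Carry one accumulator value through the loop; before this PR, a
// 128x128 GEMM needed a pair of iter_args here.
%acc = scf.for %k = %c0 to %cK step %c64 iter_args(%iter = %init)
    -> (!nvgpu.warpgroup.accumulator<fragmented = vector<128x128xf32>>) {
  // %descA and %descB are assumed to be recomputed per K-tile.
  %d = nvgpu.warpgroup.mma %descA, %descB, %iter {transposeB} :
    !nvgpu.warpgroup.descriptor<tensor = memref<128x64xf16, 3>>,
    !nvgpu.warpgroup.descriptor<tensor = memref<64x128xf16, 3>>,
    !nvgpu.warpgroup.accumulator<fragmented = vector<128x128xf32>>
    -> !nvgpu.warpgroup.accumulator<fragmented = vector<128x128xf32>>
  scf.yield %d : !nvgpu.warpgroup.accumulator<fragmented = vector<128x128xf32>>
}
// Epilogue: a single store writes the whole result.
nvgpu.warpgroup.mma.store %acc to %sharedMemoryBuffer
  : !nvgpu.warpgroup.accumulator<fragmented = vector<128x128xf32>>
  into memref<128x128xf32,3>
```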