[all-commits] [llvm/llvm-project] 14d79a: [mlir][NVGPU] nvgpu.mmasync on F32 through TF32
Manish Gupta via All-commits
all-commits at lists.llvm.org
Mon Aug 1 16:24:48 PDT 2022
Branch: refs/heads/main
Home: https://github.com/llvm/llvm-project
Commit: 14d79afeae63d78de9483f750fafaba13c7ae2dc
https://github.com/llvm/llvm-project/commit/14d79afeae63d78de9483f750fafaba13c7ae2dc
Author: Manish Gupta <manigupta at google.com>
Date: 2022-08-01 (Mon, 01 Aug 2022)
Changed paths:
M mlir/include/mlir/Dialect/NVGPU/IR/NVGPU.td
M mlir/include/mlir/Dialect/NVGPU/Transforms/Transforms.h
M mlir/lib/Conversion/NVGPUToNVVM/NVGPUToNVVM.cpp
M mlir/lib/Conversion/VectorToGPU/VectorToGPU.cpp
M mlir/lib/Dialect/NVGPU/IR/NVGPUDialect.cpp
M mlir/lib/Dialect/NVGPU/Transforms/CMakeLists.txt
A mlir/lib/Dialect/NVGPU/Transforms/MmaSyncTF32Transform.cpp
M mlir/test/Conversion/NVGPUToNVVM/nvgpu-to-nvvm.mlir
M mlir/test/Dialect/NVGPU/invalid.mlir
A mlir/test/Dialect/NVGPU/mma-sync-f32-to-tf32.mlir
A mlir/test/Dialect/NVGPU/mma-sync-f32-to-tf32x3.mlir
M mlir/test/lib/Dialect/CMakeLists.txt
A mlir/test/lib/Dialect/NVGPU/CMakeLists.txt
A mlir/test/lib/Dialect/NVGPU/TestNVGPUTransforms.cpp
M mlir/tools/mlir-opt/CMakeLists.txt
M mlir/tools/mlir-opt/mlir-opt.cpp
Log Message:
-----------
[mlir][NVGPU] nvgpu.mmasync on F32 through TF32
Adds optional attribute to support tensor cores on F32 datatype by lowering to `mma.sync` with TF32 operands. Since, TF32 is not a native datatype in LLVM we are adding `tf32Enabled` as an attribute to allow the IR to be aware of `MmaSyncOp` datatype. Additionally, this patch adds placeholders for nvgpu-to-nvgpu transformation targeting higher precision tf32x3.
For mma.sync on f32 input using tensor cores there are two possibilites:
(a) tf32 (1 `mma.sync` per warp-level matrix-multiply-accumulate)
(b) tf32x3 (3 `mma.sync` per warp-level matrix-multiply-accumulate)
Typically, tf32 tensor core acceleration comes at a cost of accuracy from missing precision bits. While f32 has 23 precision bits, tf32 has only 10 precision bits. tf32x3 aims to recover the precision bits by splitting each operand into two tf32 values and issue three `mma.sync` tensor core operations.
Reviewed By: ThomasRaoux
Differential Revision: https://reviews.llvm.org/D130294
More information about the All-commits
mailing list