[all-commits] [llvm/llvm-project] 14d79a: [mlir][NVGPU] nvgpu.mmasync on F32 through TF32

Mon Aug 1 16:24:48 PDT 2022

  Branch: refs/heads/main
  Home:   https://github.com/llvm/llvm-project
  Commit: 14d79afeae63d78de9483f750fafaba13c7ae2dc
      https://github.com/llvm/llvm-project/commit/14d79afeae63d78de9483f750fafaba13c7ae2dc
  Author: Manish Gupta <manigupta at google.com>
  Date:   2022-08-01 (Mon, 01 Aug 2022)

  Changed paths:
    M mlir/include/mlir/Dialect/NVGPU/IR/NVGPU.td
    M mlir/include/mlir/Dialect/NVGPU/Transforms/Transforms.h
    M mlir/lib/Conversion/NVGPUToNVVM/NVGPUToNVVM.cpp
    M mlir/lib/Conversion/VectorToGPU/VectorToGPU.cpp
    M mlir/lib/Dialect/NVGPU/IR/NVGPUDialect.cpp
    M mlir/lib/Dialect/NVGPU/Transforms/CMakeLists.txt
    A mlir/lib/Dialect/NVGPU/Transforms/MmaSyncTF32Transform.cpp
    M mlir/test/Conversion/NVGPUToNVVM/nvgpu-to-nvvm.mlir
    M mlir/test/Dialect/NVGPU/invalid.mlir
    A mlir/test/Dialect/NVGPU/mma-sync-f32-to-tf32.mlir
    A mlir/test/Dialect/NVGPU/mma-sync-f32-to-tf32x3.mlir
    M mlir/test/lib/Dialect/CMakeLists.txt
    A mlir/test/lib/Dialect/NVGPU/CMakeLists.txt
    A mlir/test/lib/Dialect/NVGPU/TestNVGPUTransforms.cpp
    M mlir/tools/mlir-opt/CMakeLists.txt
    M mlir/tools/mlir-opt/mlir-opt.cpp

  Log Message:
  -----------
  [mlir][NVGPU] nvgpu.mmasync on F32 through TF32

Adds optional attribute to support tensor cores on F32 datatype by lowering to `mma.sync` with TF32 operands. Since, TF32 is not a native datatype in LLVM we are adding `tf32Enabled` as an attribute to allow the IR to be aware of `MmaSyncOp` datatype. Additionally, this patch adds placeholders for nvgpu-to-nvgpu transformation targeting higher precision tf32x3.

For mma.sync on f32 input using tensor cores there are two possibilites:
(a) tf32   (1 `mma.sync` per warp-level matrix-multiply-accumulate)
(b) tf32x3 (3 `mma.sync` per warp-level matrix-multiply-accumulate)

Typically, tf32 tensor core acceleration comes at a cost of accuracy from missing precision bits. While f32 has 23 precision bits, tf32 has only 10 precision bits. tf32x3 aims to recover the precision bits by splitting each operand into two tf32 values and issue three `mma.sync` tensor core operations.

Reviewed By: ThomasRaoux

Differential Revision: https://reviews.llvm.org/D130294