[all-commits] [llvm/llvm-project] 8548c4: [mlir] GEMM Hopper Tensor Core Integration Test
Guray Ozen via All-commits
all-commits at lists.llvm.org
Sun Mar 3 13:40:23 PST 2024
Branch: refs/reviewable/pr81478/r4
Home: https://github.com/llvm/llvm-project
Commit: 8548c413400f7adbcc4728b7e33f68d388e2aefe
https://github.com/llvm/llvm-project/commit/8548c413400f7adbcc4728b7e33f68d388e2aefe
Author: grypp <guray.ozen at gmail.com>
Date: 2024-03-03 (Sun, 03 Mar 2024)
Changed paths:
A mlir/test/Integration/GPU/CUDA/sm90/python/lit.local.cfg
A mlir/test/Integration/GPU/CUDA/sm90/python/matmul.py
A mlir/test/Integration/GPU/CUDA/sm90/python/tools/lit.local.cfg
A mlir/test/Integration/GPU/CUDA/sm90/python/tools/matmulBuilder.py
A mlir/test/Integration/GPU/CUDA/sm90/python/tools/nvgpucompiler.py
Log Message:
-----------
[mlir] GEMM Hopper Tensor Core Integration Test
This test validates the correctness of the GEMM kernels supported by the
NVGPU dialect, currently the Multistage and Warp Specialization kernels.
The test builds the IR by metaprogramming with the MLIR Python bindings,
so the shape, tile size, and data type of the GEMM can be changed freely
for testing purposes.
The entry function is `matmul`, where one can specify GEMM shape, tile size,
data type, GEMM algorithm (Multistage or Warp Specialization), and the maximum
number of stages.
Verification is done via numpy's matmul operation.
Example:
```
matmul(input_type=np.float16, # input types
output_type=np.float32, # output type
M=4096, N=4096, K=4096, # Shape
BLOCK_M=128, BLOCK_N=128, BLOCK_K=64, # Tile Size
use_warp_specialization=True, # Enable Warp Specialization
max_num_stages=3) # Number of stages in shared memory
```
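The numpy-based verification can be sketched as follows. This is an illustrative stand-in, not the test's actual code: `verify_matmul` is a hypothetical name, and `kernel_result` here is computed with numpy itself, where the real test would read back the buffer produced by the compiled kernel.

```python
import numpy as np

def verify_matmul(input_type=np.float16, output_type=np.float32,
                  M=64, N=64, K=32, seed=0):
    """Illustrative check: compare a GEMM result against numpy's matmul.

    kernel_result is a stand-in computed with numpy; in the real test it
    would be the output buffer of the compiled NVGPU kernel.
    """
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((M, K)).astype(input_type)
    b = rng.standard_normal((K, N)).astype(input_type)

    kernel_result = a.astype(output_type) @ b.astype(output_type)  # stand-in
    reference = np.matmul(a.astype(output_type), b.astype(output_type))

    # Loose tolerances: fp16 inputs accumulate rounding error.
    return np.allclose(kernel_result, reference, rtol=1e-2, atol=1e-2)

assert verify_matmul()
```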
### Parallelism Across CTAs
A GEMM consists of three loops over the dimensions passed to the
`matmul` function.
The program builds IR with the following loop structure: the loops are
tiled with the given tile sizes, and the two outermost loops are
parallelized across the first and second dimensions of the CTA grid.
```
for(bi = 0; bi < M; bi += BLOCK_M)      # parallelize across blockIdx.x
  for(bj = 0; bj < N; bj += BLOCK_N)    # parallelize across blockIdx.y
    for(bk = 0; bk < K; bk += BLOCK_K)
      for(i = bi; i < (bi + BLOCK_M); ++i)
        for(j = bj; j < (bj + BLOCK_N); ++j)
          for(k = bk; k < (bk + BLOCK_K); ++k)
```
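As a sanity check, the tiled loop nest above can be written out in plain Python (a slow reference, not the generated IR) and compared against numpy; the two outermost loops are the ones that map to blockIdx.x and blockIdx.y:

```python
import numpy as np

def tiled_matmul(a, b, BLOCK_M=4, BLOCK_N=4, BLOCK_K=2):
    """Reference tiled GEMM mirroring the loop structure above."""
    M, K = a.shape
    K2, N = b.shape
    assert K == K2
    c = np.zeros((M, N), dtype=np.float64)
    for bi in range(0, M, BLOCK_M):        # -> blockIdx.x
        for bj in range(0, N, BLOCK_N):    # -> blockIdx.y
            for bk in range(0, K, BLOCK_K):
                for i in range(bi, min(bi + BLOCK_M, M)):
                    for j in range(bj, min(bj + BLOCK_N, N)):
                        for k in range(bk, min(bk + BLOCK_K, K)):
                            c[i, j] += a[i, k] * b[k, j]
    return c

a = np.arange(8 * 6, dtype=np.float64).reshape(8, 6)
b = np.arange(6 * 8, dtype=np.float64).reshape(6, 8)
assert np.allclose(tiled_matmul(a, b), a @ b)
```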
## Multistage Kernel
This kernel launches a single warp group (128 threads). The primary thread
(pthread) requests loads from the TMA. The threads collectively wait for the
data and perform mma operations. After the main loop completes, the threads
first store the fragmented registers to shared memory, then copy from shared
memory to global memory; this part is called the epilogue.
Execution Timeline of Multistage Kernel with 3 stages:
```
+-------+----------------+--------------------+--------------------+--------------------+-----+-----------------------+
| |Prologue ----> |MainLoop ----> |Epilogue |
+-------+----------------+--------------------+--------------------+--------------------+-----+-----------------------+
|pthread|[tma-0,1,2] |[wait-0][mma][tma-2]|[wait-1][mma][tma-0]|[wait-2][mma][tma-1]| ... | [mma-wait] |[epilogue]|
|wgroup | ........ |[wait-0][mma] |[wait-1][mma] |[wait-2][mma] | ... | [mma-wait] |[epilogue]|
+-------+----------------+--------------------+--------------------+--------------------+-----+-----------------------+
```
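The circular-buffer pipelining in this timeline can be modeled with a small scheduling sketch. This is a simplified model under an assumption: each iteration re-issues a TMA load into the stage freed by the previous iteration (which reproduces the tma-2, tma-0, tma-1, ... pattern above), and trailing loads past the end of the K loop are not suppressed.

```python
def multistage_schedule(num_iterations, num_stages):
    """Simplified model of the multistage pipeline: the prologue fills
    every shared-memory stage, then each main-loop iteration waits on
    its stage, runs mma, and re-issues a TMA load (circular buffer)."""
    events = [f"tma-{s}" for s in range(num_stages)]  # prologue
    for i in range(num_iterations):
        events.append(f"wait-{i % num_stages}")
        events.append("mma")
        # Assumed reload target: the stage freed one iteration earlier.
        events.append(f"tma-{(i - 1) % num_stages}")
    events.append("mma-wait")
    events.append("epilogue")
    return events

print(multistage_schedule(num_iterations=3, num_stages=3))
```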
## Warp Specialization Kernel
This kernel launches 2 warp groups (2x128 threads) per CTA, specializing one
as the `producer warp group` and the other as the `consumer warp group`. The
`producer warp group` is responsible for requesting TMA loads, while the
`consumer warp group` performs the mma operations. The epilogue section is
handled by the `consumer warp group`, as its threads own the fragmented registers.
Execution Timeline of Warp Specialization Kernel with 2 stages:
```
+--------+--------+---------+---------+---------+-----------------------+---+--------------+-----------------+
| |MainLoop ----> | 1st Epilogue | 2nd Epilogue |
+--------+--------+---------+---------+---------+-----------------------+---+--------------+-----------------+
|pthread1|[tma-0] | [tma-1] | [tma-0] | [tma-1] | ..........................| ........... | [shmem->global] |
|wgroup1 | .......| | | | | | [shmem->global] |
+--------+--------+---------+---------+---------+-----------------------+---+--------------+-----------------+
|wgroup2 |[wait-0][mma], [wait-1][mma], [wait-0][mma], [wait-1][mma], ......| [reg->shmem] | [shmem->global]|
+--------+--------+---------+---------+---------+-----------------------+---+--------------+-----------------+
```
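The producer/consumer split can be illustrated with an ordinary bounded-queue pipeline in Python. This is only a CPU analogy, not how the kernel is written: the queue depth plays the role of the number of shared-memory stages, the producer thread stands in for the TMA-issuing warp group, and the consumer thread stands in for the mma warp group.

```python
import queue
import threading

import numpy as np

def warp_specialized_matmul(a, b, block_k=2, num_stages=2):
    """CPU analogy of warp specialization: one thread 'loads' K-tiles
    (producer warp group), another multiplies and accumulates them
    (consumer warp group)."""
    K = a.shape[1]
    tiles = queue.Queue(maxsize=num_stages)  # ~ shared-memory stages
    c = np.zeros((a.shape[0], b.shape[1]))

    def producer():  # ~ TMA loads
        for bk in range(0, K, block_k):
            tiles.put((a[:, bk:bk + block_k], b[bk:bk + block_k, :]))
        tiles.put(None)  # signal end of the K loop

    def consumer():  # ~ mma
        while (item := tiles.get()) is not None:
            at, bt = item
            c[...] += at @ bt  # accumulate in place

    t = threading.Thread(target=producer)
    t.start()
    consumer()
    t.join()
    return c

a = np.arange(12, dtype=float).reshape(4, 3)
b = np.arange(15, dtype=float).reshape(3, 5)
assert np.allclose(warp_specialized_matmul(a, b), a @ b)
```

The bounded `maxsize` is what makes the analogy work: the producer blocks once all stages are full, just as the producer warp group must wait for the consumer to free a shared-memory stage.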
Commit: 379fa54d3583a7ecb0753d12124485c7050dac69
https://github.com/llvm/llvm-project/commit/379fa54d3583a7ecb0753d12124485c7050dac69
Author: grypp <guray.ozen at gmail.com>
Date: 2024-03-03 (Sun, 03 Mar 2024)
Changed paths:
M mlir/test/Integration/GPU/CUDA/sm90/python/matmul.py
M mlir/test/Integration/GPU/CUDA/sm90/python/tools/matmulBuilder.py
M mlir/test/Integration/GPU/CUDA/sm90/python/tools/nvgpucompiler.py
Log Message:
-----------
format with yapf
Commit: 39f7213b299b316f59555a6b095b57217d0fdaab
https://github.com/llvm/llvm-project/commit/39f7213b299b316f59555a6b095b57217d0fdaab
Author: grypp <guray.ozen at gmail.com>
Date: 2024-03-03 (Sun, 03 Mar 2024)
Changed paths:
M mlir/test/Integration/GPU/CUDA/sm90/python/matmul.py
M mlir/test/Integration/GPU/CUDA/sm90/python/tools/matmulBuilder.py
M mlir/test/Integration/GPU/CUDA/sm90/python/tools/nvgpucompiler.py
Log Message:
-----------
format it with black
Commit: 9b65ffd405b5f5a241d0129223d04fc0a68730e3
https://github.com/llvm/llvm-project/commit/9b65ffd405b5f5a241d0129223d04fc0a68730e3
Author: grypp <guray.ozen at gmail.com>
Date: 2024-03-03 (Sun, 03 Mar 2024)
Changed paths:
M mlir/test/Integration/GPU/CUDA/sm90/python/matmul.py
Log Message:
-----------
fix the spelling mistake
Commit: a31f9b17d8b90c5828b8f2373b31573e1add52bc
https://github.com/llvm/llvm-project/commit/a31f9b17d8b90c5828b8f2373b31573e1add52bc
Author: grypp <guray.ozen at gmail.com>
Date: 2024-03-03 (Sun, 03 Mar 2024)
Changed paths:
M mlir/test/Integration/GPU/CUDA/sm90/python/tools/matmulBuilder.py
Log Message:
-----------
address comments
Commit: 777f2089b144b935ba97d291ea3f66c0622a6e2a
https://github.com/llvm/llvm-project/commit/777f2089b144b935ba97d291ea3f66c0622a6e2a
Author: grypp <guray.ozen at gmail.com>
Date: 2024-03-03 (Sun, 03 Mar 2024)
Changed paths:
M mlir/test/Integration/GPU/CUDA/sm90/python/tools/matmulBuilder.py
Log Message:
-----------
format
Commit: 0934dcb2d2037f0595e982fc05577fbf8cea3fa6
https://github.com/llvm/llvm-project/commit/0934dcb2d2037f0595e982fc05577fbf8cea3fa6
Author: grypp <guray.ozen at gmail.com>
Date: 2024-03-03 (Sun, 03 Mar 2024)
Changed paths:
M mlir/test/Integration/GPU/CUDA/sm90/python/matmul.py
M mlir/test/Integration/GPU/CUDA/sm90/python/tools/matmulBuilder.py
Log Message:
-----------
Allow multiple stages, and fix the kernels.
Add test_short that test multiple cases.
Commit: 861fe27fe18f398ada57b5940f01df39e77ea3d6
https://github.com/llvm/llvm-project/commit/861fe27fe18f398ada57b5940f01df39e77ea3d6
Author: grypp <guray.ozen at gmail.com>
Date: 2024-03-03 (Sun, 03 Mar 2024)
Changed paths:
M mlir/test/Integration/GPU/CUDA/sm90/python/matmul.py
M mlir/test/Integration/GPU/CUDA/sm90/python/tools/matmulBuilder.py
Log Message:
-----------
format
Commit: 89370ef6180d2ce6f0b81ac106ecd623b89f3cda
https://github.com/llvm/llvm-project/commit/89370ef6180d2ce6f0b81ac106ecd623b89f3cda
Author: grypp <guray.ozen at gmail.com>
Date: 2024-03-03 (Sun, 03 Mar 2024)
Changed paths:
M mlir/test/Integration/GPU/CUDA/sm90/python/tools/matmulBuilder.py
Log Message:
-----------
Add asserts 128x128x64
Compare: https://github.com/llvm/llvm-project/compare/8548c413400f%5E...89370ef6180d