[PATCH] D158059: [AMDGPU/wmma] - Disable 3-address syntax for f16

Sebastian Neubauer via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Fri Sep 8 08:22:30 PDT 2023


sebastian-ne added a comment.

I’ll try to explain what the new tied intrinsic does and where it is useful:

Our frontend sees matrix multiplications of multiple, different 16-bit matrices. Each of these matrices takes 8 VGPRs. However, a 16-bit matrix only uses either the high or the low half of each of the 8 VGPRs.
So what our compiler tries to do is merge a pair of independent 16-bit matrices into the same 8 VGPRs, with one matrix taking the low half of each register and the other taking the high half.
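
To make the packing concrete, here is a rough sketch in Python rather than IR. It models a single lane only, treats each VGPR as a 32-bit integer and each matrix element as a raw 16-bit pattern. The names `combine_halves` and `extract_half` are made up for illustration and are not the actual frontend intrinsics.

```python
MASK16 = 0xFFFF

def combine_halves(lo_matrix, hi_matrix):
    """Pack two 8-element 16-bit matrices into 8 modelled 32-bit registers."""
    assert len(lo_matrix) == len(hi_matrix) == 8
    # Low matrix goes to bits [15:0], high matrix to bits [31:16].
    return [((hi & MASK16) << 16) | (lo & MASK16)
            for lo, hi in zip(lo_matrix, hi_matrix)]

def extract_half(combined, high):
    """Read back one of the two packed matrices."""
    shift = 16 if high else 0
    return [(reg >> shift) & MASK16 for reg in combined]

# Arbitrary example bit patterns, one per register.
lo = [0x0001 * (i + 1) for i in range(8)]
hi = [0x0100 * (i + 1) for i in range(8)]
combined = combine_halves(lo, hi)
```

Both matrices can be read back unchanged from `combined`, which is why the pair fits in 8 registers instead of 16.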

In IR, we have intrinsics that work on such a combined matrix-pair:

  %combined = call <8 x float> @combine_halfs(<8 x float> %lo_16_bit_matrix, <8 x float> %hi_16_bit_matrix)  ; type is like <16 x half> with every second half in use
  %low_half_multiplied = call <8 x float> @wmma(%a, %b, <8 x float> %combined, i1 false /* low half */)
  %high_half_multiplied = call <8 x float> @wmma(%a, %b, <8 x float> %low_half_multiplied, i1 true /* high half */)

So far so good: we halve the VGPR usage. Now we need to lower our `@wmma` intrinsic to llvm.amdgcn intrinsics.
For this, we need a wmma intrinsic that uses the low or high half of our value as accumulator (c matrix) **and preserves the other half of the value**.
If we do not preserve the other half, we would lose the second part of the packed matrix.

This is where the new tied intrinsic comes into play: it guarantees that the “untouched” part of the value is preserved by tying input and output to the same physical VGPRs.
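
The difference can be sketched with a small Python model (single lane, each VGPR as a 32-bit integer). This is only an illustration under assumptions: `wmma_update_half` is a hypothetical helper, the matrix math itself is skipped and `new_values` stands in for its 16-bit results, and the non-tied case models the unpreserved half as zero, whereas in reality it would be whatever the destination register happened to hold.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF

def wmma_update_half(acc_regs, new_values, high, preserve_other_half):
    """Write new 16-bit results into one half of each modelled register."""
    shift = 16 if high else 0
    out = []
    for reg, val in zip(acc_regs, new_values):
        written = (val & MASK16) << shift
        if preserve_other_half:
            # Tied case: output reuses the input register, other half survives.
            kept = reg & ~(MASK16 << shift) & MASK32
        else:
            # Assumption: model the lost half as zero.
            kept = 0
        out.append(kept | written)
    return out

# High halves hold a second matrix (pattern 0xAAAA); we update the low halves.
packed = [0xAAAA0000 | i for i in range(8)]
new_low = [0x1000 + i for i in range(8)]

tied = wmma_update_half(packed, new_low, high=False, preserve_other_half=True)
untied = wmma_update_half(packed, new_low, high=False, preserve_other_half=False)
```

In the tied result the high halves still contain `0xAAAA`, while in the untied model the second matrix is gone.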


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D158059/new/

https://reviews.llvm.org/D158059


