[PATCH] D158059: [AMDGPU/wmma] - Disable 3-address syntax for f16

Jessica Del via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Fri Aug 18 01:54:16 PDT 2023


OutOfCache added inline comments.


================
Comment at: llvm/lib/Target/AMDGPU/VOP3PInstructions.td:863
     let Mnemonic = Instr, mayRaiseFPException = 0, ReadsModeReg = 0 in {
-      let Constraints = WMMAConstraints2Addr, isConvertibleToThreeAddress = 1 in {
+      let Constraints = WMMAConstraints2Addr, isConvertibleToThreeAddress = isConvertableTo3Addr in {
         def _twoaddr_w32 : VOP3P_Pseudo<Instr # Suffix, WMMAProfile>;
----------------
arsenm wrote:
> OutOfCache wrote:
> > arsenm wrote:
> > > Shouldn’t lie about properties, disable at a different point?
> > Where would you suggest?
> Does this happen in two address instructions? I assume the current heuristic assumes a simple register and isn’t accounting for large tuple increasing pressure 
Sorry, what do you mean by 'does this happen'? If you are talking about the conversion from two address to three address instruction, then yes, it can happen. I encountered issues when compiling Stable Diffusion shaders that initialize the matrices via the `zeroinitializer` constant outside a loop.

In the loop case, each matrix receives a dedicated zero matrix. Without the loop, the same zero matrix is reused for multiple `wmma`s. So we have the following scenario:

```
v_wmma_f16_16x16x16_f16 v[0:7],   ..., v[24:31]
v_wmma_f16_16x16x16_f16 v[32:39], ..., v[24:31]
```
So we use `v[24:31]` as the zero input matrix, but write to a different destination matrix (e.g., `v[0:7]`). The problem arises once we try to pack:

```
v_wmma_f16_16x16x16_f16 v[0:7],     ..., v[0:7] op_sel [0,0,1]
```

We have no guarantee that the upper halves of `v[0:7]` are zero initialized. In practice, this indeed caused issues (completely black images as SD output). This change fixed it.
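To make the failure mode concrete, here is a toy Python model of a single lane. It is not LLVM or hardware code, only a sketch under stated assumptions: the f16 `wmma` is assumed to read its accumulator from, and write its result to, only the 16-bit half of each register selected by `op_sel`, leaving the other half of the destination untouched. Plain integers stand in for f16 values, and the names `wmma_f16` and `run`, the stale value `0xDEAD`, and the products 5 and 7 are all made up for illustration.

```python
MASK16 = 0xFFFF


def read_half(reg, hi):
    """Return the upper (hi=True) or lower 16 bits of a 32-bit register value."""
    return (reg >> 16) & MASK16 if hi else reg & MASK16


def write_half(reg, hi, value):
    """Overwrite only the selected 16-bit half; the other half is preserved."""
    value &= MASK16
    if hi:
        return (reg & MASK16) | (value << 16)
    return (reg & 0xFFFF0000) | value


def wmma_f16(regs, dst, src2, product, hi):
    """Hypothetical one-lane stand-in for an f16 wmma: D.half = A*B + C.half."""
    acc = read_half(regs[src2], hi)
    regs[dst] = write_half(regs[dst], hi, product + acc)


def run(three_address):
    # v0 holds stale data in its upper half; v24 is the shared zero matrix.
    regs = {"v0": 0xDEAD0000, "v24": 0x00000000}
    if three_address:
        # Destination differs from src2: v0's upper half is never zeroed.
        wmma_f16(regs, "v0", "v24", 5, hi=False)
    else:
        # Two-address form: dst is tied to src2, so the zero matrix is
        # copied into v0 first and both halves start out as zero.
        regs["v0"] = regs["v24"]
        wmma_f16(regs, "v0", "v0", 5, hi=False)
    # Packed second wmma: accumulates from and writes to the upper half of v0.
    wmma_f16(regs, "v0", "v0", 7, hi=True)
    return read_half(regs["v0"], False), read_half(regs["v0"], True)
```

In this model `run(False)` yields the expected halves `(5, 7)`, while `run(True)` accumulates the stale upper half into the second result.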

You are right that disabling the ability to convert is not the best solution, and I would be happy to move this change elsewhere. Currently I don't know a better place, though. Can we maybe adapt the heuristics to always keep the two address mode for constant input matrices like the `zeroinitializer`? Or does that not make sense either?


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D158059/new/

https://reviews.llvm.org/D158059


