[PATCH] D140208: [AMDGPU] Improved wide multiplies

Tue Feb 14 07:34:36 PST 2023

OutOfCache added inline comments.

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp:3007
             auto Mul = B.buildMul(S32, Src0[j0], Src1[j1]);
-            if (!LocalAccum[0]) {
+            if (!LocalAccum[0] || KB.getKnownBits(LocalAccum[0]).isZero()) {
               LocalAccum[0] = Mul.getReg(0);
----------------
tsymalla wrote:
> OutOfCache wrote:
> > OutOfCache wrote:
> > > arsenm wrote:
> > > > tsymalla wrote:
> > > > > OutOfCache wrote:
> > > > > > arsenm wrote:
> > > > > > > OutOfCache wrote:
> > > > > > > > This check is required when the accumulator is a zero register.
> > > > > > > > 
> > > > > > > > `!LocalAccum[0]` only checks whether a Register exists; a Register that is known to be all zeroes still counts as existing.
> > > > > > > > This particular case occurs when the lower bytes of an operand are masked.
> > > > > > > > In that case, the check in line 3048 will fail and no `G_MAD` will be created. `LocalAccum[0]` will still be set to the result of the Unmerge of the `Tmp` register in line 3060. `Tmp` is set to a zero register in line 3041, so it is all zeroes at this point.
> > > > > > > > 
> > > > > > > > Stepping through with the debugger, I confirmed that in that case the first condition, `!LocalAccum[0]`, is false, but the second condition correctly evaluates to true, so the addition to 0 is skipped.
> > > > > > > If you're just looking for zero, just looking for the constant zero is cheaper than going through getKnownBits
> > > > > > Sounds like a good idea, but how do I do that?
> > > > > I guess he meant checking the operands for being zero explicitly. I think using `getKnownBits` is fine.
> > > > Check if it's G_CONSTANT i32 0. There are a few too many ways to check for it (I'd suggest MIPatternMatch's m_ZeroInt)
> > > I tried `mi_match(LocalAccum[0], MRI, m_ZeroInt())`, but for some reason it always returned false.
> > > 
> > > I also tried replacing `SrcXKnownZeros.push_back(KB.getKnownBits(SrcX[i]).isZero())` with `SrcXKnownZeros.push_back(mi_match(SrcX[i], MRI, m_ZeroInt()))`. Similarly, the `mi_match` version returned false where the `getKnownBits` version returned true.
> > > 
> > > This also caused the `@v_mul_i64_masked_src0_lo` and `@v_mul_i64_masked_src1_lo` tests to fail and produce multiplications with 0.
> > This is the Code before the Legalizer:
> > 
> > ```
> > bb.1.entry:
> >   liveins: $vgpr0, $vgpr1, $vgpr2, $vgpr3
> >   %2:_(s32) = COPY $vgpr0
> >   %3:_(s32) = COPY $vgpr1
> >   %0:_(s64) = G_MERGE_VALUES %2:_(s32), %3:_(s32)
> >   %4:_(s32) = COPY $vgpr2
> >   %5:_(s32) = COPY $vgpr3
> >   %1:_(s64) = G_MERGE_VALUES %4:_(s32), %5:_(s32)
> >   %6:_(s64) = G_CONSTANT i64 -4294967296
> >   %7:_(s64) = G_AND %1:_, %6:_
> >   %8:_(s64) = G_MUL %0:_, %7:_
> >   %9:_(s32), %10:_(s32) = G_UNMERGE_VALUES %8:_(s64)
> >   $vgpr0 = COPY %9:_(s32)
> >   $vgpr1 = COPY %10:_(s32)
> >   SI_RETURN implicit $vgpr0, implicit $vgpr1
> > ```
> > 
> > The only G_CONSTANTs are the mask for the G_AND and a 64-bit 0 for the G_MAD addition.
> Your MIR should look something like this at the time of the LocalAccum[0] check, so you need to query appropriately.
> 
> ```
> /Users/seuchomat/Documents/Projekte/C++/llvm-project/build/bin/llc -global-isel -march=amdgcn -mcpu=gfx1010 /Users/seuchomat/Documents/Projekte/C++/llvm-project/llvm/test/CodeGen/AMDGPU/GlobalISel/mad.mir -run-pass=legalizer
>   %18:_(s32) = G_MUL %11:_, %14:_
> # Machine code for function test_mad: IsSSA, NoPHIs
> 
> bb.0.entry:
>   %0:_(s32) = COPY $vgpr0
>   %1:_(s32) = COPY $vgpr1
>   %2:_(s64) = G_MERGE_VALUES %0:_(s32), %1:_(s32)
>   %3:_(s32) = COPY $vgpr2
>   %4:_(s32) = COPY $vgpr3
>   %5:_(s64) = G_MERGE_VALUES %3:_(s32), %4:_(s32)
>   %6:_(s64) = G_CONSTANT i64 -4294967296
>   %7:_(s64) = G_AND %5:_, %6:_
>   %11:_(s32), %13:_(s32) = G_UNMERGE_VALUES %2:_(s64)
>   %12:_(s32), %14:_(s32) = G_UNMERGE_VALUES %7:_(s64)
>   %15:_(s64) = G_CONSTANT i64 0
>   %16:_(s32), %17:_(s32) = G_UNMERGE_VALUES %15:_(s64)
>   %18:_(s32) = G_MUL %11:_, %14:_
>   %8:_(s64) = G_MUL %2:_, %7:_
>   %9:_(s32), %10:_(s32) = G_UNMERGE_VALUES %8:_(s64)
>   $vgpr0 = COPY %9:_(s32)
>   $vgpr1 = COPY %10:_(s32)
>   SI_RETURN implicit $vgpr0, implicit $vgpr1
> 
> # End machine code for function test_mad.
> ```
> 
> So I think that when you apply your pattern match to the register itself, `llvm::getConstantVRegValWithLookThrough` tries to find the last instruction that uses the constant 0, which is the G_UNMERGE_VALUES itself.
I tried following the suggestion of using `mi_match`. Unfortunately, it did not work: there is no `G_CONSTANT i32` that can be matched.

`LocalAccum` holds the result of the `G_MAD`, which is not itself a `G_CONSTANT`, even if the result is 0.
```
  %16:_(s64), %17:_(s1) = G_AMDGPU_MAD_U64_U32 %2:_(s32), %12:_, %15:_
  %18:_(s32), %19:_(s32) = G_UNMERGE_VALUES %16:_(s64)
```
Essentially, we are looking at %18 and %19. If either of them is all zeroes, which (to my knowledge) we can only determine through the known bits analysis, we can save the following G_ADDs:
```
%20:_(s32) = G_MUL %2:_, %14:_
%21:_(s32) = G_ADD %19:_, %20:_
%22:_(s32) = G_MUL %3:_, %12:_
%23:_(s32) = G_ADD %21:_, %22:_
```
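
To illustrate the saving outside of LLVM, here is a minimal standalone sketch of a 64-bit multiply assembled from 32-bit halves. It is a toy model, not the legalizer code: `MulResult`, `mul64ViaHalves` and the `accumKnownZero` flag are hypothetical names, and the flag merely stands in for `KB.getKnownBits(LocalAccum[0]).isZero()`. It only models skipping the first addition onto the accumulator.

```cpp
#include <cassert>
#include <cstdint>

// Toy model (not LLVM code): result of a 64x64 -> 64-bit multiply together
// with the number of 32-bit additions that were needed for the high half.
struct MulResult {
  uint64_t value;   // low 64 bits of the product
  unsigned numAdds; // additions performed on the high-half accumulator
};

// accumKnownZero is a hypothetical stand-in for the known-bits query on the
// accumulator; the caller claims the high half of aLo * bLo is zero.
MulResult mul64ViaHalves(uint64_t a, uint64_t b, bool accumKnownZero) {
  const uint32_t aLo = static_cast<uint32_t>(a);
  const uint32_t aHi = static_cast<uint32_t>(a >> 32);
  const uint32_t bLo = static_cast<uint32_t>(b);
  const uint32_t bHi = static_cast<uint32_t>(b >> 32);

  // Plays the role of the MAD (aLo * bLo + 0) followed by an unmerge.
  const uint64_t mad = static_cast<uint64_t>(aLo) * bLo;
  const uint32_t lo = static_cast<uint32_t>(mad);
  uint32_t accum = static_cast<uint32_t>(mad >> 32);
  assert(!accumKnownZero || accum == 0); // the claim must be sound

  unsigned numAdds = 0;

  // First cross product, truncated to 32 bits (like %20 = G_MUL).
  const uint32_t cross0 =
      static_cast<uint32_t>(static_cast<uint64_t>(aLo) * bHi);
  if (accumKnownZero) {
    accum = cross0; // adding to a known zero is pointless: reuse the product
  } else {
    accum += cross0;
    ++numAdds;
  }

  // Second cross product (like %22 = G_MUL), always added in this sketch.
  const uint32_t cross1 =
      static_cast<uint32_t>(static_cast<uint64_t>(aHi) * bLo);
  accum += cross1;
  ++numAdds;

  return {(static_cast<uint64_t>(accum) << 32) | lo, numAdds};
}
```

With the low half of `b` masked off, calling `mul64ViaHalves(a, b, true)` yields the same value as `a * b` while performing one addition fewer than the call with `false`.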
The multiplication operands are held in VGPRs, for which `mi_match` does not work.
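
The difference between matching a constant zero and asking a known-bits analysis can be illustrated with a small standalone model. This is only a sketch under simplifying assumptions: `Op`, `Inst`, `isConstantZero`, `knownZeroBits`, `isKnownZero` and `makeMaskedLowHalfProgram` are hypothetical names, the "IR" is a plain vector of 32-bit instructions, and none of it reflects how `mi_match` or `GISelKnownBits` are actually implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy 32-bit IR: a value is identified by its index in the program.
enum class Op { Copy, Constant, And, Mul };

struct Inst {
  Op op;
  uint32_t imm; // only meaningful for Op::Constant
  int lhs;      // operand indices, -1 if unused
  int rhs;
};

// What a constant pattern match can see: only a literal zero definition.
bool isConstantZero(const std::vector<Inst> &prog, int v) {
  assert(v >= 0 && static_cast<std::size_t>(v) < prog.size());
  return prog[v].op == Op::Constant && prog[v].imm == 0;
}

// Returns a mask of the bits that are provably zero in value v.
// Operands always refer to earlier instructions, so the recursion ends.
uint32_t knownZeroBits(const std::vector<Inst> &prog, int v) {
  assert(v >= 0 && static_cast<std::size_t>(v) < prog.size());
  const Inst &inst = prog[v];
  switch (inst.op) {
  case Op::Constant:
    return ~inst.imm;
  case Op::And: // a bit is zero if it is zero in either operand
    return knownZeroBits(prog, inst.lhs) | knownZeroBits(prog, inst.rhs);
  case Op::Mul: { // conservative: only handle an all-zero operand
    const bool lhsZero = knownZeroBits(prog, inst.lhs) == 0xFFFFFFFFu;
    const bool rhsZero = knownZeroBits(prog, inst.rhs) == 0xFFFFFFFFu;
    return (lhsZero || rhsZero) ? 0xFFFFFFFFu : 0u;
  }
  case Op::Copy: // value comes from a physical register: nothing is known
    return 0u;
  }
  return 0u;
}

bool isKnownZero(const std::vector<Inst> &prog, int v) {
  return knownZeroBits(prog, v) == 0xFFFFFFFFu;
}

// Models the low half of an operand masked with 0xFFFFFFFF00000000.
std::vector<Inst> makeMaskedLowHalfProgram() {
  return {
      {Op::Copy, 0, -1, -1},    // v0: low half of the operand, from a VGPR
      {Op::Constant, 0, -1, -1}, // v1: low half of the mask
      {Op::And, 0, 0, 1},       // v2: masked low half, zero but not constant
      {Op::Copy, 0, -1, -1},    // v3: other operand, from a VGPR
      {Op::Mul, 0, 3, 2},       // v4: product with the masked half
  };
}
```

In this model, `isConstantZero` succeeds only for `v1`, whereas `isKnownZero` also reports `v2` and `v4` as zero, which is the kind of information the accumulator check relies on.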

Thank you for the suggestion, though, @arsenm! It would have been more efficient if I had managed to make it work. Plus, I learned a lot along the way.

In case I should try something else, feel free to let me know.

I apologize for the delayed answer; I took time off for my exams.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D140208/new/

https://reviews.llvm.org/D140208