[PATCH] D140208: [AMDGPU] Improved wide multiplies
Jessica Del via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Tue Feb 14 07:34:36 PST 2023
OutOfCache added inline comments.
================
Comment at: llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp:3007
auto Mul = B.buildMul(S32, Src0[j0], Src1[j1]);
- if (!LocalAccum[0]) {
+ if (!LocalAccum[0] || KB.getKnownBits(LocalAccum[0]).isZero()) {
LocalAccum[0] = Mul.getReg(0);
----------------
tsymalla wrote:
> OutOfCache wrote:
> > OutOfCache wrote:
> > > arsenm wrote:
> > > > tsymalla wrote:
> > > > > OutOfCache wrote:
> > > > > > arsenm wrote:
> > > > > > > OutOfCache wrote:
> > > > > > > > This check is required, when the accumulator is a zero register.
> > > > > > > >
> > > > > > > > `!LocalAccum[0]` only checks for the existence of a Register. It is still true, if the Register is known to be all zeroes.
> > > > > > > > This particular case occurs when the lower bytes of an operand are masked.
> > > > > > > > In that case, the check in line 3048 will fail and no `G_MAD` will be created. `LocalAccum[0]` will still be set to the result of the Unmerge of the `Tmp` register in line 3060. `Tmp` is set to a zero register in line 3041, so it is all zeroes at this point.
> > > > > > > >
> > > > > > > > By stepping through the debugger, I confirmed that in that case the first condition, `!LocalAccum[0]` will be false, but the second condition will be correctly evaluated to true and therefore skip the addition to 0.
> > > > > > > If you're just looking for zero, just looking for the constant zero is cheaper than going through getKnownBits
> > > > > > Sounds like a good idea, but how do I do that?
> > > > > I guess he meant checking the operands for being zero explicitly. I think using `getKnownBits` is fine.
> > > > Check if it's G_CONSTANT i32 0. There are a few too many ways to check for it (I'd suggest MIPatternMatch's m_ZeroInt)
> > > I tried `mi_match(LocalAccum[0], MRI, m_ZeroInt())`, but for some reason it always returned false.
> > >
> > > I also tried replacing the `SrcXKnownZeros.push_back(KB.getKnownBits(SrcX[i]).isZero())` with `Src0KnownZeros.push_back(mi_match(SrcX[i], MRI, m_ZeroInt())` and similarly, it returned false when the first one returned true.
> > >
> > > This also caused the `@v_mul_i64_masked_src0_lo` and `@v_mul_i64_masked_src1_lo` tests to fail and produce multiplications with 0.
> > This is the Code before the Legalizer:
> >
> > ```
> > bb.1.entry:
> > liveins: $vgpr0, $vgpr1, $vgpr2, $vgpr3
> > %2:_(s32) = COPY $vgpr0
> > %3:_(s32) = COPY $vgpr1
> > %0:_(s64) = G_MERGE_VALUES %2:_(s32), %3:_(s32)
> > %4:_(s32) = COPY $vgpr2
> > %5:_(s32) = COPY $vgpr3
> > %1:_(s64) = G_MERGE_VALUES %4:_(s32), %5:_(s32)
> > %6:_(s64) = G_CONSTANT i64 -4294967296
> > %7:_(s64) = G_AND %1:_, %6:_
> > %8:_(s64) = G_MUL %0:_, %7:_
> > %9:_(s32), %10:_(s32) = G_UNMERGE_VALUES %8:_(s64)
> > $vgpr0 = COPY %9:_(s32)
> > $vgpr1 = COPY %10:_(s32)
> > SI_RETURN implicit $vgpr0, implicit $vgpr1
> > ```
> >
> > The only G_CONSTANTs are the mask for the G_AND and a 64-bit 0 for the G_MAD addition
> Your MIR should look something like that at the time of the LocalAccum[0] check, so you need to query appropriately.
>
> ```
> /Users/seuchomat/Documents/Projekte/C++/llvm-project/build/bin/llc -global-isel -march=amdgcn -mcpu=gfx1010 /Users/seuchomat/Documents/Projekte/C++/llvm-project/llvm/test/CodeGen/AMDGPU/GlobalISel/mad.mir -run-pass=legalizer
> %18:_(s32) = G_MUL %11:_, %14:_
> # Machine code for function test_mad: IsSSA, NoPHIs
>
> bb.0.entry:
> %0:_(s32) = COPY $vgpr0
> %1:_(s32) = COPY $vgpr1
> %2:_(s64) = G_MERGE_VALUES %0:_(s32), %1:_(s32)
> %3:_(s32) = COPY $vgpr2
> %4:_(s32) = COPY $vgpr3
> %5:_(s64) = G_MERGE_VALUES %3:_(s32), %4:_(s32)
> %6:_(s64) = G_CONSTANT i64 -4294967296
> %7:_(s64) = G_AND %5:_, %6:_
> %11:_(s32), %13:_(s32) = G_UNMERGE_VALUES %2:_(s64)
> %12:_(s32), %14:_(s32) = G_UNMERGE_VALUES %7:_(s64)
> %15:_(s64) = G_CONSTANT i64 0
> %16:_(s32), %17:_(s32) = G_UNMERGE_VALUES %15:_(s64)
> %18:_(s32) = G_MUL %11:_, %14:_
> %8:_(s64) = G_MUL %2:_, %7:_
> %9:_(s32), %10:_(s32) = G_UNMERGE_VALUES %8:_(s64)
> $vgpr0 = COPY %9:_(s32)
> $vgpr1 = COPY %10:_(s32)
> SI_RETURN implicit $vgpr0, implicit $vgpr1
>
> # End machine code for function test_mad.
> ```
>
> So, I think, by applying your pattern matching to the register itself, it tries to find the last instruction that uses the constant 0 which is the G_UNMERGE_VALUES itself in llvm::getConstantVRegValWithLookThrough.
I tried following the suggestion of using `mi_match`. Unfortunately, it did not work. There is no `i32 G_CONSTANT` that can be matched.
The `LocalAccum` has the result of the `G_MAD`, which is not using a `G_CONSTANT`, even if the result is 0.
```
%16:_(s64), %17:_(s1) = G_AMDGPU_MAD_U64_U32 %2:_(s32), %12:_, %15:_
%18:_(s32), %19:_(s32) = G_UNMERGE_VALUES %16:_(s64)
```
Essentially we are looking at %18 and %19. If any of these are all zeroes, which we only know using the Known Bits Analysis (to my knowledge), we can save the following G_ADDs here:
```
%20:_(s32) = G_MUL %2:_, %14:_
%21:_(s32) = G_ADD %19:_, %20:_
%22:_(s32) = G_MUL %3:_, %12:_
%23:_(s32) = G_ADD %21:_, %22:_
```
The Multiplication Arguments are within VGPRs, for which the `mi_match` does not work.
Thank you for the suggestion, though, @arsenm! It would have been more efficient if I managed to make it work. Plus, I learned a lot along the way.
In case I should try something else, feel free to let me know.
I apologize for the delay regarding my answer, I took time off for my exams.
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D140208/new/
https://reviews.llvm.org/D140208
More information about the llvm-commits
mailing list