[PATCH] D71432: [AArch64][SVE] Proposal to use op+select to match scalable predicated operations

Tue Jan 7 09:27:43 PST 2020

sdesmalen added a comment.

In D71432#1803402 <https://reviews.llvm.org/D71432#1803402>, @cameron.mcinally wrote:

> I just checked out the PseudoOps and they're interesting. If we could make generic `XXX_PSEUDO(..., %passthru)` PseudoOps, that would be a good path forward. I don't foresee any problems adding an extra passthru operand to the existing patterns, but maybe I'm missing something. If you have any insight, it would be appreciated.

Awesome, thanks for checking this out.

I don't see many issues with adding the extra passthru operand, but it could do with a bit of prototyping; what we've done downstream was very specific to having the false lanes zero'd and that is actually unnecessarily restrictive (but we never really had to care for the generic case). The challenge would be in expanding the pseudo. For some pseudo for that has a reverse variant, e.g. `Dst = fsub_pseudo(Pg, Op1, Op2, Passthru)`

we would for example expand this as follows:

  z0 = fsub_pseudo_S(p0, z1, z2, z3)
    <=>
  sel z0.s, p0.s, z1.s, z3.s
  fsub z0.s, p0/m, z2.s

  z0 = fsub_pseudo_S(p0, z1, z2, z2)
    <=>
  movprfx z0, z2
  fsubr z2.s, p0/m, z1.s

  or a special case for zero'd lanes:

  z0 = fsub_pseudo_S(p0, z1, z2, <zero>)
  movprfx z0, p0/z, z1
  fsub z0.s, p0/m, z2.s

The indexed FMLA instruction has three input operands and this is a case where we can't relaxation the register constraints:

  %dst = Pseudo_FMLA %a, %n, %m, %index
  (to implement: %dst = FMLA %a + %n * %m[%index])

We cannot recover from a register allocation where %index is used as %dst, e.g.

  Z0 = Pseudo_FMLA Z1, Z2, Z0, <index>

(there is no other variant we can use to recover from this, so those will need to retain a constraint that `$Zd = $Zs1`)

> I agree that if these PseudoOps will land upstream, then the op+select solution isn't the right way to go. I'll see if I can build the PseudoOps out a bit.
> 
> That said, the current implementation is a little weird though. Here's the class that something like an FADD_ZERO would use:
> 
>   class SVE_3_Op_Pat_SelZero<ValueType vtd, SDPatternOperator op, ValueType vt1,
>                      ValueType vt2, ValueType vt3, Instruction inst>
>   : Pat<(vtd (vtd (op vt1:$Op1, (vselect vt1:$Op1, vt2:$Op2, (SVEDup0)), vt3:$Op3))),
>         (inst $Op1, $Op2, $Op3)>;
> 
> 
> The select+op DAG reads like we're zeroing the inactive input elements, not the inactive output elements. I see that this directly models the movprfx+op hardware instructions, and that Op2 is an input reg as well as the destination reg, but it still seems counter-intuitive. I don't feel strongly that this needs to change though. So if I'm the only one that thinks it's weird, I'll let it drop.

Yes, that's mostly because of the pattern that the front-end generates for zeroing the false lanes, but it is functionally the same. I guess we could have two patterns to map to the pseudo, depending on how it looks in the IR.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D71432/new/

https://reviews.llvm.org/D71432