[PATCH] D71432: [AArch64][SVE] Proposal to use op+select to match scalable predicated operations

Sander de Smalen via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Tue Dec 17 15:48:50 PST 2019


sdesmalen added a comment.

Hi @cameron.mcinally, thanks for sharing this patch!

For the purpose of merging a select with an unpredicated operation into a predicated operation, this is indeed sufficient. But I wonder whether we need something a bit more elaborate if the intended purpose is to more cheaply select a value for the false lanes (the passthru).

While we don't support the generic case in our downstream compiler, we do have special support for the cases where the false lanes are zeroed or `undef`. Using the predicated MOVPRFX instruction, the false lanes can be zeroed relatively cheaply, e.g.:

  movprfx z0.s, p0/z, z1.s           // z0 = p0 ? z1 : 0
  fsub    z0.s, p0/m, z0.s, z2.s     // active lanes: z0 = z1 - z2, inactive lanes stay 0

This avoids having to emit an explicit sequence of a `splat` and `select / predicated mov` to zero the false lanes. We match the `operation + select` into a Pseudo instruction (e.g. `FSUB_ZERO` or `FSUB_UNDEF`) that is expanded after register allocation (in the AArch64ExpandPseudoInsts pass) into the appropriate instructions.
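To give a rough idea of what such a match could look like, here is a minimal TableGen sketch for the single-precision zeroing case; the pseudo name `FSUB_ZERO_S` and the zero-splat fragment `SVEDup0` are illustrative, not necessarily what exists in-tree:

  // Sketch only: match select(Pg, fsub(Op1, Op2), 0) into a pseudo with no
  // tied-operand constraint, to be expanded after regalloc into
  // movprfx (zeroing) + fsub.
  def : Pat<(nxv4f32 (vselect nxv4i1:$Pg,
                              (fsub nxv4f32:$Op1, nxv4f32:$Op2),
                              (SVEDup0))),
            (FSUB_ZERO_S $Pg, $Op1, $Op2)>;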

Even if we don't care about selecting a passthru value for the false lanes, there is still value in creating the Pseudo. The lack of a tied-operand constraint for the Pseudo gives the register allocator more freedom to come up with a better allocation. By exploiting the commutativity of some instructions, or by expanding to their reversed variants (like SUBR vs SUB), we can avoid a number of unnecessary register moves.
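As a hedged sketch of the reversed-variant point (the `FSUB_UNDEF` name and register choices are just for illustration): if the register allocator assigns the result to the register holding the second source operand, the expansion can pick FSUBR instead of inserting a copy:

  // Pseudo after register allocation:  z2 = FSUB_UNDEF p0, z1, z2
  // Plain FSUB would need the destination tied to the first source, i.e. a
  // copy of z1 into z2 first, clobbering the second source. The reversed
  // variant keeps the destination tied to the second source instead:
  fsubr z2.s, p0/m, z2.s, z1.s       // z2 = z1 - z2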

We've been thinking about ways to make this support more generic, so that it covers the general use-case of:

  %Res = FSUB_PSEUDO(%Pred, %Op1, %Op2, %Passthru)

Depending on the value of `%Passthru`, this can be expanded to use a `movprfx`, or in the worst case an explicit `select`.
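To give a rough idea of the possible expansions (register choices are purely illustrative, with `%Op1` in `z1` and `%Op2` in `z2`):

  // %Passthru == %Op1: the ordinary merging form, with the result
  // coalesced into the %Op1 register.
  fsub    z1.s, p0/m, z1.s, z2.s

  // %Passthru == zero: zero the false lanes with a predicated movprfx.
  movprfx z0.s, p0/z, z1.s
  fsub    z0.s, p0/m, z0.s, z2.s

  // %Passthru is some other value (say z3): fall back to an explicit select
  // that places %Op1 in the active lanes and the passthru in the false
  // lanes, then use the ordinary merging form.
  sel     z0.s, p0, z1.s, z3.s
  fsub    z0.s, p0/m, z0.s, z2.s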

Ideally we'd emit a Pseudo for most operations, so that this becomes a generic mechanism that natively supports the `passthru` value and benefits from better register allocation.

A bit of prototyping would be required though, as our downstream compiler only covers a limited use-case. We've also had to deal with some corner-cases, but I'd need to refresh my memory on the details before I can comment on them. I'll try to dig up some more details!



================
Comment at: llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td:163
 
-  defm FADD_ZPmZ   : sve_fp_2op_p_zds<0b0000, "fadd",   int_aarch64_sve_fadd>;
-  defm FSUB_ZPmZ   : sve_fp_2op_p_zds<0b0001, "fsub",   int_aarch64_sve_fsub>;
-  defm FMUL_ZPmZ   : sve_fp_2op_p_zds<0b0010, "fmul",   int_aarch64_sve_fmul>;
+  defm FADD_ZPmZ   : sve_fp_2op_p_zds_pred<0b0000, "fadd", fadd, int_aarch64_sve_fadd>;
+  defm FSUB_ZPmZ   : sve_fp_2op_p_zds_pred<0b0001, "fsub", fsub, int_aarch64_sve_fsub>;
----------------
nit: the `_pred` isn't needed, as this is already implied by the `_p`.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D71432/new/

https://reviews.llvm.org/D71432




