[PATCH] D80260: [WIP][SVE] Prototype for general merging MOVPRFX support.

Cameron McInally via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Fri May 22 15:02:15 PDT 2020


cameron.mcinally added a comment.

Thanks, Sander.

I have a nagging feeling that I forced this implementation on you. My intention was to use this patch as an intuition pump, not necessarily as the path forward. I don't have a lot of intuition around MOVPRFX today, so I'm hoping for guidance on the best way to proceed.

I definitely see the desired flexibility of having a pre-regalloc lowering pass. If you think that's the better solution, I'll work on it. I just don't have a strong opinion on where in the pipeline that pass should live.

Also, I plan to look at your zeroing Diff (D80410 <https://reviews.llvm.org/D80410>) next week. Thanks for that.



================
Comment at: llvm/lib/Target/AArch64/AArch64ExpandPseudoInsts.cpp:382
 ///
 /// Note that this can only be done for _ZERO or _UNDEF variants where
 /// we can guarantee the false lanes to be zeroed (by implementing this)
----------------
sdesmalen wrote:
> This comment suggests that this is only possible for _ZERO and _UNDEF variants, but I'm not sure if that comment is still correct.
Massaged this a bit. Let me know if anything sounds off. 
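
To make the distinction concrete, the two patterns involved look roughly like this (sketch only; the function names are made up, and the intrinsic signature matches the example further down):

  define <vscale x 4 x float> @fsub_zeroing(<vscale x 4 x i1> %p, <vscale x 4 x float> %a, <vscale x 4 x float> %b) {
    ; _ZERO-style: false lanes of the destructive operand are known to be zero
    %a_z = select <vscale x 4 x i1> %p, <vscale x 4 x float> %a, <vscale x 4 x float> zeroinitializer
    %r = call <vscale x 4 x float> @llvm.aarch64.sve.fsub.nxv4f32(<vscale x 4 x i1> %p, <vscale x 4 x float> %a_z, <vscale x 4 x float> %b)
    ret <vscale x 4 x float> %r
  }

  define <vscale x 4 x float> @fsub_merging(<vscale x 4 x i1> %p, <vscale x 4 x float> %a, <vscale x 4 x float> %b, <vscale x 4 x float> %passthru) {
    ; merging: false lanes must take the value of %passthru
    %a_m = select <vscale x 4 x i1> %p, <vscale x 4 x float> %a, <vscale x 4 x float> %passthru
    %r = call <vscale x 4 x float> @llvm.aarch64.sve.fsub.nxv4f32(<vscale x 4 x i1> %p, <vscale x 4 x float> %a_m, <vscale x 4 x float> %b)
    ret <vscale x 4 x float> %r
  }

  declare <vscale x 4 x float> @llvm.aarch64.sve.fsub.nxv4f32(<vscale x 4 x i1>, <vscale x 4 x float>, <vscale x 4 x float>)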


================
Comment at: llvm/lib/Target/AArch64/AArch64ExpandPseudoInsts.cpp:402
   bool UseRev = false;
-  unsigned PredIdx, DOPIdx, SrcIdx;
+  unsigned PredIdx, DOPIdx, SrcIdx, PassIdx;
   switch (DType) {
----------------
sdesmalen wrote:
> Can you update the description of `expand_DestructiveOp` to describe the new style of pseudos like `FSUB_ZPZZ_MERGE_B`?
Updated. Please let me know if that's what you had in mind. Advertising copy is not my strong suit...


================
Comment at: llvm/lib/Target/AArch64/AArch64ExpandPseudoInsts.cpp:408
       // FSUB Zd, Pg, Zs1, Zd  ==> FSUBR   Zd, Pg/m, Zd, Zs1
-      std::tie(PredIdx, DOPIdx, SrcIdx) = std::make_tuple(1, 3, 2);
+      std::tie(PredIdx, DOPIdx, SrcIdx, PassIdx) = std::make_tuple(1, 3, 2, 4);
       UseRev = true;
----------------
sdesmalen wrote:
> nit: `s/PassIdx/PassthruIdx/` ?
Updated, but I just realized that this isn't needed yet. With DstReg tied to the Passthru reg, I can replace `MI.getOperand(PassIdx).getReg()` with `DstReg` below.

I think I'll leave it for now in case the untied/zeroing case needs it. Can tear it out later if it's not needed.


================
Comment at: llvm/lib/Target/AArch64/SVEInstrFormats.td:471
+    let FalseLanes = flags;
+    let Constraints = "$Zd = $Zpt";
+  }
----------------
sdesmalen wrote:
> I'm happy to keep the constraint for now, but I don't think it is strictly necessary, because if `$Zd != $Zpt` it is still possible to generate a valid instruction sequence.
> 
> For:
> 
>   define <vscale x 4 x float> @foo(<vscale x 4 x i1> %p, <vscale x 4 x float> %z0, <vscale x 4 x float> %z1, <vscale x 4 x float> %passthru) {
>     %z0_in = select <vscale x 4 x i1> %p, <vscale x 4 x float> %z0, <vscale x 4 x float> %passthru
>     %sub = call <vscale x 4 x float> @llvm.aarch64.sve.fsub.nxv4f32(<vscale x 4 x i1> %p, <vscale x 4 x float> %z0_in, <vscale x 4 x float> %z0)
>     ret <vscale x 4 x float> %sub
>   }
> 
> LLVM will generate:
> 
>   movprfx z2.s, p0/m, z0.s
>   fsub    z2.s, p0/m, z2.s, z0.s
>   mov     z0.d, z2.d
>   ret
> 
> Where the last `mov` is needed because the register allocator chose to allocate the pseudo as:
> 
>   z2 = FSUB_PSEUDO p0, z0, z0, z2
> 
> with the result operand z2 tied to the merge value. I think we could alternatively fall back on using a `select` instruction if the register allocator didn't have the tied operand restriction and instead generate:
> 
>   sel z0.s, p0, z0.s, z2.s
>   fsub z0.s, p0/m, z0.s, z0.s
>   ret
> 
Ah, ok, I think I see the problem now. In this case we can clobber z0 since it only has the one use. I'll have to explore this further. 

I think I'll look at the ConditionalEarlyClobberPass to see if this case can be guarded against. If you know it can't, or are aware of other cases that may be problematic, please let me know.
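
To make it concrete, the case I want to guard against would look something like this (sketch only, extending your example with a second use of %z0):

  define <vscale x 4 x float> @foo2(<vscale x 4 x i1> %p, <vscale x 4 x float> %z0, <vscale x 4 x float> %passthru) {
    %z0_in = select <vscale x 4 x i1> %p, <vscale x 4 x float> %z0, <vscale x 4 x float> %passthru
    %sub = call <vscale x 4 x float> @llvm.aarch64.sve.fsub.nxv4f32(<vscale x 4 x i1> %p, <vscale x 4 x float> %z0_in, <vscale x 4 x float> %z0)
    ; the extra use keeps %z0 live past the fsub, so its register can't simply be clobbered
    %add = fadd <vscale x 4 x float> %sub, %z0
    ret <vscale x 4 x float> %add
  }

  declare <vscale x 4 x float> @llvm.aarch64.sve.fsub.nxv4f32(<vscale x 4 x i1>, <vscale x 4 x float>, <vscale x 4 x float>)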


================
Comment at: llvm/test/CodeGen/AArch64/sve-intrinsics-int-arith-merging.ll:62
+; CHECK-NEXT: ret
+  %a_z = select <vscale x 16 x i1> %pg, <vscale x 16 x i8> %a, <vscale x 16 x i8> %passthru
+  %out = call <vscale x 16 x i8> @llvm.aarch64.sve.add.nxv16i8(<vscale x 16 x i1> %pg,
----------------
sdesmalen wrote:
> nit: `the %a_z` name implies that the false lanes are zeroed, perhaps `%a_m` is more appropriate.
Good catch. Fixed.


================
Comment at: llvm/test/CodeGen/AArch64/sve-intrinsics-int-arith-merging.ll:76
+; CHECK-NEXT: ret
+  %a_z = select <vscale x 8 x i1> %pg, <vscale x 8 x i16> %a, <vscale x 8 x i16> %passthru
+  %out = call <vscale x 8 x i16> @llvm.aarch64.sve.add.nxv8i16(<vscale x 8 x i1> %pg,
----------------
sdesmalen wrote:
> I think we should add tests where the %passthru value is `%b` as that should cause it to use the reverse instruction for e.g. `sub -> subr` (or swap the operands in case of `add`).
Added.

Also added a guard to prevent `MOVPRFX Zx, p0/m, Zx` from being generated. We might not want that guard for the zeroing (p0/z) case, but I haven't hit that yet.
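
The new tests look roughly like this (sketch only; the exact names in the test file may differ):

  define <vscale x 16 x i8> @sub_i8_rev_passthru(<vscale x 16 x i1> %pg, <vscale x 16 x i8> %a, <vscale x 16 x i8> %b) {
    ; %b is both the second source and the passthru, so the expected expansion is
    ; a reverse instruction (subr) with %b as the destructive operand, no movprfx
    %a_m = select <vscale x 16 x i1> %pg, <vscale x 16 x i8> %a, <vscale x 16 x i8> %b
    %out = call <vscale x 16 x i8> @llvm.aarch64.sve.sub.nxv16i8(<vscale x 16 x i1> %pg, <vscale x 16 x i8> %a_m, <vscale x 16 x i8> %b)
    ret <vscale x 16 x i8> %out
  }

  declare <vscale x 16 x i8> @llvm.aarch64.sve.sub.nxv16i8(<vscale x 16 x i1>, <vscale x 16 x i8>, <vscale x 16 x i8>)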


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D80260/new/

https://reviews.llvm.org/D80260




