[PATCH] D150143: [X86] Add X86FixupVectorConstantsPass to fold vector constant loads as broadcasts (WIP)

Simon Pilgrim via Phabricator via llvm-commits <llvm-commits at lists.llvm.org>
Sun May 14 07:07:30 PDT 2023


RKSimon planned changes to this revision.
RKSimon added a comment.

In D150143#4328111 <https://reviews.llvm.org/D150143#4328111>, @goldstein.w.n wrote:

>> The principal aim is to avoid the premature creation of broadcasts that prevent us from folding the loads into another instruction, helping to reduce register pressure.
>
> There are two other problems we have with constant generation that it might also be nice to address with a new pass but that I don't think this approach will be
> able to handle.
>
> 1. There are a variety of cases where we could get "better" constants by re-ordering instructions. This is very difficult to do at any stage of the current DAG lowering process without creating an infinite loop, and would also be very difficult to do after lowering, at the MachineInstr level. I tried to do this for `shl; and` in D141653 <https://reviews.llvm.org/D141653>, but it ran into an infinite loop when I did more robust testing; we could imagine also doing it for vector compares, add/sub, etc.
>
> 2. There are constants like `splat(1)`, `splat(mask)`, `splat(-mask)`, etc. which can often be preferable to build without a load at all (`abs(ALL_ONES)`, `shr(ALL_ONES)`, `shl(ALL_ONES)`, etc.), especially if the fairly common `ALL_ONES` node dominates.
>
> I think both of these goals lend themselves to a pass run before DAG lowering has been completed, but after the common transforms/optimizations.
> I was thinking a pass in `CodeGenAndEmitDAG` after the fourth post-legalization `Combine` but before `DoInstructionSelection` would make the
> most sense for fixing up constant generation (including the aims of this patch).
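
(Re (2): agreed - for example, a load-free `splat(1)` needs just two ALU ops once an all-ones register is available; a hypothetical AVX snippet:

  vpcmpeqd %xmm0, %xmm0, %xmm0    # all-ones, no constant-pool load
  vpsrld   $31, %xmm0, %xmm0      # shr(ALL_ONES) == splat(1)

and the `abs(ALL_ONES)` form works the same way.)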

I think we agree that broadcasting the constants in build-vector lowering is premature, and that working on full-width constant data is going to be a lot easier.

I have been investigating constant rematerialization as part of this patch, and it's looking like it should be handled separately (maybe as part of better general instruction rematerialization handling) - your proposal to investigate fitting rematerialization into CodeGenAndEmitDAG makes sense to me.

But most of the problems I'm seeing with premature constant broadcasting concern folding of the constants to stack and hoisting, so a great many of these cases will need to be addressed much later than DAG. I think I have this proposed pass in the correct location for that, but I will drop the proposal for it to handle constant rematerialization - it should be a true fixup pass for poorly lowered vector constant loads, missed AVX512 broadcast folds, etc.
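
To illustrate the register pressure point with a hypothetical AVX2 snippet (the constant-pool symbols are illustrative): broadcasting during lowering pins the constant in a register,

  vpbroadcastd .LCPI0_0(%rip), %ymm1    # 4-byte constant, but ties up ymm1
  vpaddd       %ymm1, %ymm0, %ymm0

whereas keeping the full-width constant lets it fold straight into the user, at the cost of a wider constant-pool entry:

  vpaddd       .LCPI0_1(%rip), %ymm0, %ymm0    # 32-byte constant folded into the add

The fixup pass can then convert whatever loads remain unfolded back into broadcasts.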

I'll split off another patch.



================
Comment at: llvm/lib/Target/X86/X86FixupVectorConstants.cpp:208
+  case X86::VMOVDQUrm:
+    if (ST->hasInt256())
+      return ConvertToBroadcastInt(X86::VPBROADCASTQrm, X86::VPBROADCASTDrm,
----------------
pengfei wrote:
> What's the reason to check `hasInt256` rather than `hasAVX2`?
There's actually very little consistency between the 2 checks, which are identical - `hasInt256()` is just an alias for `hasAVX2()` in X86Subtarget.h. I try to use `hasInt256()` when identifying cases where we're choosing between AVX1 and AVX2 lowering for 256-bit integer ops - but I can change it to `hasAVX2()`.


================
Comment at: llvm/lib/Target/X86/X86FixupVectorConstants.cpp:211
+                                   X86::VPBROADCASTWrm, X86::VPBROADCASTBrm);
+    return false; // TODO: Add AVX VMOVDDUP/VBROADCASTSS.
+  case X86::VMOVDQAYrm:
----------------
pengfei wrote:
> Should avoid FP instruction for integer?
Interestingly, DAG already uses MOVDDUP in some cases to broadcast integers, but I agree we should avoid it in most cases to keep integer ops in the integer domain.
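
For reference, both of the following splat the same 64-bit constant, but VMOVDDUP is an FP-domain instruction, so feeding integer ops with it may incur a domain-crossing penalty on some targets (illustrative snippet):

  vmovddup     .LCPI0_0(%rip), %xmm0    # FP domain: xmm0 = mem[0,0]
  vpbroadcastq .LCPI0_0(%rip), %xmm0    # integer domain (requires AVX2)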


================
Comment at: llvm/lib/Target/X86/X86FixupVectorConstants.cpp:227
+  case X86::VMOVUPSrm:
+    return ConvertToBroadcastFP(X86::VMOVDDUPrm, X86::VBROADCASTSSrm);
+  case X86::VMOVAPDYrm:
----------------
pengfei wrote:
> Why don't use `VBROADCASTSD` here?
An xmm version of VBROADCASTSD doesn't exist - it would have exactly the same behaviour as VMOVDDUP.
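
i.e. the 64-bit element broadcast is only encodable with a 256-bit destination, and for 128-bit VMOVDDUP produces the identical splat (illustrative snippet):

  vbroadcastsd .LCPI0_0(%rip), %ymm0    # valid: 256-bit destination only
  vmovddup     .LCPI0_0(%rip), %xmm0    # 128-bit equivalent: xmm0 = mem[0,0]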


================
Comment at: llvm/test/CodeGen/X86/combine-concatvectors.ll:67-68
 ; AVX2-NEXT:    vmovups %ymm0, 48296(%rax)
-; AVX2-NEXT:    vmovlps %xmm0, 47372(%rax)
+; AVX2-NEXT:    vmovsd {{.*#+}} xmm0 = mem[0],zero
+; AVX2-NEXT:    vmovsd %xmm0, 47372(%rax)
 ; AVX2-NEXT:    vzeroupper
----------------
pengfei wrote:
> Regression?
Yes - we previously reused the ymm0 splat (storing its low half with vmovlps) to avoid a second load.


================
Comment at: llvm/test/CodeGen/X86/horizontal-reduce-umin.ll:1170-1171
 ; X86-AVX2-NEXT:    vextractf128 $1, %ymm0, %xmm1
+; X86-AVX2-NEXT:    vmovddup {{.*#+}} xmm2 = [9223372036854775808,9223372036854775808]
+; X86-AVX2-NEXT:    ## xmm2 = mem[0,0]
 ; X86-AVX2-NEXT:    vxorpd %xmm2, %xmm0, %xmm3
----------------
pengfei wrote:
> This looks like regression too.
Yes, we have some peepholes in DAG to reuse broadcasts of different widths - I'll address this in a later iteration.
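
i.e. the kind of reuse DAG currently catches - the low half of a wider broadcast is already the narrower broadcast of the same value (illustrative snippet):

  vbroadcastsd .LCPI0_0(%rip), %ymm2    # ymm2 = splat(C)
  vxorpd       %xmm2, %xmm0, %xmm3      # reuse xmm2 (low 128 bits) rather than reloading C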


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D150143/new/

https://reviews.llvm.org/D150143


