[PATCH] D137913: [X86] Rewrite `getScalarizationOverhead()`

Sun Nov 13 12:29:37 PST 2022

lebedev.ri created this revision.
lebedev.ri added a reviewer: RKSimon.
lebedev.ri added a project: LLVM.
Herald added subscribers: pengfei, arphaman, hiraditya.
Herald added a project: All.
lebedev.ri requested review of this revision.
Herald added a subscriber: pcwang-thead.

All of our insert/extract ops work on 128-bit lanes.

For `Insert`, we need to extract the affected 128-bit lane,
unless it's being fully overwritten (FIXME: do we need to be
careful about legalization-induced padding that we obviously don't demand?),
perform insertions, and then insert the 128-bit lane back.

But hold on. If we are operating on a 256-bit legal vector,
and thus have two 128-bit subvectors, and are fully overwriting them both,
we don't actually need to insert *both* subvectors,
only the second one, into the implicitly-widened first one.
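
To make the lane bookkeeping above concrete, here is a minimal toy model. It is
not the code from this patch: `LaneOps`, `popcount64()` and `countInsertOps()`
are hypothetical names, the result is a count of operations rather than a real
cost, and extending the "skip the first lane insert" rule to vectors with more
than two lanes is an assumption on my part.

```cpp
#include <cstdint>

// Operation counts in this toy model (not LLVM's cost units).
struct LaneOps {
  unsigned LaneExtracts;   // 128-bit subvector extractions
  unsigned ElementInserts; // scalar insertions performed inside a lane
  unsigned LaneInserts;    // 128-bit subvector insertions
};

constexpr unsigned popcount64(std::uint64_t V) {
  unsigned N = 0;
  for (; V != 0; V &= V - 1) // clear the lowest set bit each iteration
    ++N;
  return N;
}

// Demanded: bit I is set if element I is being inserted.
// Assumes NumLanes * EltsPerLane <= 64.
constexpr LaneOps countInsertOps(std::uint64_t Demanded, unsigned NumLanes,
                                 unsigned EltsPerLane) {
  LaneOps Ops{0, 0, 0};
  const unsigned NumElts = NumLanes * EltsPerLane;
  const std::uint64_t LaneMask =
      EltsPerLane >= 64 ? ~0ULL : ((1ULL << EltsPerLane) - 1);
  const std::uint64_t AllMask =
      NumElts >= 64 ? ~0ULL : ((1ULL << NumElts) - 1);
  const bool AllOverwritten = (Demanded & AllMask) == AllMask;

  for (unsigned Lane = 0; Lane != NumLanes; ++Lane) {
    const std::uint64_t InLane =
        (Demanded >> (Lane * EltsPerLane)) & LaneMask;
    if (InLane == 0)
      continue; // untouched lane: nothing to do
    Ops.ElementInserts += popcount64(InLane);
    if (NumLanes == 1)
      continue; // a single 128-bit vector needs no subvector ops
    if (InLane != LaneMask)
      ++Ops.LaneExtracts; // partially overwritten: must read the old lane
    // If the whole vector is overwritten, lane 0 is the implicitly-widened
    // base that the other lanes are inserted into.
    if (!(AllOverwritten && Lane == 0))
      ++Ops.LaneInserts;
  }
  return Ops;
}
```

For an `<8 x i32>` (two lanes of four elements), `countInsertOps(0xFF, 2, 4)`
reports 8 element inserts, no lane extracts, and a single lane insert, while
`countInsertOps(0x10, 2, 4)` reports one of each.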

`getShuffleCost(TTI::SK_InsertSubvector)` notes:

  // Subvector insertions are cheap if the subvectors are aligned.
  // Note that in general, the insertion starting at the beginning of a vector
  // isn't free, because we need to preserve the rest of the wide vector.

So as far as I can tell, we didn't account for that.

I was hoping this would allow vectorization at a higher VF in one case I looked at,
but the subvector insertion cost still advises against it.

The change for `Extract` is NFC and is for consistency only:
I wanted to get rid of that weird explicit discounting of insertion of the 0'th element,
since the general code should already deal with that.

There is a related, subtle "gotcha" in `getVectorInstrCost(InsertElement)`:

  if (Index == 0) {
    // Floating point scalars are already located in index #0.
    // Many insertions to #0 can fold away for scalar fp-ops, so let's assume
    // true for all.
    if (ScalarType->isFloatingPointTy())
      return RegisterFileMoveCost;

    // Assume movd/movq XMM -> GPR is relatively cheap on all targets.
    if (ScalarType->isIntegerTy() && Opcode == Instruction::ExtractElement)
      return 1 + RegisterFileMoveCost;
  }

If we implicitly widen to i32/i64, and cheaply insert at index 0,
what happens to the elements "overwritten" by padding?
This is fine if they are also going to be overwritten,
but the function does not know that...
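
One way to phrase the missing knowledge is as a predicate over the demanded
elements. The sketch below is purely illustrative and not part of this patch:
`widenedInsertClobbersLiveElts()` is a hypothetical helper, and it assumes that
a widened insert at index 0 puts padding into elements 1 through
`WidenedElts - 1` only.

```cpp
#include <cstdint>

// Returns true if widening an insert at element 0 so that it covers
// WidenedElts elements would put padding into an element that is NOT itself
// going to be overwritten (i.e. is not set in Demanded).
// Demanded: bit I is set if element I is being inserted.
constexpr bool widenedInsertClobbersLiveElts(std::uint64_t Demanded,
                                             unsigned WidenedElts) {
  const std::uint64_t Covered =
      WidenedElts >= 64 ? ~0ULL : ((1ULL << WidenedElts) - 1);
  const std::uint64_t Padding = Covered & ~1ULL; // element 0 gets the value
  return (Padding & ~Demanded) != 0;
}
```

For i8 elements widened to i32 (`WidenedElts == 4`), the predicate is false
when elements 0 through 3 are all demanded (`0xF`) and true when only element 0
is (`0x1`).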

Test update has been brought to you by `find /repositories/llvm-project/llvm/test/Analysis/CostModel/X86/ -iname \*.ll | xargs -L0 -P32 /repositories/llvm-project/llvm/utils/update_analyze_test_checks.py --opt-binary ./bin/opt`

Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D137913

Files:
  llvm/lib/Target/X86/X86TargetTransformInfo.cpp
  llvm/test/Analysis/CostModel/X86/arith-fp-codesize.ll
  llvm/test/Analysis/CostModel/X86/arith-fp-latency.ll
  llvm/test/Analysis/CostModel/X86/arith-fp-sizelatency.ll
  llvm/test/Analysis/CostModel/X86/arith-fp.ll
  llvm/test/Analysis/CostModel/X86/bitreverse-codesize.ll
  llvm/test/Analysis/CostModel/X86/bitreverse-latency.ll
  llvm/test/Analysis/CostModel/X86/bitreverse-sizelatency.ll
  llvm/test/Analysis/CostModel/X86/fmaxnum-size-latency.ll
  llvm/test/Analysis/CostModel/X86/fminnum-size-latency.ll
  llvm/test/Analysis/CostModel/X86/fptoi_sat.ll
  llvm/test/Analysis/CostModel/X86/fptosi.ll
  llvm/test/Analysis/CostModel/X86/fptoui.ll
  llvm/test/Analysis/CostModel/X86/gather-i16-with-i8-index.ll
  llvm/test/Analysis/CostModel/X86/gather-i32-with-i8-index.ll
  llvm/test/Analysis/CostModel/X86/gather-i64-with-i8-index.ll
  llvm/test/Analysis/CostModel/X86/gather-i8-with-i8-index.ll
  llvm/test/Analysis/CostModel/X86/interleaved-load-f32-stride-2.ll
  llvm/test/Analysis/CostModel/X86/interleaved-load-f32-stride-3.ll
  llvm/test/Analysis/CostModel/X86/interleaved-load-f32-stride-4.ll
  llvm/test/Analysis/CostModel/X86/interleaved-load-f32-stride-5.ll
  llvm/test/Analysis/CostModel/X86/interleaved-load-f32-stride-6.ll
  llvm/test/Analysis/CostModel/X86/interleaved-load-f32-stride-7.ll
  llvm/test/Analysis/CostModel/X86/interleaved-load-f32-stride-8.ll
  llvm/test/Analysis/CostModel/X86/interleaved-load-i16-stride-2.ll
  llvm/test/Analysis/CostModel/X86/interleaved-load-i16-stride-3.ll
  llvm/test/Analysis/CostModel/X86/interleaved-load-i16-stride-4.ll
  llvm/test/Analysis/CostModel/X86/interleaved-load-i16-stride-5.ll
  llvm/test/Analysis/CostModel/X86/interleaved-load-i16-stride-6.ll
  llvm/test/Analysis/CostModel/X86/interleaved-load-i16-stride-7.ll
  llvm/test/Analysis/CostModel/X86/interleaved-load-i16-stride-8.ll
  llvm/test/Analysis/CostModel/X86/interleaved-load-i32-stride-2-indices-0u.ll
  llvm/test/Analysis/CostModel/X86/interleaved-load-i32-stride-2.ll
  llvm/test/Analysis/CostModel/X86/interleaved-load-i32-stride-3-indices-01u.ll
  llvm/test/Analysis/CostModel/X86/interleaved-load-i32-stride-3-indices-0uu.ll
  llvm/test/Analysis/CostModel/X86/interleaved-load-i32-stride-3.ll
  llvm/test/Analysis/CostModel/X86/interleaved-load-i32-stride-4-indices-012u.ll
  llvm/test/Analysis/CostModel/X86/interleaved-load-i32-stride-4-indices-01uu.ll
  llvm/test/Analysis/CostModel/X86/interleaved-load-i32-stride-4-indices-0uuu.ll
  llvm/test/Analysis/CostModel/X86/interleaved-load-i32-stride-4.ll
  llvm/test/Analysis/CostModel/X86/interleaved-load-i32-stride-5.ll
  llvm/test/Analysis/CostModel/X86/interleaved-load-i32-stride-6.ll
  llvm/test/Analysis/CostModel/X86/interleaved-load-i32-stride-7.ll
  llvm/test/Analysis/CostModel/X86/interleaved-load-i32-stride-8.ll
  llvm/test/Analysis/CostModel/X86/interleaved-load-i64-stride-2.ll
  llvm/test/Analysis/CostModel/X86/interleaved-load-i64-stride-3.ll
  llvm/test/Analysis/CostModel/X86/interleaved-load-i64-stride-4.ll
  llvm/test/Analysis/CostModel/X86/interleaved-load-i64-stride-5.ll
  llvm/test/Analysis/CostModel/X86/interleaved-load-i64-stride-6.ll
  llvm/test/Analysis/CostModel/X86/interleaved-load-i64-stride-7.ll
  llvm/test/Analysis/CostModel/X86/interleaved-load-i64-stride-8.ll
  llvm/test/Analysis/CostModel/X86/interleaved-load-i8-stride-2.ll
  llvm/test/Analysis/CostModel/X86/interleaved-load-i8-stride-3.ll
  llvm/test/Analysis/CostModel/X86/interleaved-load-i8-stride-4.ll
  llvm/test/Analysis/CostModel/X86/interleaved-load-i8-stride-5.ll
  llvm/test/Analysis/CostModel/X86/interleaved-load-i8-stride-6.ll
  llvm/test/Analysis/CostModel/X86/interleaved-load-i8-stride-7.ll
  llvm/test/Analysis/CostModel/X86/interleaved-load-i8-stride-8.ll
  llvm/test/Analysis/CostModel/X86/interleaved-store-f32-stride-2.ll
  llvm/test/Analysis/CostModel/X86/interleaved-store-f32-stride-3.ll
  llvm/test/Analysis/CostModel/X86/interleaved-store-f32-stride-4.ll
  llvm/test/Analysis/CostModel/X86/interleaved-store-f32-stride-5.ll
  llvm/test/Analysis/CostModel/X86/interleaved-store-f32-stride-6.ll
  llvm/test/Analysis/CostModel/X86/interleaved-store-f32-stride-7.ll
  llvm/test/Analysis/CostModel/X86/interleaved-store-f32-stride-8.ll
  llvm/test/Analysis/CostModel/X86/interleaved-store-i16-stride-2.ll
  llvm/test/Analysis/CostModel/X86/interleaved-store-i16-stride-3.ll
  llvm/test/Analysis/CostModel/X86/interleaved-store-i16-stride-4.ll
  llvm/test/Analysis/CostModel/X86/interleaved-store-i16-stride-5.ll
  llvm/test/Analysis/CostModel/X86/interleaved-store-i16-stride-6.ll
  llvm/test/Analysis/CostModel/X86/interleaved-store-i16-stride-7.ll
  llvm/test/Analysis/CostModel/X86/interleaved-store-i16-stride-8.ll
  llvm/test/Analysis/CostModel/X86/interleaved-store-i32-stride-2.ll
  llvm/test/Analysis/CostModel/X86/interleaved-store-i32-stride-3.ll
  llvm/test/Analysis/CostModel/X86/interleaved-store-i32-stride-4.ll
  llvm/test/Analysis/CostModel/X86/interleaved-store-i32-stride-5.ll
  llvm/test/Analysis/CostModel/X86/interleaved-store-i32-stride-6.ll
  llvm/test/Analysis/CostModel/X86/interleaved-store-i32-stride-7.ll
  llvm/test/Analysis/CostModel/X86/interleaved-store-i32-stride-8.ll
  llvm/test/Analysis/CostModel/X86/interleaved-store-i64-stride-2.ll
  llvm/test/Analysis/CostModel/X86/interleaved-store-i64-stride-3.ll
  llvm/test/Analysis/CostModel/X86/interleaved-store-i64-stride-4.ll
  llvm/test/Analysis/CostModel/X86/interleaved-store-i64-stride-5.ll
  llvm/test/Analysis/CostModel/X86/interleaved-store-i64-stride-6.ll
  llvm/test/Analysis/CostModel/X86/interleaved-store-i64-stride-7.ll
  llvm/test/Analysis/CostModel/X86/interleaved-store-i64-stride-8.ll
  llvm/test/Analysis/CostModel/X86/interleaved-store-i8-stride-2.ll
  llvm/test/Analysis/CostModel/X86/interleaved-store-i8-stride-3.ll
  llvm/test/Analysis/CostModel/X86/interleaved-store-i8-stride-4.ll
  llvm/test/Analysis/CostModel/X86/interleaved-store-i8-stride-5.ll
  llvm/test/Analysis/CostModel/X86/interleaved-store-i8-stride-6.ll
  llvm/test/Analysis/CostModel/X86/interleaved-store-i8-stride-7.ll
  llvm/test/Analysis/CostModel/X86/interleaved-store-i8-stride-8.ll
  llvm/test/Analysis/CostModel/X86/masked-gather-i64-with-i8-index.ll
  llvm/test/Analysis/CostModel/X86/masked-interleaved-load-i16.ll
  llvm/test/Analysis/CostModel/X86/masked-intrinsic-cost-inseltpoison.ll
  llvm/test/Analysis/CostModel/X86/masked-intrinsic-cost.ll
  llvm/test/Analysis/CostModel/X86/powi.ll
  llvm/test/Analysis/CostModel/X86/shuffle-replication-i1.ll
  llvm/test/Analysis/CostModel/X86/shuffle-replication-i16.ll
  llvm/test/Analysis/CostModel/X86/shuffle-replication-i32.ll
  llvm/test/Analysis/CostModel/X86/shuffle-replication-i64.ll
  llvm/test/Analysis/CostModel/X86/shuffle-replication-i8.ll
  llvm/test/Analysis/CostModel/X86/sitofp.ll
  llvm/test/Analysis/CostModel/X86/trunc.ll
  llvm/test/Transforms/LoopVectorize/X86/vector_ptr_load_store.ll
  llvm/test/Transforms/SLPVectorizer/X86/vectorize-reorder-reuse.ll