[all-commits] [llvm/llvm-project] 3d7bf6: [X86][Costmodel] Improve cost modelling for not-fu...

Thu Oct 14 13:15:08 PDT 2021

  Branch: refs/heads/main
  Home:   https://github.com/llvm/llvm-project
  Commit: 3d7bf6625a6e133d745161c043ae0fdb585ac7c9
      https://github.com/llvm/llvm-project/commit/3d7bf6625a6e133d745161c043ae0fdb585ac7c9
  Author: Roman Lebedev <lebedev.ri at gmail.com>
  Date:   2021-10-14 (Thu, 14 Oct 2021)

  Changed paths:
    M llvm/lib/Target/X86/X86TargetTransformInfo.cpp
    M llvm/test/Analysis/CostModel/X86/interleaved-load-i32-stride-2-indices-0u.ll
    M llvm/test/Analysis/CostModel/X86/interleaved-load-i32-stride-3-indices-01u.ll
    M llvm/test/Analysis/CostModel/X86/interleaved-load-i32-stride-3-indices-0uu.ll
    M llvm/test/Analysis/CostModel/X86/interleaved-load-i32-stride-4-indices-012u.ll
    M llvm/test/Analysis/CostModel/X86/interleaved-load-i32-stride-4-indices-01uu.ll
    M llvm/test/Analysis/CostModel/X86/interleaved-load-i32-stride-4-indices-0uuu.ll
    M llvm/test/Transforms/LoopVectorize/X86/pr48340.ll

  Log Message:
  -----------
  [X86][Costmodel] Improve cost modelling for not-fully-interleaved load

While I've modelled most of the relevant tuples for AVX2,
that only covered fully-interleaved groups.

By definition, an interleaved load of stride N means:
load N*VF elements and shuffle them into N VF-sized vectors,
with the 0th vector containing elements `[0, VF)*stride + 0`,
the 1st vector containing elements `[0, VF)*stride + 1`, and so on.
Example: https://godbolt.org/z/df561Me5E (i64 stride 4 vf 2 => cost 6)
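
The definition above can be sketched as plain C++. This is only an
illustration of the index mapping, not LLVM code: the helper names
`interleavedIndex` and `deinterleave` are hypothetical, and `int` elements
stand in for whatever element type is loaded.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not an LLVM API): position in the wide load of the
// element that ends up in lane `lane` of result vector `member`.
std::size_t interleavedIndex(std::size_t lane, std::size_t stride,
                             std::size_t member) {
  return lane * stride + member;
}

// Split a wide load of stride*VF elements into `stride` vectors of VF
// elements each, following the definition in the commit message.
std::vector<std::vector<int>> deinterleave(const std::vector<int> &wide,
                                           std::size_t stride,
                                           std::size_t vf) {
  assert(wide.size() == stride * vf && "wide load must hold stride*VF elements");
  std::vector<std::vector<int>> out(stride, std::vector<int>(vf));
  for (std::size_t member = 0; member < stride; ++member)
    for (std::size_t lane = 0; lane < vf; ++lane)
      out[member][lane] = wide[interleavedIndex(lane, stride, member)];
  return out;
}
```

For stride 4 and VF 2, a wide load of elements 0..7 yields the vectors
{0, 4}, {1, 5}, {2, 6} and {3, 7}.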

A not-fully-interleaved load is one where not all of these vectors are demanded.
So at worst, we could just pretend that everything is demanded
and discard the non-demanded vectors. This means that the cost
of a not-fully-interleaved group should be no greater than the cost
of the same fully-interleaved group, and perhaps somewhat less.
Examples:
https://godbolt.org/z/a78dK5Geq (i64 stride 4 (indices 012u) vf 2 => cost 4)
https://godbolt.org/z/G91ceo8dM (i64 stride 4 (indices 01uu) vf 2 => cost 2)
https://godbolt.org/z/5joYob9rx (i64 stride 4 (indices 0uuu) vf 2 => cost 1)
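
The upper-bound argument can be checked against the numbers quoted in the
examples. The sketch below is not the patch itself: `upperBoundCost` and
`boundHolds` are hypothetical names, and the table only restates the
i64 stride-4 VF-2 costs listed in this message.

```cpp
#include <cstddef>

// Cost quoted above for the fully-interleaved i64 stride-4 VF-2 group.
constexpr unsigned FullGroupCost = 6;

// Costs quoted above for the not-fully-interleaved variants of that group.
struct PartialGroup {
  const char *indices; // 'u' marks a member that is not demanded
  unsigned quotedCost;
};
constexpr PartialGroup Examples[] = {{"012u", 4}, {"01uu", 2}, {"0uuu", 1}};

// Hypothetical helper: the conservative estimate for any partial group is
// simply the cost of the same group with every member demanded.
unsigned upperBoundCost(unsigned fullGroupCost) { return fullGroupCost; }

// True if none of the quoted partial costs exceeds the conservative estimate.
bool boundHolds() {
  for (const PartialGroup &G : Examples)
    if (G.quotedCost > upperBoundCost(FullGroupCost))
      return false;
  return true;
}
```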

As we have established over the course of the last ~70 patches (wow),
`BaseT::getInterleavedMemoryOpCost()` is absolutely bogus:
it usually overestimates by almost an order of magnitude,
so I would claim that we should at least use the hardcoded costs
of fully-interleaved load groups.

We could go further and adjust them, e.g. by the number of demanded indices,
but then I'm somewhat fearful of underestimating the cost.
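
One possible form of such an adjustment, purely as an assumption and not
something this patch implements, is to scale the fully-interleaved cost by
the fraction of demanded members and round up. The name `scaledCost` is
hypothetical.

```cpp
// Hypothetical adjustment (NOT part of this patch): scale the cost of the
// fully-interleaved group by numDemanded/stride, rounding up.
unsigned scaledCost(unsigned fullGroupCost, unsigned numDemanded,
                    unsigned stride) {
  return (fullGroupCost * numDemanded + stride - 1) / stride;
}
```

For the i64 stride-4 VF-2 group (full cost 6) this gives 5, 3 and 2 for
three, two and one demanded members, against the quoted costs of 4, 2 and 1.
Whether a scaling like this stays above the real cost for other tuples is
not established here.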

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D111174