[all-commits] [llvm/llvm-project] 8fac9e: [X86] `X86TTIImpl::getInterleavedMemoryOpCost()`: ...

Fri Oct 22 06:34:20 PDT 2021

  Branch: refs/heads/main
  Home:   https://github.com/llvm/llvm-project
  Commit: 8fac9e95ade9c771c73b2d7a59316e1e5a2b145d
      https://github.com/llvm/llvm-project/commit/8fac9e95ade9c771c73b2d7a59316e1e5a2b145d
  Author: Roman Lebedev <lebedev.ri at gmail.com>
  Date:   2021-10-22 (Fri, 22 Oct 2021)

  Changed paths:
    M llvm/lib/Target/X86/X86TargetTransformInfo.cpp
    M llvm/test/Analysis/CostModel/X86/interleaved-load-i32-stride-2-indices-0u.ll
    M llvm/test/Analysis/CostModel/X86/interleaved-load-i32-stride-3-indices-01u.ll
    M llvm/test/Analysis/CostModel/X86/interleaved-load-i32-stride-3-indices-0uu.ll
    M llvm/test/Analysis/CostModel/X86/interleaved-load-i32-stride-4-indices-012u.ll
    M llvm/test/Analysis/CostModel/X86/interleaved-load-i32-stride-4-indices-01uu.ll
    M llvm/test/Analysis/CostModel/X86/interleaved-load-i32-stride-4-indices-0uuu.ll

  Log Message:
  -----------
  [X86] `X86TTIImpl::getInterleavedMemoryOpCost()`: scale interleaving cost by the fraction of live members

By definition, interleaving load of stride N means:
load N*VF elements, and shuffle them into N VF-sized vectors,
with 0'th vector containing elements `[0, VF)*stride + 0`,
and 1'th vector containing elements `[0, VF)*stride + 1`.
Example: https://godbolt.org/z/df561Me5E (i64 stride 4 vf 2 => cost 6)

Now, not fully interleaved load, is when not all of these vectors is demanded.
So at worst, we could just pretend that everything is demanded,
and discard the non-demanded vectors. What this means is that the cost
for not-fully-interleaved group should be not greater than the cost
for the same fully-interleaved group, but perhaps somewhat less.
Examples:
https://godbolt.org/z/a78dK5Geq (i64 stride 4 (indices 012u) vf 2 => cost 4)
https://godbolt.org/z/G91ceo8dM (i64 stride 4 (indices 01uu) vf 2 => cost 2)
https://godbolt.org/z/5joYob9rx (i64 stride 4 (indices 0uuu) vf 2 => cost 1)

Right now, for such not-fully-interleaved loads we just use the costs
for fully-interleaved loads. But at least **in general**,
that is obviously overly pessimistic, because **in general**,
not all the shuffles needed to perform the full interleaving
will end up being live.

So what this does, is naively scales the interleaving cost
by the fraction of the live members. I believe this should still result
in the right ballpark cost estimate, although it may be over/under -estimate.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D112307