[all-commits] [llvm/llvm-project] febbf9: [RISCV] Match vcompress during shuffle lowering (#...
Philip Reames via All-commits
all-commits at lists.llvm.org
Wed Nov 27 13:23:40 PST 2024
Branch: refs/heads/main
Home: https://github.com/llvm/llvm-project
Commit: febbf9105f7101d7124e802e87d8303237b64a80
https://github.com/llvm/llvm-project/commit/febbf9105f7101d7124e802e87d8303237b64a80
Author: Philip Reames <preames at rivosinc.com>
Date: 2024-11-27 (Wed, 27 Nov 2024)
Changed paths:
M llvm/lib/Target/RISCV/RISCVISelLowering.cpp
M llvm/test/CodeGen/RISCV/rvv/fixed-vectors-fp-buildvec.ll
M llvm/test/CodeGen/RISCV/rvv/fixed-vectors-fp-shuffles.ll
M llvm/test/CodeGen/RISCV/rvv/fixed-vectors-int-shuffles.ll
M llvm/test/CodeGen/RISCV/rvv/fixed-vectors-interleaved-access.ll
M llvm/test/CodeGen/RISCV/rvv/fixed-vectors-shuffle-deinterleave.ll
M llvm/test/CodeGen/RISCV/rvv/fixed-vectors-shufflevector-vnsrl.ll
M llvm/test/CodeGen/RISCV/rvv/vector-deinterleave-fixed.ll
Log Message:
-----------
[RISCV] Match vcompress during shuffle lowering (#117748)
This change matches a subset of vcompress patterns during shuffle
lowering. The subset implemented requires a contiguous prefix of
demanded elements followed by undefs. This subset was chosen for two
reasons: 1) deciding which elements to spuriously demand is a
non-obvious problem, and 2) my first several attempts at implementing
the general case were buggy. I decided to go with the simple case to
start with.
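For illustration only (this function is made up, not taken from the
patch's test files), a single-source shuffle of the kind described above
has a strictly increasing prefix of source indices followed by undefs:

  define <8 x i32> @compress_prefix(<8 x i32> %v) {
    %s = shufflevector <8 x i32> %v, <8 x i32> poison, <8 x i32> <i32 0, i32 2, i32 5, i32 7, i32 undef, i32 undef, i32 undef, i32 undef>
    ret <8 x i32> %s
  }

The intent is that a mask like this can be matched to a single
vcompress.vm (with mask bits 0, 2, 5, and 7 set) rather than a
vrgather.vv with an index-vector constant.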
vcompress scales better with LMUL than a general vrgather and, at least
on the SpaceMit X60, has higher throughput even at m1. It also has the
advantage of requiring smaller vector constants, at one bit per element
as opposed to vrgather, which needs a minimum of 8 bits per element. The
downside to using vcompress is that we can't fold a vselect into it, as
there is no masked vcompress variant.
For reference, here are the relevant throughputs from camel-cdr's data
table on BP3 (X60):
                                 m1    m2    m4     m8
vrgather.vv  v8,v16,v24         4.0  16.0  64.0  256.0
vcompress.vm v8,v16,v24         3.0  10.0  36.0  136.0
vmerge.vvm   v8,v16,v24,v0      2.0   4.0   8.0   16.0
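To make the passthru/vmerge point below concrete, here is a hypothetical
masked use (again, not taken from the patch's tests). Because
vrgather.vv has a masked form, %passthru could be folded into the gather
there; with vcompress.vm the select has to remain a separate vmerge.vvm
and %passthru must stay live across the compress:

  define <8 x i32> @compress_then_select(<8 x i32> %v, <8 x i32> %passthru, <8 x i1> %m) {
    %shuf = shufflevector <8 x i32> %v, <8 x i32> poison, <8 x i32> <i32 1, i32 3, i32 4, i32 6, i32 undef, i32 undef, i32 undef, i32 undef>
    %res = select <8 x i1> %m, <8 x i32> %shuf, <8 x i32> %passthru
    ret <8 x i32> %res
  }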
The largest concern with the extra vmerge is that we locally increase
register pressure. If we do have masking, we also have a passthru, and
without the ability to fold that into the vcompress, we need to keep it
alive a bit longer. This can hurt at e.g. m8 where we have very few
architectural registers. As compared with the vrgather.vv sequence, this
is only one additional m1 VREG, since we no longer need the index
vector. It compares slightly worse against vrgatherei16.vv, which can
use index vectors smaller than its other operands. Note that we could
potentially fold the vmerge if only tail elements are being preserved; I
haven't investigated this.
Given our current lowering structure, it is unfortunately hard to know
whether we're emitting a shuffle that will be followed by masking.
Thankfully, this doesn't seem to show up much in practice, so I think we
can probably ignore it.
This patch only handles single-source compress idioms at the moment.
This is an effort to avoid interacting with other patches under review
that change how we canonicalize length-changing shuffles.