[all-commits] [llvm/llvm-project] 8374d4: [RISCV] Decompose single source shuffles (without ...

Tue Feb 11 14:40:59 PST 2025

  Branch: refs/heads/main
  Home:   https://github.com/llvm/llvm-project
  Commit: 8374d421861cd3d47e21ae7889ba0b4c498e8d85
      https://github.com/llvm/llvm-project/commit/8374d421861cd3d47e21ae7889ba0b4c498e8d85
  Author: Philip Reames <preames at rivosinc.com>
  Date:   2025-02-11 (Tue, 11 Feb 2025)

  Changed paths:
    M llvm/lib/Target/RISCV/RISCVISelLowering.cpp
    M llvm/test/CodeGen/RISCV/rvv/fixed-vectors-fp-interleave.ll
    M llvm/test/CodeGen/RISCV/rvv/fixed-vectors-int-interleave.ll
    M llvm/test/CodeGen/RISCV/rvv/fixed-vectors-int-shuffles.ll
    M llvm/test/CodeGen/RISCV/rvv/fixed-vectors-shuffle-changes-length.ll

  Log Message:
  -----------
  [RISCV] Decompose single source shuffles (without exact VLEN) (#126108)

This is a continuation of the work started in #125735 to lower selected
VLA shuffles in linear m1 components instead of generating O(LMUL^2) or
O(LMUL*Log2(LMUL) high LMUL shuffles.

This pattern focuses on shuffles where all the elements being used
across the entire destination register group come from a single register
in the source register group. Such cases come up fairly frequently via
e.g. spread(N), and repeat(N) idioms.

One subtlety to this patch is the handling of the index vector for
vrgatherei16.vv. Because the index and source registers can have
different EEW, the index vector for the Nth chunk of the destination is
not guaranteed to be register aligned. In fact, it is common for e.g. an
EEW=64 shuffle to have EEW=16 indices which are four chunks per source
register. Given this, we have to pay a cost for extracting these chunks
into the low position before performing each shuffle.

I'd initially expressed this as a naive extract sub-vector for each data
parallel piece. However, at high LMUL, this quickly caused register
pressure problems since we could at worst need 4x the temporary
registers for the index. Instead, this patch uses a repeating slidedown
chained from previous iterations. This increases critical path by at
worst 3 slides (SEW=64 is the worst case), but reduces register pressure
to at worst 2x - and only if the original index vector is reused
elsewhere. I view this as arguably a bit of a workaround (since our
scheduling should have done better with the plain extract variant), but
a probably neccessary one.

To unsubscribe from these emails, change your notification settings at https://github.com/llvm/llvm-project/settings/notifications