[all-commits] [llvm/llvm-project] 947882: [RISCV] Decompose single source shuffles (without ...

Wed Feb 12 12:10:58 PST 2025

  Branch: refs/heads/main
  Home:   https://github.com/llvm/llvm-project
  Commit: 9478822f4f63aa2e5f7bc120406688298911fa24
      https://github.com/llvm/llvm-project/commit/9478822f4f63aa2e5f7bc120406688298911fa24
  Author: Philip Reames <preames at rivosinc.com>
  Date:   2025-02-12 (Wed, 12 Feb 2025)

  Changed paths:
    M llvm/lib/Target/RISCV/RISCVISelLowering.cpp
    M llvm/test/CodeGen/RISCV/rvv/fixed-vectors-fp-interleave.ll
    M llvm/test/CodeGen/RISCV/rvv/fixed-vectors-int-interleave.ll
    M llvm/test/CodeGen/RISCV/rvv/fixed-vectors-int-shuffles.ll
    M llvm/test/CodeGen/RISCV/rvv/fixed-vectors-shuffle-changes-length.ll

  Log Message:
  -----------
  [RISCV] Decompose single source shuffles (without exact VLEN) (#126951)

(This is a re-apply for what was 8374d42. The bug there was fairly 
major - despite the comments and review description, the code was 
using each register in the source register group, not only the first 
register. This was completely wrong.)

This is a continuation of the work started in
https://github.com/llvm/llvm-project/pull/125735 to lower selected VLA
shuffles in linear m1 components instead of generating O(LMUL^2) or
O(LMUL*Log2(LMUL) high LMUL shuffles.

This pattern focuses on shuffles where all the elements being used
across the entire destination register group come from a single register
in the source register group. Such cases come up fairly frequently via
e.g. spread(N), and repeat(N) idioms.

One subtlety to this patch is the handling of the index vector for
vrgatherei16.vv. Because the index and source registers can have
different EEW, the index vector for the Nth chunk of the destination is
not guaranteed to be register aligned. In fact, it is common for e.g. an
EEW=64 shuffle to have EEW=16 indices which are four chunks per source
register. Given this, we have to pay a cost for extracting these chunks
into the low position before performing each shuffle.

I'd initially expressed this as a naive extract sub-vector for each data
parallel piece. However, at high LMUL, this quickly caused register
pressure problems since we could at worst need 4x the temporary
registers for the index. Instead, this patch uses a repeating slidedown
chained from previous iterations. This increases critical path by at
worst 3 slides (SEW=64 is the worst case), but reduces register pressure
to at worst 2x - and only if the original index vector is reused
elsewhere. I view this as arguably a bit of a workaround (since our
scheduling should have done better with the plain extract variant), but
a probably necessary one.

To unsubscribe from these emails, change your notification settings at https://github.com/llvm/llvm-project/settings/notifications