[all-commits] [llvm/llvm-project] 3b64ed: [RISCV] Decompose LMUL > 1 reverses into LMUL * M1...
Luke Lau via All-commits
all-commits at lists.llvm.org
Wed Aug 28 23:33:02 PDT 2024
Branch: refs/heads/main
Home: https://github.com/llvm/llvm-project
Commit: 3b64ede096ce0a0230c4d3f77782e6fa18f2943a
https://github.com/llvm/llvm-project/commit/3b64ede096ce0a0230c4d3f77782e6fa18f2943a
Author: Luke Lau <luke at igalia.com>
Date: 2024-08-29 (Thu, 29 Aug 2024)
Changed paths:
M llvm/lib/Target/RISCV/RISCVISelLowering.cpp
M llvm/test/CodeGen/RISCV/rvv/fixed-vectors-shuffle-reverse.ll
M llvm/test/CodeGen/RISCV/rvv/named-vector-shuffle-reverse.ll
M llvm/test/CodeGen/RISCV/rvv/vp-reverse-int.ll
M llvm/test/CodeGen/RISCV/rvv/vp-reverse-mask.ll
Log Message:
-----------
[RISCV] Decompose LMUL > 1 reverses into LMUL * M1 vrgather.vv (#104574)
As far as I'm aware, vrgather.vv is quadratic in LMUL on most
microarchitectures today due to each output register needing to read
from each input register in the group.
For example, the reciprocal throughput for vrgather.vv on the
spacemit-x60 is listed on
https://camel-cdr.github.io/rvv-bench-results/bpi_f3 as:
LMUL1 LMUL2 LMUL4 LMUL8
4.0 16.0 64.0 256.1
Vector reverses are commonly emitted by the loop vectorizer and are
lowered as vrgather.vvs, but since the loop vectorizer uses LMUL 2 by
default they end up being quadratic.
The output registers in a reverse only need to read from one input
register though, so we can decompose this into LMUL * M1 vrgather.vvs to
get linear performance.
This gives a 0.43% runtime improvement on 526.blender_r at rva22u64_v O3
on the Banana Pi F3.
To unsubscribe from these emails, change your notification settings at https://github.com/llvm/llvm-project/settings/notifications
More information about the All-commits
mailing list