[llvm] [RISCV] Support llvm.masked.expandload intrinsic (PR #101954)
Luke Lau via llvm-commits
llvm-commits at lists.llvm.org
Tue Aug 6 04:42:36 PDT 2024
================
@@ -250,50 +102,13 @@ declare <4 x i16> @llvm.masked.expandload.v4i16(ptr, <4 x i1>, <4 x i16>)
define <4 x i16> @expandload_v4i16(ptr %base, <4 x i16> %src0, <4 x i1> %mask) {
; CHECK-LABEL: expandload_v4i16:
; CHECK: # %bb.0:
-; CHECK-NEXT: vsetivli zero, 1, e8, m1, ta, ma
-; CHECK-NEXT: vmv.x.s a1, v0
-; CHECK-NEXT: andi a2, a1, 1
-; CHECK-NEXT: bnez a2, .LBB6_5
-; CHECK-NEXT: # %bb.1: # %else
-; CHECK-NEXT: andi a2, a1, 2
-; CHECK-NEXT: bnez a2, .LBB6_6
-; CHECK-NEXT: .LBB6_2: # %else2
-; CHECK-NEXT: andi a2, a1, 4
-; CHECK-NEXT: bnez a2, .LBB6_7
-; CHECK-NEXT: .LBB6_3: # %else6
-; CHECK-NEXT: andi a1, a1, 8
-; CHECK-NEXT: bnez a1, .LBB6_8
-; CHECK-NEXT: .LBB6_4: # %else10
-; CHECK-NEXT: ret
-; CHECK-NEXT: .LBB6_5: # %cond.load
-; CHECK-NEXT: lh a2, 0(a0)
-; CHECK-NEXT: vsetvli zero, zero, e16, m2, tu, ma
-; CHECK-NEXT: vmv.s.x v8, a2
-; CHECK-NEXT: addi a0, a0, 2
-; CHECK-NEXT: andi a2, a1, 2
-; CHECK-NEXT: beqz a2, .LBB6_2
-; CHECK-NEXT: .LBB6_6: # %cond.load1
-; CHECK-NEXT: lh a2, 0(a0)
-; CHECK-NEXT: vsetvli zero, zero, e16, m2, ta, ma
-; CHECK-NEXT: vmv.s.x v9, a2
-; CHECK-NEXT: vsetivli zero, 2, e16, mf2, tu, ma
-; CHECK-NEXT: vslideup.vi v8, v9, 1
-; CHECK-NEXT: addi a0, a0, 2
-; CHECK-NEXT: andi a2, a1, 4
-; CHECK-NEXT: beqz a2, .LBB6_3
-; CHECK-NEXT: .LBB6_7: # %cond.load5
-; CHECK-NEXT: lh a2, 0(a0)
-; CHECK-NEXT: vsetivli zero, 3, e16, mf2, tu, ma
-; CHECK-NEXT: vmv.s.x v9, a2
-; CHECK-NEXT: vslideup.vi v8, v9, 2
-; CHECK-NEXT: addi a0, a0, 2
-; CHECK-NEXT: andi a1, a1, 8
-; CHECK-NEXT: beqz a1, .LBB6_4
-; CHECK-NEXT: .LBB6_8: # %cond.load9
-; CHECK-NEXT: lh a0, 0(a0)
-; CHECK-NEXT: vsetivli zero, 4, e16, mf2, ta, ma
-; CHECK-NEXT: vmv.s.x v9, a0
-; CHECK-NEXT: vslideup.vi v8, v9, 3
+; CHECK-NEXT: vsetivli zero, 4, e8, mf4, ta, ma
+; CHECK-NEXT: vcpop.m a1, v0
+; CHECK-NEXT: vsetvli zero, a1, e16, mf2, ta, ma
+; CHECK-NEXT: vle16.v v9, (a0)
+; CHECK-NEXT: vsetivli zero, 4, e16, mf2, ta, mu
+; CHECK-NEXT: viota.m v10, v0
+; CHECK-NEXT: vrgather.vv v8, v9, v10, v0.t
----------------
lukel97 wrote:
> These cases should easily exceed the throughput of indexed loads on these architectures
vrgather.vv doesn't perform the load though, so I'm not sure we can compare them directly. The vluxei*.v is kind of doing two in one: the load and the shuffle in a single instruction.
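For comparison, a masked indexed-load lowering of the same v4i16 expandload might look roughly like this (my sketch, not something this patch emits):

	vsetivli   zero, 4, e16, mf2, ta, mu
	viota.m    v10, v0               # prefix sum of mask bits = packed-data index per lane
	vsll.vi    v10, v10, 1           # scale indices to byte offsets (2 bytes per e16 element)
	vluxei16.v v8, (a0), v10, v0.t   # gather straight from memory; inactive lanes keep %src0

Here one vluxei16.v stands in for the vcpop.m + vle16.v + vrgather.vv of the new lowering, which is what makes a direct throughput comparison tricky.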
I think the big performance concern is LMUL > 1: according to https://camel-cdr.github.io/rvv-bench-results/bpi_f3/index.html, vrgather.vv takes 16 cycles at e8m2 and 64 cycles at e8m4 on the BPI-F3. The loop vectorizer uses LMUL 2 by default, if it ever learns to emit expanding loads.
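To make that concrete, an expanding load at the vectorizer's default LMUL of 2 would be e.g. a <32 x i8> case (e8m2 at VLEN=128); this IR is hypothetical, just following the same shape as the tests above:

	declare <32 x i8> @llvm.masked.expandload.v32i8(ptr, <32 x i1>, <32 x i8>)

	define <32 x i8> @expandload_v32i8(ptr %base, <32 x i8> %src0, <32 x i1> %mask) {
	  %res = call <32 x i8> @llvm.masked.expandload.v32i8(ptr %base, <32 x i1> %mask, <32 x i8> %src0)
	  ret <32 x i8> %res
	}

With the lowering above, the vrgather.vv here would execute at e8m2, i.e. the 16-cycle case from the benchmark.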
https://github.com/llvm/llvm-project/pull/101954