[llvm] [LoadStoreVectorizer] Fill gaps in load/store chains to enable vectorization (PR #159388)
Drew Kersnar via llvm-commits
llvm-commits at lists.llvm.org
Mon Sep 22 10:20:39 PDT 2025
================
@@ -45,29 +45,31 @@ define half @fh(ptr %p) {
; ENABLED-LABEL: fh(
; ENABLED: {
; ENABLED-NEXT: .reg .b16 %rs<10>;
-; ENABLED-NEXT: .reg .b32 %r<13>;
+; ENABLED-NEXT: .reg .b32 %r<17>;
; ENABLED-NEXT: .reg .b64 %rd<2>;
; ENABLED-EMPTY:
; ENABLED-NEXT: // %bb.0:
; ENABLED-NEXT: ld.param.b64 %rd1, [fh_param_0];
-; ENABLED-NEXT: ld.v4.b16 {%rs1, %rs2, %rs3, %rs4}, [%rd1];
-; ENABLED-NEXT: ld.b16 %rs5, [%rd1+8];
-; ENABLED-NEXT: cvt.f32.f16 %r1, %rs2;
-; ENABLED-NEXT: cvt.f32.f16 %r2, %rs1;
-; ENABLED-NEXT: add.rn.f32 %r3, %r2, %r1;
-; ENABLED-NEXT: cvt.rn.f16.f32 %rs6, %r3;
-; ENABLED-NEXT: cvt.f32.f16 %r4, %rs4;
-; ENABLED-NEXT: cvt.f32.f16 %r5, %rs3;
-; ENABLED-NEXT: add.rn.f32 %r6, %r5, %r4;
-; ENABLED-NEXT: cvt.rn.f16.f32 %rs7, %r6;
-; ENABLED-NEXT: cvt.f32.f16 %r7, %rs7;
-; ENABLED-NEXT: cvt.f32.f16 %r8, %rs6;
-; ENABLED-NEXT: add.rn.f32 %r9, %r8, %r7;
-; ENABLED-NEXT: cvt.rn.f16.f32 %rs8, %r9;
-; ENABLED-NEXT: cvt.f32.f16 %r10, %rs8;
-; ENABLED-NEXT: cvt.f32.f16 %r11, %rs5;
-; ENABLED-NEXT: add.rn.f32 %r12, %r10, %r11;
-; ENABLED-NEXT: cvt.rn.f16.f32 %rs9, %r12;
+; ENABLED-NEXT: ld.v4.b32 {%r1, %r2, %r3, %r4}, [%rd1];
----------------
dakersnar wrote:
Note the difference in the number of ld instructions in the PTX. The old output uses two load instructions to load 5 b16s: a ld.v4.b16 and a ld.b16. In the new version, the LSV "extends" the chain of 5 loads to the next power of two, a chain of 8 loads with 3 unused tail elements, and vectorizes it as a single `load <8 x i16>`. The backend lowers this to a `ld.v4.b32`, of which 2.5 elements (containing the 5 packed b16s) are used and the rest are unused.
This reduction from two load instructions to one load instruction is an optimization.
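To make the arithmetic above concrete, here is a minimal sketch that only reproduces the numbers in this comment. It is not the LSV implementation: the helper names `next_power_of_two` and `gap_fill_summary` are hypothetical, and the real pass presumably applies legality and profitability checks that are not modeled here.

```python
def next_power_of_two(n: int) -> int:
    """Return the smallest power of two that is >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()


def gap_fill_summary(num_elems: int, elem_bits: int, reg_bits: int = 32) -> dict:
    """Summarize padding a chain of `num_elems` loads up to a power of two.

    Hypothetical helper for illustration only; it models the size
    arithmetic described in the comment, not any LLVM API.
    """
    padded = next_power_of_two(num_elems)
    total_bits = padded * elem_bits
    return {
        "padded_elems": padded,                      # chain length after extension
        "unused_tail_elems": padded - num_elems,     # extra elements that are loaded but unused
        "vector_bytes": total_bits // 8,             # width of the single vector load
        "reg_lanes": total_bits // reg_bits,         # lanes when lowered to reg_bits-wide elements
        "used_reg_lanes": num_elems * elem_bits / reg_bits,  # lanes holding real data
    }


# The case from the test: 5 x b16 loads, lowered to b32 lanes.
summary = gap_fill_summary(num_elems=5, elem_bits=16)
print(summary)
```

For the `fh` test this gives 8 padded elements with 3 unused tail elements, a 16-byte vector load, and 4 b32 lanes of which 2.5 carry the original 5 b16 values, matching the `load <8 x i16>` and `ld.v4.b32` described above.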
https://github.com/llvm/llvm-project/pull/159388