[llvm] [NVPTX] Generalize and extend upsizing when lowering 8/16-bit-element vector loads/stores (PR #119622)
Artem Belevich via llvm-commits
llvm-commits at lists.llvm.org
Wed Dec 11 15:48:42 PST 2024
================
@@ -194,6 +194,156 @@ define void @generic_4xi8(ptr %a) {
ret void
}
+define void @generic_8xi8(ptr %a) {
+; CHECK-LABEL: generic_8xi8(
+; CHECK: {
+; CHECK-NEXT: .reg .b16 %rs<17>;
+; CHECK-NEXT: .reg .b32 %r<25>;
+; CHECK-NEXT: .reg .b64 %rd<2>;
+; CHECK-EMPTY:
+; CHECK-NEXT: // %bb.0:
+; CHECK-NEXT: ld.param.u64 %rd1, [generic_8xi8_param_0];
+; CHECK-NEXT: ld.v2.b32 {%r1, %r2}, [%rd1];
+; CHECK-NEXT: bfe.u32 %r3, %r2, 24, 8;
+; CHECK-NEXT: cvt.u16.u32 %rs1, %r3;
+; CHECK-NEXT: add.s16 %rs2, %rs1, 1;
+; CHECK-NEXT: cvt.u32.u16 %r4, %rs2;
+; CHECK-NEXT: bfe.u32 %r5, %r2, 16, 8;
+; CHECK-NEXT: cvt.u16.u32 %rs3, %r5;
+; CHECK-NEXT: add.s16 %rs4, %rs3, 1;
+; CHECK-NEXT: cvt.u32.u16 %r6, %rs4;
----------------
Artem-B wrote:
This looks like another optimization opportunity. When we need to extract four i8 values into i16, it may be faster to do it this way:
Let's assume input `I = i32 XXYYZZWW`. We need to produce four `i16` values `XX/YY/ZZ/WW`.
```
PRMT %ZW, %I, 0, 0x4140; // ZW = 0x00ZZ00WW
PRMT %XY, %I, 0, 0x4342; // XY = 0x00XX00YY
mov.b32 {%rsW, %rsZ}, %ZW; // low half first: rsW = 0x00WW, rsZ = 0x00ZZ
mov.b32 {%rsY, %rsX}, %XY; // low half first: rsY = 0x00YY, rsX = 0x00XX
```
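To sanity-check the selector values, here is a minimal host-side sketch that models `prmt` in its default mode and applies the two selectors above. The names `prmt_model`, `Unpacked`, `unpack4xi8`, and `demo` are hypothetical helpers for illustration only; they are not part of LLVM or this patch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical software model of PTX prmt.b32 (default mode).
// Bytes 0-3 of the pool come from `a`, bytes 4-7 from `b`.
// Each selector nibble picks one pool byte for one result byte;
// bit 3 of a nibble replicates the sign bit of the selected byte instead.
inline uint32_t prmt_model(uint32_t a, uint32_t b, uint32_t sel) {
  uint64_t pool = (static_cast<uint64_t>(b) << 32) | a;
  uint32_t result = 0;
  for (int i = 0; i < 4; ++i) {
    uint32_t nib = (sel >> (4 * i)) & 0xF;
    uint32_t byte = static_cast<uint32_t>(pool >> (8 * (nib & 0x7))) & 0xFF;
    if (nib & 0x8)
      byte = (byte & 0x80) ? 0xFF : 0x00;
    result |= byte << (8 * i);
  }
  return result;
}

// The four zero-extended i16 values produced from I = 0xXXYYZZWW.
struct Unpacked {
  uint16_t xx, yy, zz, ww;
};

inline Unpacked unpack4xi8(uint32_t in) {
  uint32_t zw = prmt_model(in, 0, 0x4140); // 0x00ZZ00WW
  uint32_t xy = prmt_model(in, 0, 0x4342); // 0x00XX00YY
  // Splitting each word into halves mirrors the mov.b32 unpack step.
  return {static_cast<uint16_t>(xy >> 16), static_cast<uint16_t>(xy & 0xFFFF),
          static_cast<uint16_t>(zw >> 16), static_cast<uint16_t>(zw & 0xFFFF)};
}

// Usage sketch: 0x11223344 plays the role of XXYYZZWW.
inline void demo() {
  Unpacked u = unpack4xi8(0x11223344u);
  assert(u.xx == 0x11 && u.yy == 0x22 && u.zz == 0x33 && u.ww == 0x44);
}
```

Selector nibble `4` picks byte 0 of the second operand, which is zero here, so it serves as the zero-extension padding.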
On GPUs that support v2i16 operations, we could perform the arithmetic directly on `%ZW` and `%XY` without unpacking them first.
This should probably go into a separate patch.
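As a rough illustration of why operating on the packed words would work, the sketch below models a lane-wise v2i16 add on the host. `add_v2i16` is a hypothetical helper, not an LLVM or PTX API, and it only assumes that each 16-bit lane wraps independently.

```cpp
#include <cstdint>

// Hypothetical model of a packed add on two 16-bit lanes held in one i32.
// Each lane wraps on its own; no carry crosses from the low lane to the high lane.
inline uint32_t add_v2i16(uint32_t a, uint32_t b) {
  uint32_t lo = (a + b) & 0xFFFFu;                  // low lane, carry discarded
  uint32_t hi = ((a >> 16) + (b >> 16)) & 0xFFFFu;  // high lane, computed separately
  return (hi << 16) | lo;
}
```

With a word such as `0x00ZZ00WW`, adding `0x00010001` would increment both elements at once, which corresponds to the pair of `add.s16 ..., 1` instructions in the CHECK lines.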
https://github.com/llvm/llvm-project/pull/119622