[llvm] [NVPTX] Improve lowering of v4i8 (PR #67866)

Mon Oct 2 11:47:17 PDT 2023

================
@@ -3292,6 +3293,10 @@ let hasSideEffects = false in {
                              (ins Int16Regs:$s1, Int16Regs:$s2,
                                   Int16Regs:$s3, Int16Regs:$s4),
                              "mov.b64 \t$d, {{$s1, $s2, $s3, $s4}};", []>;
+  def V4I8toI32  : NVPTXInst<(outs Int32Regs:$d),
+                             (ins Int16Regs:$s1, Int16Regs:$s2,
+                                  Int16Regs:$s3, Int16Regs:$s4),
+                             "mov.b32 \t$d, {{$s1, $s2, $s3, $s4}};", []>;
----------------
Artem-B wrote:


According to the docs .B3 2nd is expected to be able to pack/unpack four 8-bit values: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-mov-2

![image](https://github.com/llvm/llvm-project/assets/526795/0497ae25-a8c8-4395-a221-b1a09fb3d404)

However, they do expect the arguments to be a .b8 values. 

move.b32 to/from elements of `v4.b8` appears to work: https://cuda.godbolt.org/z/qros4r6r1

https://github.com/llvm/llvm-project/pull/67866