[llvm] 3fc1aad - [PowerPC] Merge vsr(vsro(input, byte_shift), bit_shift) to vsrq(input, res_bit_shift) (#154388)

Sun Aug 31 21:44:18 PDT 2025

Author: Tony Varghese
Date: 2025-09-01T10:14:12+05:30
New Revision: 3fc1aad65bc38d964b81091e57b1bcb3d56b8a3c

URL: https://github.com/llvm/llvm-project/commit/3fc1aad65bc38d964b81091e57b1bcb3d56b8a3c
DIFF: https://github.com/llvm/llvm-project/commit/3fc1aad65bc38d964b81091e57b1bcb3d56b8a3c.diff

LOG: [PowerPC] Merge vsr(vsro(input, byte_shift), bit_shift) to vsrq(input, res_bit_shift) (#154388)

This change implements a patfrag based pattern matching ~dag combiner~
that combines consecutive `VSRO (Vector Shift Right Octet)` and `VSR
(Vector Shift Right)` instructions into a single `VSRQ (Vector Shift
Right Quadword)` instruction on Power10+ processors.

Vector right shift operations like `vec_srl(vec_sro(input, byte_shift),
bit_shift)` generate two separate instructions `(VSRO + VSR)` when they
could be optimised into a single `VSRQ `instruction that performs the
equivalent operation.

```
vsr(vsro (input, vsro_byte_shift), vsr_bit_shift) to vsrq(input, vsrq_bit_shift) 
where vsrq_bit_shift = (vsro_byte_shift * 8) + vsr_bit_shift
```

Note:
```
 vsro : Vector Shift Right by Octet VX-form
- vsro VRT, VRA, VRB
- The contents of VSR[VRA+32] are shifted right by the number of bytes specified in bits 121:124 of VSR[VRB+32].
	- Bytes shifted out of byte 15 are lost. 
	- Zeros are supplied to the vacated bytes on the left.
- The result is placed into VSR[VRT+32].

vsr : Vector Shift Right VX-form
- vsr VRT, VRA, VRB
- The contents of VSR[VRA+32] are shifted right by the number of bits specified in bits 125:127 of VSR[VRB+32]. 3 bits.
	- Bits shifted out of bit 127 are lost.
	- Zeros are supplied to the vacated bits on the left.
- The result is place into VSR[VRT+32], except if, for any byte element in VSR[VRB+32], the low-order 3 bits are not equal to the shift amount, then VSR[VRT+32] is undefined.

vsrq : Vector Shift Right Quadword VX-form
- vsrq VRT,VRA,VRB 
- Let src1 be the contents of VSR[VRA+32]. Let src2 be the contents of VSR[VRB+32]. 
- src1 is shifted right by the number of bits specified in the low-order 7 bits of src2.
	- Bits shifted out the least-significant bit are lost. 
	- Zeros are supplied to the vacated bits on the left. 
	- The result is placed into VSR[VRT+32].
```

---------

Co-authored-by: Tony Varghese <tony.varghese at ibm.com>

Added: 
    llvm/test/CodeGen/PowerPC/vsro-vsr-vsrq-dag-combine.ll

Modified: 
    llvm/lib/Target/PowerPC/PPCISelLowering.cpp
    llvm/lib/Target/PowerPC/PPCISelLowering.h
    llvm/lib/Target/PowerPC/PPCInstrAltivec.td
    llvm/lib/Target/PowerPC/PPCInstrInfo.td
    llvm/lib/Target/PowerPC/PPCInstrP10.td

Removed: 
    


################################################################################
diff  --git a/llvm/lib/Target/PowerPC/PPCISelLowering.cpp b/llvm/lib/Target/PowerPC/PPCISelLowering.cpp
index 5039f5df7a128..eb9b376b4f62e 100644

--- a/llvm/lib/Target/PowerPC/PPCISelLowering.cpp
+++ b/llvm/lib/Target/PowerPC/PPCISelLowering.cpp
@@ -1693,6 +1693,8 @@ const char *PPCTargetLowering::getTargetNodeName(unsigned Opcode) const {
   case PPCISD::XXPERM:
     return "PPCISD::XXPERM";
   case PPCISD::VECSHL:          return "PPCISD::VECSHL";
+  case PPCISD::VSRQ:
+    return "PPCISD::VSRQ";
   case PPCISD::CMPB:            return "PPCISD::CMPB";
   case PPCISD::Hi:              return "PPCISD::Hi";
   case PPCISD::Lo:              return "PPCISD::Lo";

diff  --git a/llvm/lib/Target/PowerPC/PPCISelLowering.h b/llvm/lib/Target/PowerPC/PPCISelLowering.h
index 6d5172ec247f6..669430550f4e6 100644
--- a/llvm/lib/Target/PowerPC/PPCISelLowering.h
+++ b/llvm/lib/Target/PowerPC/PPCISelLowering.h
@@ -498,6 +498,9 @@ namespace llvm {
     /// SETBCR - The ISA 3.1 (P10) SETBCR instruction.
     SETBCR,
 
+    /// VSRQ - The ISA 3.1 (P10) Vector Shift right quadword instruction
+    VSRQ,
+
     // NOTE: The nodes below may require PC-Rel specific patterns if the
     // address could be PC-Relative. When adding new nodes below, consider
     // whether or not the address can be PC-Relative and add the corresponding

diff  --git a/llvm/lib/Target/PowerPC/PPCInstrAltivec.td b/llvm/lib/Target/PowerPC/PPCInstrAltivec.td
index 3de0d57279061..97d5e28963234 100644
--- a/llvm/lib/Target/PowerPC/PPCInstrAltivec.td
+++ b/llvm/lib/Target/PowerPC/PPCInstrAltivec.td
@@ -261,6 +261,13 @@ def immEQOneV : PatLeaf<(build_vector), [{
     return C->isOne();
   return false;
 }]>;
+
+def VSRVSRO : PatFrag<(ops node:$input, node:$shift), 
+                      (int_ppc_altivec_vsr 
+                        (int_ppc_altivec_vsro node:$input, node:$shift), 
+                        node:$shift), 
+                      [{ return N->getOperand(1).hasOneUse(); }]>;
+
 //===----------------------------------------------------------------------===//
 // Helpers for defining instructions that directly correspond to intrinsics.
 

diff  --git a/llvm/lib/Target/PowerPC/PPCInstrInfo.td b/llvm/lib/Target/PowerPC/PPCInstrInfo.td
index 6477154932c94..7cea9a15962a6 100644
--- a/llvm/lib/Target/PowerPC/PPCInstrInfo.td
+++ b/llvm/lib/Target/PowerPC/PPCInstrInfo.td
@@ -58,6 +58,10 @@ def SDT_PPCVecShift : SDTypeProfile<1, 3, [ SDTCisVec<0>,
   SDTCisVec<1>, SDTCisVec<2>, SDTCisPtrTy<3>
 ]>;
 
+def SDT_PPCVecShiftQuad : SDTypeProfile<1, 2, [
+  SDTCisVec<0>, SDTCisSameAs<0,1>, SDTCisSameAs<0,2>
+]>;
+
 def SDT_PPCVecInsert : SDTypeProfile<1, 3, [ SDTCisVec<0>,
   SDTCisVec<1>, SDTCisVec<2>, SDTCisInt<3>
 ]>;
@@ -157,6 +161,8 @@ def PPCfctiwz : SDNode<"PPCISD::FCTIWZ", SDTFPUnaryOp, []>;
 def PPCfctiduz: SDNode<"PPCISD::FCTIDUZ",SDTFPUnaryOp, []>;
 def PPCfctiwuz: SDNode<"PPCISD::FCTIWUZ",SDTFPUnaryOp, []>;
 
+def PPCvsrq: SDNode<"PPCISD::VSRQ", SDT_PPCVecShiftQuad, []>;
+
 def PPCstrict_fcfid : SDNode<"PPCISD::STRICT_FCFID",
                              SDTFPUnaryOp, [SDNPHasChain]>;
 def PPCstrict_fcfidu : SDNode<"PPCISD::STRICT_FCFIDU",

diff  --git a/llvm/lib/Target/PowerPC/PPCInstrP10.td b/llvm/lib/Target/PowerPC/PPCInstrP10.td
index 75046d74b99b6..3a9b64c8dcd65 100644
--- a/llvm/lib/Target/PowerPC/PPCInstrP10.td
+++ b/llvm/lib/Target/PowerPC/PPCInstrP10.td
@@ -1918,7 +1918,8 @@ let Predicates = [IsISA3_1] in {
                         RegConstraint<"$VDi = $VD">;
   def VSLQ : VX1_VT5_VA5_VB5<261, "vslq", []>;
   def VSRAQ : VX1_VT5_VA5_VB5<773, "vsraq", []>;
-  def VSRQ : VX1_VT5_VA5_VB5<517, "vsrq", []>;
+  def VSRQ : VX1_VT5_VA5_VB5<517, "vsrq", 
+                            [(set v4i32:$VD, (PPCvsrq v4i32:$VA, v4i32:$VB))]>;
   def VRLQ : VX1_VT5_VA5_VB5<5, "vrlq", []>;
   def XSCVQPUQZ : X_VT5_XO5_VB5<63, 0, 836, "xscvqpuqz", []>;
   def XSCVQPSQZ : X_VT5_XO5_VB5<63, 8, 836, "xscvqpsqz", []>;
@@ -2053,6 +2054,9 @@ let Predicates = [IsISA3_1, HasFPU] in {
 
 //---------------------------- Anonymous Patterns ----------------------------//
 let Predicates = [IsISA3_1] in {
+  // Exploit vsrq instruction to optimize VSR(VSRO (input, vsro_byte_shift), vsr_bit_shift)
+  // to VSRQ(input, vsrq_bit_shift)
+  def : Pat<(VSRVSRO v4i32:$vA, v4i32:$vB), (VSRQ $vA, $vB)>;
   // Exploit the vector multiply high instructions using intrinsics.
   def : Pat<(v4i32 (int_ppc_altivec_vmulhsw v4i32:$vA, v4i32:$vB)),
             (v4i32 (VMULHSW $vA, $vB))>;

diff  --git a/llvm/test/CodeGen/PowerPC/vsro-vsr-vsrq-dag-combine.ll b/llvm/test/CodeGen/PowerPC/vsro-vsr-vsrq-dag-combine.ll
new file mode 100644
index 0000000000000..afbdae6dfa09a
--- /dev/null
+++ b/llvm/test/CodeGen/PowerPC/vsro-vsr-vsrq-dag-combine.ll
@@ -0,0 +1,337 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+
+; RUN: llc -verify-machineinstrs -mcpu=pwr10 -mtriple=powerpc64le-unknown-linux-gnu \
+; RUN:   -ppc-asm-full-reg-names --ppc-vsr-nums-as-vr < %s | FileCheck %s --check-prefix=POWER10-LE
+
+; RUN: llc -verify-machineinstrs -mcpu=pwr10 -mtriple=powerpc64-ibm-aix-xcoff \
+; RUN:   -ppc-asm-full-reg-names --ppc-vsr-nums-as-vr < %s | FileCheck %s --check-prefix=POWER10-BE
+
+; RUN: llc -verify-machineinstrs -mcpu=pwr10 -mtriple=powerpc-ibm-aix-xcoff \
+; RUN:   -ppc-asm-full-reg-names --ppc-vsr-nums-as-vr < %s | FileCheck %s --check-prefix=POWER3210-BE
+
+; RUN: llc -verify-machineinstrs -mcpu=pwr9 -mtriple=powerpc64le-unknown-linux-gnu \
+; RUN:   -ppc-asm-full-reg-names --ppc-vsr-nums-as-vr < %s | FileCheck %s --check-prefix=POWER9-LE
+
+; RUN: llc -verify-machineinstrs -mcpu=pwr9 -mtriple=powerpc64-ibm-aix-xcoff \
+; RUN:   -ppc-asm-full-reg-names --ppc-vsr-nums-as-vr < %s | FileCheck %s --check-prefix=POWER9-BE
+
+; Test VSRO + VSR peephole optimization to VSRQ on Power10+
+; This should combine consecutive VSRO (Vector Shift Right Octet) and VSR (Vector Shift Right)
+; instructions using the same shift amount into a single VSRQ (Vector Shift Right Quadword)
+; instruction when targeting Power10 or later processors.
+declare <4 x i32> @llvm.ppc.altivec.vsr(<4 x i32>, <4 x i32>)
+declare <4 x i32> @llvm.ppc.altivec.vsro(<4 x i32>, <4 x i32>)
+
+define <16 x i8> @shiftright128_v16i8(<16 x i8> %in, i8 zeroext %sh) {
+; POWER10-LE-LABEL: shiftright128_v16i8:
+; POWER10-LE:       # %bb.0: # %entry
+; POWER10-LE-NEXT:    mtvsrd v3, r5
+; POWER10-LE-NEXT:    vspltb v3, v3, 7
+; POWER10-LE-NEXT:    vsrq v2, v2, v3
+; POWER10-LE-NEXT:    blr
+;
+; POWER10-BE-LABEL: shiftright128_v16i8:
+; POWER10-BE:       # %bb.0: # %entry
+; POWER10-BE-NEXT:    mtvsrwz v3, r3
+; POWER10-BE-NEXT:    vspltb v3, v3, 7
+; POWER10-BE-NEXT:    vsrq v2, v2, v3
+; POWER10-BE-NEXT:    blr
+;
+; POWER3210-BE-LABEL: shiftright128_v16i8:
+; POWER3210-BE:       # %bb.0: # %entry
+; POWER3210-BE-NEXT:    mtvsrwz v3, r3
+; POWER3210-BE-NEXT:    vspltb v3, v3, 7
+; POWER3210-BE-NEXT:    vsrq v2, v2, v3
+; POWER3210-BE-NEXT:    blr
+;
+; POWER9-LE-LABEL: shiftright128_v16i8:
+; POWER9-LE:       # %bb.0: # %entry
+; POWER9-LE-NEXT:    mtvsrd v3, r5
+; POWER9-LE-NEXT:    vspltb v3, v3, 7
+; POWER9-LE-NEXT:    vsro v2, v2, v3
+; POWER9-LE-NEXT:    vsr v2, v2, v3
+; POWER9-LE-NEXT:    blr
+;
+; POWER9-BE-LABEL: shiftright128_v16i8:
+; POWER9-BE:       # %bb.0: # %entry
+; POWER9-BE-NEXT:    mtvsrwz v3, r3
+; POWER9-BE-NEXT:    vspltb v3, v3, 7
+; POWER9-BE-NEXT:    vsro v2, v2, v3
+; POWER9-BE-NEXT:    vsr v2, v2, v3
+; POWER9-BE-NEXT:    blr
+entry:
+  %splat.splatinsert.i = insertelement <16 x i8> poison, i8 %sh, i64 0
+  %splat.splat.i = shufflevector <16 x i8> %splat.splatinsert.i, <16 x i8> poison, <16 x i32> zeroinitializer
+  %0 = bitcast <16 x i8> %in to <4 x i32>
+  %1 = bitcast <16 x i8> %splat.splat.i to <4 x i32>
+  %2 = tail call <4 x i32> @llvm.ppc.altivec.vsro(<4 x i32> %0, <4 x i32> %1)
+  %3 = tail call <4 x i32> @llvm.ppc.altivec.vsr(<4 x i32> %2, <4 x i32> %1)
+  %4 = bitcast <4 x i32> %3 to <16 x i8>
+  ret <16 x i8> %4
+}
+
+define <4 x i32> @shiftright128_v4i32(<4 x i32> %in, i8 zeroext %sh) {
+; POWER10-LE-LABEL: shiftright128_v4i32:
+; POWER10-LE:       # %bb.0: # %entry
+; POWER10-LE-NEXT:    mtvsrd v3, r5
+; POWER10-LE-NEXT:    vspltb v3, v3, 7
+; POWER10-LE-NEXT:    vsrq v2, v2, v3
+; POWER10-LE-NEXT:    blr
+;
+; POWER10-BE-LABEL: shiftright128_v4i32:
+; POWER10-BE:       # %bb.0: # %entry
+; POWER10-BE-NEXT:    mtvsrwz v3, r3
+; POWER10-BE-NEXT:    vspltb v3, v3, 7
+; POWER10-BE-NEXT:    vsrq v2, v2, v3
+; POWER10-BE-NEXT:    blr
+;
+; POWER3210-BE-LABEL: shiftright128_v4i32:
+; POWER3210-BE:       # %bb.0: # %entry
+; POWER3210-BE-NEXT:    mtvsrwz v3, r3
+; POWER3210-BE-NEXT:    vspltb v3, v3, 7
+; POWER3210-BE-NEXT:    vsrq v2, v2, v3
+; POWER3210-BE-NEXT:    blr
+;
+; POWER9-LE-LABEL: shiftright128_v4i32:
+; POWER9-LE:       # %bb.0: # %entry
+; POWER9-LE-NEXT:    mtvsrd v3, r5
+; POWER9-LE-NEXT:    vspltb v3, v3, 7
+; POWER9-LE-NEXT:    vsro v2, v2, v3
+; POWER9-LE-NEXT:    vsr v2, v2, v3
+; POWER9-LE-NEXT:    blr
+;
+; POWER9-BE-LABEL: shiftright128_v4i32:
+; POWER9-BE:       # %bb.0: # %entry
+; POWER9-BE-NEXT:    mtvsrwz v3, r3
+; POWER9-BE-NEXT:    vspltb v3, v3, 7
+; POWER9-BE-NEXT:    vsro v2, v2, v3
+; POWER9-BE-NEXT:    vsr v2, v2, v3
+; POWER9-BE-NEXT:    blr
+entry:
+  %splat.splatinsert.i = insertelement <16 x i8> poison, i8 %sh, i64 0
+  %splat.splat.i = shufflevector <16 x i8> %splat.splatinsert.i, <16 x i8> poison, <16 x i32> zeroinitializer
+  %0 = bitcast <16 x i8> %splat.splat.i to <4 x i32>
+  %1 = tail call  <4 x i32> @llvm.ppc.altivec.vsro(<4 x i32> %in, <4 x i32> %0)
+  %2 = tail call  <4 x i32> @llvm.ppc.altivec.vsr(<4 x i32> %1, <4 x i32> %0)
+  ret <4 x i32> %2
+}
+
+define <2 x i64> @shiftright128_v2i64(<2 x i64> %in, i8 zeroext %sh) {
+; POWER10-LE-LABEL: shiftright128_v2i64:
+; POWER10-LE:       # %bb.0: # %entry
+; POWER10-LE-NEXT:    mtvsrd v3, r5
+; POWER10-LE-NEXT:    vspltb v3, v3, 7
+; POWER10-LE-NEXT:    vsrq v2, v2, v3
+; POWER10-LE-NEXT:    blr
+;
+; POWER10-BE-LABEL: shiftright128_v2i64:
+; POWER10-BE:       # %bb.0: # %entry
+; POWER10-BE-NEXT:    mtvsrwz v3, r3
+; POWER10-BE-NEXT:    vspltb v3, v3, 7
+; POWER10-BE-NEXT:    vsrq v2, v2, v3
+; POWER10-BE-NEXT:    blr
+;
+; POWER3210-BE-LABEL: shiftright128_v2i64:
+; POWER3210-BE:       # %bb.0: # %entry
+; POWER3210-BE-NEXT:    mtvsrwz v3, r3
+; POWER3210-BE-NEXT:    vspltb v3, v3, 7
+; POWER3210-BE-NEXT:    vsrq v2, v2, v3
+; POWER3210-BE-NEXT:    blr
+;
+; POWER9-LE-LABEL: shiftright128_v2i64:
+; POWER9-LE:       # %bb.0: # %entry
+; POWER9-LE-NEXT:    mtvsrd v3, r5
+; POWER9-LE-NEXT:    vspltb v3, v3, 7
+; POWER9-LE-NEXT:    vsro v2, v2, v3
+; POWER9-LE-NEXT:    vsr v2, v2, v3
+; POWER9-LE-NEXT:    blr
+;
+; POWER9-BE-LABEL: shiftright128_v2i64:
+; POWER9-BE:       # %bb.0: # %entry
+; POWER9-BE-NEXT:    mtvsrwz v3, r3
+; POWER9-BE-NEXT:    vspltb v3, v3, 7
+; POWER9-BE-NEXT:    vsro v2, v2, v3
+; POWER9-BE-NEXT:    vsr v2, v2, v3
+; POWER9-BE-NEXT:    blr
+entry:
+  %splat.splatinsert.i = insertelement <16 x i8> poison, i8 %sh, i64 0
+  %splat.splat.i = shufflevector <16 x i8> %splat.splatinsert.i, <16 x i8> poison, <16 x i32> zeroinitializer
+  %0 = bitcast <2 x i64> %in to <4 x i32>
+  %1 = bitcast <16 x i8> %splat.splat.i to <4 x i32>
+  %2 = tail call <4 x i32> @llvm.ppc.altivec.vsro(<4 x i32> %0, <4 x i32> %1)
+  %3 = tail call <4 x i32> @llvm.ppc.altivec.vsr(<4 x i32> %2, <4 x i32> %1)
+  %4 = bitcast <4 x i32> %3 to <2 x i64>
+  ret <2 x i64> %4
+}
+
+define <8 x i16> @shiftright128_v8i16(<8 x i16> %in, i8 zeroext %sh) {
+; POWER10-LE-LABEL: shiftright128_v8i16:
+; POWER10-LE:       # %bb.0: # %entry
+; POWER10-LE-NEXT:    mtvsrd v3, r5
+; POWER10-LE-NEXT:    vspltb v3, v3, 7
+; POWER10-LE-NEXT:    vsrq v2, v2, v3
+; POWER10-LE-NEXT:    blr
+;
+; POWER10-BE-LABEL: shiftright128_v8i16:
+; POWER10-BE:       # %bb.0: # %entry
+; POWER10-BE-NEXT:    mtvsrwz v3, r3
+; POWER10-BE-NEXT:    vspltb v3, v3, 7
+; POWER10-BE-NEXT:    vsrq v2, v2, v3
+; POWER10-BE-NEXT:    blr
+;
+; POWER3210-BE-LABEL: shiftright128_v8i16:
+; POWER3210-BE:       # %bb.0: # %entry
+; POWER3210-BE-NEXT:    mtvsrwz v3, r3
+; POWER3210-BE-NEXT:    vspltb v3, v3, 7
+; POWER3210-BE-NEXT:    vsrq v2, v2, v3
+; POWER3210-BE-NEXT:    blr
+;
+; POWER9-LE-LABEL: shiftright128_v8i16:
+; POWER9-LE:       # %bb.0: # %entry
+; POWER9-LE-NEXT:    mtvsrd v3, r5
+; POWER9-LE-NEXT:    vspltb v3, v3, 7
+; POWER9-LE-NEXT:    vsro v2, v2, v3
+; POWER9-LE-NEXT:    vsr v2, v2, v3
+; POWER9-LE-NEXT:    blr
+;
+; POWER9-BE-LABEL: shiftright128_v8i16:
+; POWER9-BE:       # %bb.0: # %entry
+; POWER9-BE-NEXT:    mtvsrwz v3, r3
+; POWER9-BE-NEXT:    vspltb v3, v3, 7
+; POWER9-BE-NEXT:    vsro v2, v2, v3
+; POWER9-BE-NEXT:    vsr v2, v2, v3
+; POWER9-BE-NEXT:    blr
+entry:
+  %splat.splatinsert.i = insertelement <16 x i8> poison, i8 %sh, i64 0
+  %splat.splat.i = shufflevector <16 x i8> %splat.splatinsert.i, <16 x i8> poison, <16 x i32> zeroinitializer
+  %0 = bitcast <8 x i16> %in to <4 x i32>
+  %1 = bitcast <16 x i8> %splat.splat.i to <4 x i32>
+  %2 = tail call <4 x i32> @llvm.ppc.altivec.vsro(<4 x i32> %0, <4 x i32> %1)
+  %3 = tail call <4 x i32> @llvm.ppc.altivec.vsr(<4 x i32> %2, <4 x i32> %1)
+  %4 = bitcast <4 x i32> %3 to <8 x i16>
+  ret <8 x i16> %4
+}
+
+; Test case with 
diff erent vectors (should not optimize - 
diff erent shift amount registers)
+define <16 x i8> @no_optimization_
diff erent_shifts(<16 x i8> %in, i8 zeroext %sh1, i8 zeroext %sh2)  {
+; POWER10-LE-LABEL: no_optimization_
diff erent_shifts:
+; POWER10-LE:       # %bb.0: # %entry
+; POWER10-LE-NEXT:    mtvsrd v3, r5
+; POWER10-LE-NEXT:    mtvsrd v4, r6
+; POWER10-LE-NEXT:    vspltb v3, v3, 7
+; POWER10-LE-NEXT:    vspltb v4, v4, 7
+; POWER10-LE-NEXT:    vsro v2, v2, v3
+; POWER10-LE-NEXT:    vsr v2, v2, v4
+; POWER10-LE-NEXT:    blr
+;
+; POWER10-BE-LABEL: no_optimization_
diff erent_shifts:
+; POWER10-BE:       # %bb.0: # %entry
+; POWER10-BE-NEXT:    mtvsrwz v3, r3
+; POWER10-BE-NEXT:    mtvsrwz v4, r4
+; POWER10-BE-NEXT:    vspltb v3, v3, 7
+; POWER10-BE-NEXT:    vspltb v4, v4, 7
+; POWER10-BE-NEXT:    vsro v2, v2, v3
+; POWER10-BE-NEXT:    vsr v2, v2, v4
+; POWER10-BE-NEXT:    blr
+;
+; POWER3210-BE-LABEL: no_optimization_
diff erent_shifts:
+; POWER3210-BE:       # %bb.0: # %entry
+; POWER3210-BE-NEXT:    mtvsrwz v3, r3
+; POWER3210-BE-NEXT:    mtvsrwz v4, r4
+; POWER3210-BE-NEXT:    vspltb v3, v3, 7
+; POWER3210-BE-NEXT:    vspltb v4, v4, 7
+; POWER3210-BE-NEXT:    vsro v2, v2, v3
+; POWER3210-BE-NEXT:    vsr v2, v2, v4
+; POWER3210-BE-NEXT:    blr
+;
+; POWER9-LE-LABEL: no_optimization_
diff erent_shifts:
+; POWER9-LE:       # %bb.0: # %entry
+; POWER9-LE-NEXT:    mtvsrd v3, r5
+; POWER9-LE-NEXT:    mtvsrd v4, r6
+; POWER9-LE-NEXT:    vspltb v3, v3, 7
+; POWER9-LE-NEXT:    vspltb v4, v4, 7
+; POWER9-LE-NEXT:    vsro v2, v2, v3
+; POWER9-LE-NEXT:    vsr v2, v2, v4
+; POWER9-LE-NEXT:    blr
+;
+; POWER9-BE-LABEL: no_optimization_
diff erent_shifts:
+; POWER9-BE:       # %bb.0: # %entry
+; POWER9-BE-NEXT:    mtvsrwz v3, r3
+; POWER9-BE-NEXT:    mtvsrwz v4, r4
+; POWER9-BE-NEXT:    vspltb v3, v3, 7
+; POWER9-BE-NEXT:    vspltb v4, v4, 7
+; POWER9-BE-NEXT:    vsro v2, v2, v3
+; POWER9-BE-NEXT:    vsr v2, v2, v4
+; POWER9-BE-NEXT:    blr
+entry:
+  %splat.splatinsert.i = insertelement <16 x i8> poison, i8 %sh1, i64 0
+  %splat.splat.i = shufflevector <16 x i8> %splat.splatinsert.i, <16 x i8> poison, <16 x i32> zeroinitializer
+  %splat.splatinsert.i2 = insertelement <16 x i8> poison, i8 %sh2, i64 0
+  %splat.splat.i2 = shufflevector <16 x i8> %splat.splatinsert.i2, <16 x i8> poison, <16 x i32> zeroinitializer
+  %0 = bitcast <16 x i8> %in to <4 x i32>
+  %1 = bitcast <16 x i8> %splat.splat.i to <4 x i32>
+  %2 = bitcast <16 x i8> %splat.splat.i2 to <4 x i32>
+  %3 = tail call <4 x i32> @llvm.ppc.altivec.vsro(<4 x i32> %0, <4 x i32> %1)
+  %4 = tail call <4 x i32> @llvm.ppc.altivec.vsr(<4 x i32> %3, <4 x i32> %2)
+  %5 = bitcast <4 x i32> %4 to <16 x i8>
+  ret <16 x i8> %5
+}
+
+; Test case with multiple uses of VSRO result (should not optimize)
+define <16 x i8> @no_optimization_multiple_uses(<16 x i8> %in, i8 zeroext %sh)  {
+; POWER10-LE-LABEL: no_optimization_multiple_uses:
+; POWER10-LE:       # %bb.0: # %entry
+; POWER10-LE-NEXT:    mtvsrd v3, r5
+; POWER10-LE-NEXT:    vspltb v3, v3, 7
+; POWER10-LE-NEXT:    vsro v2, v2, v3
+; POWER10-LE-NEXT:    vsr v3, v2, v3
+; POWER10-LE-NEXT:    vaddubm v2, v2, v3
+; POWER10-LE-NEXT:    blr
+;
+; POWER10-BE-LABEL: no_optimization_multiple_uses:
+; POWER10-BE:       # %bb.0: # %entry
+; POWER10-BE-NEXT:    mtvsrwz v3, r3
+; POWER10-BE-NEXT:    vspltb v3, v3, 7
+; POWER10-BE-NEXT:    vsro v2, v2, v3
+; POWER10-BE-NEXT:    vsr v3, v2, v3
+; POWER10-BE-NEXT:    vaddubm v2, v2, v3
+; POWER10-BE-NEXT:    blr
+;
+; POWER3210-BE-LABEL: no_optimization_multiple_uses:
+; POWER3210-BE:       # %bb.0: # %entry
+; POWER3210-BE-NEXT:    mtvsrwz v3, r3
+; POWER3210-BE-NEXT:    vspltb v3, v3, 7
+; POWER3210-BE-NEXT:    vsro v2, v2, v3
+; POWER3210-BE-NEXT:    vsr v3, v2, v3
+; POWER3210-BE-NEXT:    vaddubm v2, v2, v3
+; POWER3210-BE-NEXT:    blr
+;
+; POWER9-LE-LABEL: no_optimization_multiple_uses:
+; POWER9-LE:       # %bb.0: # %entry
+; POWER9-LE-NEXT:    mtvsrd v3, r5
+; POWER9-LE-NEXT:    vspltb v3, v3, 7
+; POWER9-LE-NEXT:    vsro v2, v2, v3
+; POWER9-LE-NEXT:    vsr v3, v2, v3
+; POWER9-LE-NEXT:    vaddubm v2, v2, v3
+; POWER9-LE-NEXT:    blr
+;
+; POWER9-BE-LABEL: no_optimization_multiple_uses:
+; POWER9-BE:       # %bb.0: # %entry
+; POWER9-BE-NEXT:    mtvsrwz v3, r3
+; POWER9-BE-NEXT:    vspltb v3, v3, 7
+; POWER9-BE-NEXT:    vsro v2, v2, v3
+; POWER9-BE-NEXT:    vsr v3, v2, v3
+; POWER9-BE-NEXT:    vaddubm v2, v2, v3
+; POWER9-BE-NEXT:    blr
+entry:
+  %splat.splatinsert.i = insertelement <16 x i8> poison, i8 %sh, i64 0
+  %splat.splat.i = shufflevector <16 x i8> %splat.splatinsert.i, <16 x i8> poison, <16 x i32> zeroinitializer
+  %0 = bitcast <16 x i8> %in to <4 x i32>
+  %1 = bitcast <16 x i8> %splat.splat.i to <4 x i32>
+  %2 = tail call <4 x i32> @llvm.ppc.altivec.vsro(<4 x i32> %0, <4 x i32> %1)
+  %3 = tail call <4 x i32> @llvm.ppc.altivec.vsr(<4 x i32> %2, <4 x i32> %1)
+  %4 = bitcast <4 x i32> %3 to <16 x i8>
+  %5 = bitcast <4 x i32> %2 to <16 x i8>
+  %6 = add <16 x i8> %5, %4
+  ret <16 x i8> %6
+}