[llvm] [VectorCombine] Prevent redundant cost computation for repeated operand pairs in foldShuffleOfIntrinsics (PR #171965)

Fri Dec 12 04:08:34 PST 2025

Bhuvan1527 wrote:

> Next step is to confirm it works for #170867 and add the test to llvm\test\Transforms\VectorCombine\X86\shuffle-of-intrinsics.ll

Hi @RKSimon 
The shuffle-of-intrinsics.ll test is run for several CPUs using the -mcpu option. With our code changes, the generated output differs between CPUs.

For -mcpu=x86-64 and -mcpu=x86-64-v2, the fold produces:

define <8 x float> @src(<4 x float> %x0, <4 x float> %x1, <4 x float> %y0, <4 x float> %y1) #0 {
  %1 = shufflevector <4 x float> %x1, <4 x float> %y1, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
  %2 = shufflevector <4 x float> %x1, <4 x float> %y1, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
  %3 = shufflevector <4 x float> %x0, <4 x float> %y0, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
  %4 = shufflevector <4 x float> %x0, <4 x float> %y0, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
  %5 = call <8 x float> @llvm.fma.v8f32(<8 x float> %3, <8 x float> %4, <8 x float> zeroinitializer)
  %res = call <8 x float> @llvm.fma.v8f32(<8 x float> %1, <8 x float> %2, <8 x float> %5)
  ret <8 x float> %res
}

For -mcpu=x86-64-v3 and -mcpu=x86-64-v4, the fold produces:

define <8 x float> @src(<4 x float> %x0, <4 x float> %x1, <4 x float> %y0, <4 x float> %y1) #0 {
  %1 = shufflevector <4 x float> %x1, <4 x float> %y1, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
  %2 = shufflevector <4 x float> %x0, <4 x float> %y0, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
  %3 = call <8 x float> @llvm.fma.v8f32(<8 x float> %2, <8 x float> %2, <8 x float> zeroinitializer)
  %res = call <8 x float> @llvm.fma.v8f32(<8 x float> %1, <8 x float> %1, <8 x float> %3)
  ret <8 x float> %res
}
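
Since x86-64 and x86-64-v2 give one result and x86-64-v3 and x86-64-v4 give another, the test could share one check prefix per group. The RUN lines below are only a sketch of how that might look: the SSE and AVX prefix names and the -mtriple value are assumptions for illustration and are not taken from this PR or from the existing test file.

```llvm
; Hypothetical RUN lines. The SSE/AVX prefix names and the triple are
; assumptions; adjust them to match what shuffle-of-intrinsics.ll already uses.
; RUN: opt < %s -passes=vector-combine -S -mtriple=x86_64-- -mcpu=x86-64    | FileCheck %s --check-prefixes=CHECK,SSE
; RUN: opt < %s -passes=vector-combine -S -mtriple=x86_64-- -mcpu=x86-64-v2 | FileCheck %s --check-prefixes=CHECK,SSE
; RUN: opt < %s -passes=vector-combine -S -mtriple=x86_64-- -mcpu=x86-64-v3 | FileCheck %s --check-prefixes=CHECK,AVX
; RUN: opt < %s -passes=vector-combine -S -mtriple=x86_64-- -mcpu=x86-64-v4 | FileCheck %s --check-prefixes=CHECK,AVX
```

With prefixes grouped like this, the first output would sit under the SSE checks and the second under the AVX checks. The check lines themselves can be regenerated with llvm/utils/update_test_checks.py rather than written by hand.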

https://github.com/llvm/llvm-project/pull/171965