<div dir="ltr">Hello,<div><br></div><div>We are investigating a difference in code generation for vector splat instructions between llvm-3.9 and llvm-4.0, which could lead to a performance regression for our target. Here is the C snippet:</div><div><br></div><div>typedef short v8i16_t __attribute__((ext_vector_type(8)));<br></div><div><div><br></div><div>v8i16_t foo (v8i16_t a, int n)</div><div>{</div><div>   return a >> n;<br></div><div>}<br></div></div><div><br></div><div>With llvm-3.9, the generated sequence truncates the scalar shift amount to i16 and then splats it. With llvm-4.0 the order is reversed: the i32 value is splatted into a wider v8i32 vector, which is then truncated with a v8i32->v8i16 trunc. Is this by design? The earlier code sequence is definitely better for our target, but are there known scenarios where the new sequence would lead to better code?</div><div><br></div><div>Here are the instruction sequences generated in the two cases:</div><div><br></div><div>With llvm-3.9:</div><div><br></div><div><div>define <8 x i16> @foo(<8 x i16>, i32) #0 {</div><div>  %3 = trunc i32 %1 to i16</div><div>  %4 = insertelement <8 x i16> undef, i16 %3, i32 0</div><div>  %5 = shufflevector <8 x i16> %4, <8 x i16> undef, <8 x i32> zeroinitializer</div><div>  %6 = ashr <8 x i16> %0, %5</div><div>  ret <8 x i16> %6</div><div>}</div></div><div><br></div><div><br></div><div>With llvm-4.0:</div><div><br></div><div><div>define <8 x i16> @foo(<8 x i16>, i32) #0 {</div><div>  %3 = insertelement <8 x i32> undef, i32 %1, i32 0</div><div>  %4 = shufflevector <8 x i32> %3, <8 x i32> undef, <8 x i32> zeroinitializer</div><div>  %5 = trunc <8 x i32> %4 to <8 x i16></div><div>  %6 = ashr <8 x i16> %0, %5</div><div>  ret <8 x i16> %6</div><div>}</div></div><div><br></div><div>Best regards,</div><div>Saurabh Verma</div></div>