[llvm] r208342 - [X86] Add target specific combine rules to fold SSE2/AVX2 packed arithmetic shift intrinsics.

Sat May 10 08:57:49 PDT 2014

Hmm, that's really odd. Revision 208342 only affects how packed
SSE2/AVX2 arithmetic shift intrinsics are combined in the x86 backend:

+  case Intrinsic::x86_sse2_psrai_w:
+  case Intrinsic::x86_sse2_psrai_d:
+  case Intrinsic::x86_avx2_psrai_w:
+  case Intrinsic::x86_avx2_psrai_d:
+  case Intrinsic::x86_sse2_psra_w:
+  case Intrinsic::x86_sse2_psra_d:
+  case Intrinsic::x86_avx2_psra_w:
+  case Intrinsic::x86_avx2_psra_d:

I have now done a fresh checkout of the LLVM test-suite:
% svn co http://llvm.org/svn/llvm-project/test-suite/trunk test-suite

I did a recursive search for '_mm_sr' (grep -r _mm_sr) from the
test-suite root directory, and this was the only match:
SingleSource/UnitTests/Vector/SSE/sse.shift.c:  zeroones = _mm_srli_epi16(allones, 8);

That intrinsic is a logical packed shift (definitely not one of the
intrinsics optimized by my patch).
I couldn't find any occurrence of AVX2 intrinsics in the entire test suite.
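
To make the logical/arithmetic distinction concrete, here is a minimal
scalar sketch of what a single 16-bit lane does under each kind of shift.
The function names sra_lane16 and srl_lane16 are hypothetical and exist
only for this illustration; this is not code from the patch or from the
test-suite. It models the hardware behaviour as I understand it: an
arithmetic shift replicates the sign bit and treats counts above 15 like
a shift by 15, while a logical shift fills with zeros and yields 0 for
counts above 15.

```cpp
#include <cstdint>

// Hypothetical scalar model of one 16-bit lane of a packed arithmetic
// right shift (the psra/psrai family touched by r208342).
// Counts above 15 behave like a shift by 15: the lane fills with the sign bit.
int16_t sra_lane16(int16_t x, unsigned count) {
    if (count > 15)
        count = 15;
    int v = x; // work on the promoted value
    // Avoid right-shifting a negative value: flip the bits, shift the
    // resulting non-negative value, then flip back.
    int r = (v < 0) ? ~(~v >> count) : (v >> count);
    return static_cast<int16_t>(r);
}

// Hypothetical scalar model of one 16-bit lane of a packed logical
// right shift (the psrl/psrli family, e.g. _mm_srli_epi16).
// Zeros are shifted in; counts above 15 clear the lane.
uint16_t srl_lane16(uint16_t x, unsigned count) {
    if (count > 15)
        return 0;
    return static_cast<uint16_t>(x >> count);
}
```

With this model, srl_lane16(0xFFFF, 8) gives 0x00FF, which is the
"zeroones" pattern computed by the one match above, whereas
sra_lane16(-1, 8) stays at -1. A shift by zero returns the input
unchanged in both models.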

Honestly, I have no idea why this change could have caused a regression
in sphereflake.

On Sat, May 10, 2014 at 3:50 PM, Tobias Grosser <tobias at grosser.es> wrote:
> On 08/05/2014 19:44, Andrea Di Biagio wrote:
>>
>> Author: adibiagio
>> Date: Thu May  8 12:44:04 2014
>> New Revision: 208342
>>
>> URL: http://llvm.org/viewvc/llvm-project?rev=208342&view=rev
>> Log:
>> [X86] Add target specific combine rules to fold SSE2/AVX2 packed
>> arithmetic shift intrinsics.
>>
>> This patch teaches the backend how to combine packed SSE2/AVX2
>> arithmetic shift intrinsics.
>>
>> The rules are:
>>   - Always fold a packed arithmetic shift by zero to its first operand;
>>   - Convert a packed arithmetic shift intrinsic dag node into an ISD::SRA
>>     only if the shift count is known to be smaller than the vector
>>     element size.
>>
>> This patch also teaches function 'getTargetVShiftByConstNode' how to fold
>> target specific vector shifts by zero.
>>
>> Added two new tests to verify that the DAGCombiner is able to fold
>> sequences of SSE2/AVX2 packed arithmetic shift calls.
>
>
> Hi Andrea,
>
> I see an execution time regression from 3.4 seconds up to 6.9 seconds on
> my -O3 buildbot for SingleSource/Benchmarks/Misc-C++/Large/sphereflake
>
> http://llvm.org/perf/db_default/v4/nts/graph?plot.0=34.174.2&highlight_run=25587
>
> between commits: 208335 and 208346
>
> From a quick look through the commits I believe this is the commit that most
> likely has caused this regression. Any idea if this change could cause such
> a regression on an Intel(R) Xeon(R) CPU E5430  @ 2.66GHz system?
>
> Cheers,
> Tobias