[llvm] r200621 - LoopVectorizer: Enable unrolling of conditional stores and the load/store

Mon Feb 3 10:14:34 PST 2014

I cannot reproduce this regression on my side.

On a Sandy Bridge machine I don’t see a regression at -O3 (without -mavx). The only difference between the before and after IR in my test is that we now unroll one vectorized loop by two:

diff before.noavx.ll after.noavx.ll 
453c453
<   %n.vec = and i64 %2, 8589934588
---
>   %n.vec = and i64 %2, 8589934584
462,469c462,479
<   %5 = icmp sgt <4 x i32> %wide.load, <i32 4, i32 4, i32 4, i32 4>
<   %6 = select <4 x i1> %5, <4 x i32> <i32 4, i32 4, i32 4, i32 4>, <4 x i32> zeroinitializer
<   %7 = getelementptr inbounds [2048 x i32]* @b, i64 0, i64 %index
<   %8 = bitcast i32* %7 to <4 x i32>*
<   store <4 x i32> %6, <4 x i32>* %8, align 16
<   %index.next = add i64 %index, 4
<   %9 = icmp eq i64 %index.next, %n.vec
<   br i1 %9, label %middle.block, label %vector.body, !llvm.loop !18
---
>   %.sum13 = or i64 %index, 4
>   %5 = getelementptr [2048 x i32]* @a, i64 0, i64 %.sum13
>   %6 = bitcast i32* %5 to <4 x i32>*
>   %wide.load10 = load <4 x i32>* %6, align 16
>   %7 = icmp sgt <4 x i32> %wide.load, <i32 4, i32 4, i32 4, i32 4>
>   %8 = icmp sgt <4 x i32> %wide.load10, <i32 4, i32 4, i32 4, i32 4>
>   %9 = select <4 x i1> %7, <4 x i32> <i32 4, i32 4, i32 4, i32 4>, <4 x i32> zeroinitializer
>   %10 = select <4 x i1> %8, <4 x i32> <i32 4, i32 4, i32 4, i32 4>, <4 x i32> zeroinitializer
>   %11 = getelementptr inbounds [2048 x i32]* @b, i64 0, i64 %index
>   %12 = bitcast i32* %11 to <4 x i32>*
>   store <4 x i32> %9, <4 x i32>* %12, align 16
>   %.sum14 = or i64 %index, 4
>   %13 = getelementptr [2048 x i32]* @b, i64 0, i64 %.sum14
>   %14 = bitcast i32* %13 to <4 x i32>*
>   store <4 x i32> %10, <4 x i32>* %14, align 16
>   %index.next = add i64 %index, 8
>   %15 = icmp eq i64 %index.next, %n.vec
>   br i1 %15, label %middle.block, label %vector.body, !llvm.loop !18
479,480c489,490
<   %10 = load i32* %arrayidx, align 4, !tbaa !5
<   %cmp1 = icmp sgt i32 %10, 4
---
>   %16 = load i32* %arrayidx, align 4, !tbaa !5
>   %cmp1 = icmp sgt i32 %16, 4
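
To make the diff easier to read, here is a rough C model of what changed. It is only a sketch: the source loop (out[i] = in[i] > 4 ? 4 : 0) is my reading of the icmp/select pair in the IR, not copied from the benchmark, and the function names are made up for illustration. The two masks are the constants from the %n.vec lines; they round the trip count down to a multiple of 4 (before) or 8 (after), and the leftover iterations run in the scalar loop.

```c
#include <assert.h>
#include <stdint.h>

/* Trip-count masks copied from the two %n.vec lines in the diff. */
static uint64_t n_vec_before(uint64_t n) { return n & 8589934588ULL; } /* multiple of 4 */
static uint64_t n_vec_after(uint64_t n)  { return n & 8589934584ULL; } /* multiple of 8 */

/* Assumed source loop, inferred from the icmp sgt / select in the IR. */
static void select_scalar(const int32_t *in, int32_t *out, uint64_t n) {
    for (uint64_t i = 0; i < n; ++i)
        out[i] = in[i] > 4 ? 4 : 0;
}

/* Hypothetical model of the vector body plus scalar remainder.
 * step = 4 stands for the old body (one <4 x i32> per iteration),
 * step = 8 for the new one (two <4 x i32> halves per iteration). */
static void select_blocked(const int32_t *in, int32_t *out,
                           uint64_t n, uint64_t step) {
    assert(step != 0);
    uint64_t n_vec = n - n % step;            /* same value the mask yields */
    for (uint64_t index = 0; index < n_vec; index += step)
        for (uint64_t lane = 0; lane < step; ++lane)
            out[index + lane] = in[index + lane] > 4 ? 4 : 0;
    for (uint64_t i = n_vec; i < n; ++i)      /* scalar remainder loop */
        out[i] = in[i] > 4 ? 4 : 0;
}
```

Both step sizes produce the same stores as the scalar loop; the unrolled version just does two vectors' worth of work per back-edge and leaves up to seven iterations (instead of up to three) for the remainder loop.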

With -O3 -mavx I see a 4% improvement on my machine; in quite a few cases we now unroll more aggressively, by x4 instead of x2.

The test machine is an AMD processor, right? Maybe it has an issue with the loop above?

On Feb 3, 2014, at 8:59 AM, Hal Finkel <hfinkel at anl.gov> wrote:

> ----- Original Message -----
>> From: "Tobias Grosser" <tobias at grosser.es>
>> To: "Arnold Schwaighofer" <aschwaighofer at apple.com>, llvm-commits at cs.uiuc.edu
>> Sent: Monday, February 3, 2014 9:47:13 AM
>> Subject: Re: [llvm] r200621 - LoopVectorizer: Enable unrolling of conditional stores and the load/store
>> 
>> On 02/02/2014 04:12 AM, Arnold Schwaighofer wrote:
>>> Author: arnolds
>>> Date: Sat Feb  1 21:12:34 2014
>>> New Revision: 200621
>>> 
>>> URL: http://llvm.org/viewvc/llvm-project?rev=200621&view=rev
>>> Log:
>>> LoopVectorizer: Enable unrolling of conditional stores and the
>>> load/store unrolling heuristic per default
>>> 
>>> Benchmarking on x86_64 (thanks Chandler!) and ARM has shown those
>>> options speed up some benchmarks while not causing any interesting
>>> regressions.
>> 
>> Just for your info, this change caused the following test-suite
>> changes on X86:
>> 
>> Compile Time
>> 
>>                   Δ         Previous   Current
>> pairlocalalign    +1.47%    10.35      10.50
>> 
>> 
>> Execution Time
>> 
>>                   Δ         Previous   Current
>> gcc-loops         +4.59%    4.4903     4.6963
> 
> We should look at this one; it might be a fairly large regression in one of the loops in that benchmark.
> 
> -Hal
> 
>> 
>> linpack-pc        -11.84%   7.6845     6.7744
>> ControlFlow-dbl   -10.66%   5.1763     4.6243
>> airlocalalign     -1.06%    26.1356    25.8576
>> 
>> llvm.org/perf/db_default/v4/nts/21666?num_comparison_runs=10&aggregation_fn=median&compare_to=21661
>> 
>> 
>> Cheers,
>> Tobias
>> 
> 
> -- 
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory