[llvm-commits] [llvm-testresults] lab-mini-01__O3-plain__clang_DEV__x86_64 test results

Fri May 25 15:07:18 PDT 2012

On Fri, May 25, 2012 at 8:28 AM, Duncan Sands <baldrick at free.fr> wrote:

> This 44% performance regression was caused by my reassociate changes.  The
> reason is pretty interesting though.  I could do with some suggestions.
>
> > Performance Regressions - Execution Time
> >   Test:     SingleSource/Benchmarks/BenchmarkGame/puzzle
> >             <http://llvm.org/perf/db_default/v4/nts/789/graph?test.198=2>
> >   Δ         Previous  Current  σ       Δ (B)   σ (B)
> >   44.42%    0.4829    0.6974   0.0001  43.91%  0.0001
>
> The change to the optimized IR was:
>
>    %phi213.i = phi i32 [ %xor1.i, %for.body.i ], [ 0, %for.body.i.preheader ]
>    %indvars.iv.next.i = add i64 %indvars.iv.i, 1
>    %arrayidx.i = getelementptr inbounds i32* %call, i64 %indvars.iv.i
>    %0 = load i32* %arrayidx.i, align 4, !tbaa !0
>    %1 = trunc i64 %indvars.iv.next.i to i32
> -  %xor.i = xor i32 %0, %phi213.i
> -  %xor1.i = xor i32 %xor.i, %1
> +  %xor.i = xor i32 %1, %phi213.i
> +  %xor1.i = xor i32 %xor.i, %0
>    %exitcond = icmp eq i32 %1, 500001
>    br i1 %exitcond, label %findDuplicate.exit, label %for.body.i
>
> The old code computes
>   %phi213.i ^ %0 ^ %1
> while the new computes
>   %phi213.i ^ %1 ^ %0
> Here %0 is a load and %1 is a truncation.
>
> Since reassociate computes the same rank for %0 and %1, there is no reason to
> prefer one to the other - it's just a matter of chance which one you get, and
> the old code was luckier than the new.
>
> The reason for the big slowdown is in the different codegen:
>
> # phi213.i is in %ebx
>
> +       leaq    1(%rdx), %rsi
> +       xorl    %esi, %ebx
>         xorl    (%rax,%rdx,4), %ebx
> -       incq    %rdx
> -       xorl    %edx, %ebx
> -       cmpl    $500001, %edx           # imm = 0x7A121
> +       cmpl    $500001, %esi           # imm = 0x7A121
> +       movq    %rsi, %rdx
>
> I'm not sure why this codegen difference arises.  Any suggestions?
>

One largely uninformed idea is the dependency chains (in addition to the
leaq stuff mentioned by Jakob):

In the old code, we execute the load+xor first, and while it is in flight we
can increment rdx and execute the comparison with edx. Whenever the load+xor
lands, we can execute the second xor.

In the new code, the leaq must execute before the first xor, and the first
xor must execute before the second xor (the one with the load). That means we
may not be able to overlap as much of the load delay.
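
For anyone who wants to poke at this, here is a minimal source-level sketch of the two orders. It assumes the loop looks roughly like the IR above (each element xor'ed into an accumulator together with the index plus one). The function names and the use of std::vector are mine, not from the benchmark, and the optimizer is free to reassociate either form, so check the generated assembly rather than trusting the source order.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Old order: (acc ^ load) ^ index. The loaded value is folded in first.
// Hypothetical name; stands in for the loop shown in the IR above.
std::uint32_t xorLoadFirst(const std::vector<std::uint32_t> &data) {
  std::uint32_t acc = 0;
  for (std::size_t i = 0; i < data.size(); ++i) {
    // Corresponds to %1, the truncated incremented induction variable.
    std::uint32_t next = static_cast<std::uint32_t>(i + 1);
    acc = (acc ^ data[i]) ^ next;
  }
  return acc;
}

// New order: (acc ^ index) ^ load. The loaded value is folded in last.
std::uint32_t xorIndexFirst(const std::vector<std::uint32_t> &data) {
  std::uint32_t acc = 0;
  for (std::size_t i = 0; i < data.size(); ++i) {
    std::uint32_t next = static_cast<std::uint32_t>(i + 1);
    acc = (acc ^ next) ^ data[i];
  }
  return acc;
}
```

Both functions return the same value for any input, since xor is associative and commutative; only the order in which the operands join the chain differs.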

I'm not sure how to fix this, though. It's not clear that there is a good
place for reassociate to try to locate likely-to-load operations. Maybe it
could try to sink operands whose prerequisite computations are unrelated to
the rest of the chain toward the end of the chain?
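
To make that idea concrete, here is a toy sketch of a tie-break among equally ranked operands. Everything in it is hypothetical: the Operand struct, the isLoad flag, the rank values, and the ascending-rank order are illustrative only and do not mirror how the Reassociate pass represents or orders operands. It only shows the shape of "when ranks tie, fold the load in first and leave the other operand for the end of the chain".

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical operand record; not an LLVM data structure.
struct Operand {
  std::string name;
  unsigned rank; // illustrative rank value
  bool isLoad;   // true if the operand is produced by a load
};

// Returns the operands in the order they would be folded into the chain.
// Primary key: ascending rank (an assumption made for this sketch).
// Tie-break: among equal ranks, loads come before non-loads, so the
// non-load operand sinks toward the end of the chain.
std::vector<Operand> orderChain(std::vector<Operand> ops) {
  std::stable_sort(ops.begin(), ops.end(),
                   [](const Operand &a, const Operand &b) {
                     if (a.rank != b.rank)
                       return a.rank < b.rank;
                     return a.isLoad && !b.isLoad;
                   });
  return ops;
}
```

With the three operands from the IR above, and assuming the phi gets a lower rank than %0 and %1, this yields the order %phi213.i, %0, %1, which matches the old (faster) association. Whether such a tie-break helps in general is an open question.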

>
> If there is a fairly generic explanation for the different codegen, maybe the
> rank function can be tweaked to force the more effective order.
>
> Ciao, Duncan.
