[llvm-commits] [llvm-testresults] lab-mini-01__O3-plain__clang_DEV__x86_64 test results

Fri May 25 19:38:19 PDT 2012

On Fri, 25 May 2012 18:17:36 -0700
Andrew Trick <atrick at apple.com> wrote:

> On May 25, 2012, at 8:28 AM, Duncan Sands <baldrick at free.fr> wrote:
> 
> > This 44% performance regression was caused by my reassociate
> > changes.  The reason is pretty interesting though.  I could do with
> > some suggestions.
> > 
> >> Performance Regressions - Execution Time
> >> SingleSource/Benchmarks/BenchmarkGame/puzzle
> >> <http://llvm.org/perf/db_default/v4/nts/789/graph?test.198=2>
> >>   Δ       Previous  Current  σ       Δ (B)   σ (B)
> >>   44.42%  0.4829    0.6974   0.0001  43.91%  0.0001
> > 
> > The change to the optimized IR was:
> > 
> >    %phi213.i = phi i32 [ %xor1.i, %for.body.i ], [ 0, %for.body.i.preheader ]
> >    %indvars.iv.next.i = add i64 %indvars.iv.i, 1
> >    %arrayidx.i = getelementptr inbounds i32* %call, i64 %indvars.iv.i
> >    %0 = load i32* %arrayidx.i, align 4, !tbaa !0
> >    %1 = trunc i64 %indvars.iv.next.i to i32
> > -  %xor.i = xor i32 %0, %phi213.i
> > -  %xor1.i = xor i32 %xor.i, %1
> > +  %xor.i = xor i32 %1, %phi213.i
> > +  %xor1.i = xor i32 %xor.i, %0
> >    %exitcond = icmp eq i32 %1, 500001
> >    br i1 %exitcond, label %findDuplicate.exit, label %for.body.i
> > 
> > The old code computes
> >   %phi213.i ^ %0 ^ %1
> > while the new computes
> >   %phi213.i ^ %1 ^ %0
> > Here %0 is a load and %1 is a truncation.
> > 
> > Since reassociate computes the same rank for %0 and %1, there is no
> > reason to prefer one to the other - it's just a matter of chance
> > which one you get, and the old code was luckier than the new.
> > 
> > The reason for the big slowdown is in the different codegen:
> > 
> > # phi213.i is in %ebx
> > 
> > +       leaq    1(%rdx), %rsi
> > +       xorl    %esi, %ebx
> >         xorl    (%rax,%rdx,4), %ebx
> > -       incq    %rdx
> > -       xorl    %edx, %ebx
> > -       cmpl    $500001, %edx           # imm = 0x7A121
> > +       cmpl    $500001, %esi           # imm = 0x7A121
> > +       movq    %rsi, %rdx
> > 
> > I'm not sure why this codegen difference arises.  Any suggestions?
> > 
> > If there is a fairly generic explanation for the different codegen,
> > maybe the rank function can be tweaked to force the more effective
> > order.
> 
> Hello Duncan,
> 
> This is a micro-architectural glass jaw that we should not attempt to
> compensate for in the optimizer. For example, moving the loaded value
> upward in the dependence chain is not generally a good thing. If this
> particular case mattered enough, we would need to specifically target
> the problem in codegen, ideally between coalescing and scheduling
> where we know how many cycles the loop will take and which resources
> are available in those cycles. At that point we could reassociate the
> xors, unfold the load to expose a coalescing opportunity. Or we could
> simply sink the copy across the loop back to allow fusing the cmp+jne.
> 
> But on to my real point. I think it's important not to arbitrarily
> reassociate, or otherwise canonicalize, unless the canonical form is
> clearly superior in exposing real optimization.
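
For reference, a minimal sketch of why the two xor orders quoted above are
interchangeable at the IR level, and why a rank tie leaves the choice to
chance. The sample values and the `rank` table are assumptions made up for
illustration; this is not Reassociate's actual rank function, only a toy
stable sort keyed on rank.

```python
from functools import reduce
from itertools import permutations
from operator import xor


def xor_chain(operands):
    """Fold a sequence of values with xor, left to right, as a 32-bit result."""
    return reduce(xor, operands) & 0xFFFFFFFF


# Hypothetical sample values standing in for %phi213.i, %0 (the load)
# and %1 (the truncated induction variable).
phi, loaded, truncated = 0x1234ABCD, 0x0F0F0F0F, 500001

old_order = xor_chain([phi, loaded, truncated])  # %phi213.i ^ %0 ^ %1
new_order = xor_chain([phi, truncated, loaded])  # %phi213.i ^ %1 ^ %0

# xor is associative and commutative, so every operand order agrees.
all_results = {xor_chain(p) for p in permutations([phi, loaded, truncated])}

# Toy ranks (assumed): %0 and %1 tie, so a stable sort simply keeps
# whatever order the operands happened to arrive in.
rank = {"%0": 3, "%1": 3}
order_a = sorted(["%0", "%1"], key=rank.get)
order_b = sorted(["%1", "%0"], key=rank.get)
```

Both orders compute the same value, so nothing in the IR distinguishes them;
only the generated code differs.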

Andy,

What do you mean by 'clearly'? We'd need to define some metric for this,
and I'm not sure what that should be.

This interests me because I also need some procedure for
reassociating in order to have basic-block vectorization do something
interesting for reductions. To start, I'd want a+b+c+d+e+f+g+h,
regardless of the original association, to be transformed into:
(a+b)+(c+d)+(e+f)+(g+h) or (a+b+c+d)+(e+f+g+h) [the number of groups
should depend on the target's vector length, and maybe some other
things as well].
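
A rough sketch of the regrouping described above, written in Python purely
for illustration. The helper names `regroup` and `show` are hypothetical,
and the group count is passed in by hand here rather than derived from a
target's vector length.

```python
def regroup(operands, groups):
    """Split a flat reduction into `groups` contiguous, equally sized chunks."""
    if groups <= 0 or len(operands) % groups != 0:
        raise ValueError("operand count must be a multiple of the group count")
    size = len(operands) // groups
    return [operands[i:i + size] for i in range(0, len(operands), size)]


def show(chunks):
    """Render chunks of operand names as a parenthesized sum."""
    return "+".join("(" + "+".join(chunk) + ")" for chunk in chunks)


names = list("abcdefgh")
four_groups = show(regroup(names, 4))
two_groups = show(regroup(names, 2))

# For integers the regrouped sum matches the flat sum.
values = [1, 2, 3, 4, 5, 6, 7, 8]
partial_sums = [sum(chunk) for chunk in regroup(values, 4)]
```

Here `four_groups` is the (a+b)+(c+d)+(e+f)+(g+h) form and `two_groups` is
the (a+b+c+d)+(e+f+g+h) form.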

I'm not sure whether I should try to bake this into Reassociate, or
refactor Reassociate so that parts of it can be used by BBVectorize, or
something else. Do you have an opinion?

Also, regarding Jakob's point about the µops and the scheduling, should
we try to teach the instruction scheduler about µop fusing? Would this
be as simple as preferring a next instruction fusable with the current
one, or would we need to take reordering into account as well?
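
To make the question concrete, here is a toy sketch, not anything taken
from LLVM's schedulers. It assumes the body instructions are mutually
independent, that the branch must stay last, and that ("cmp", "jne") is the
only fusable pair; the `FUSABLE` table and both helper names are
hypothetical.

```python
# Hypothetical fusable pairs (first opcode, second opcode); real targets
# define their own rules.
FUSABLE = {("cmp", "jne")}


def fuse_aware_order(body, terminator):
    """Place the instruction that fuses with `terminator` right before it.

    Assumes every instruction in `body` is independent of the others,
    so any permutation of `body` is legal.
    """
    partner = next(
        (inst for inst in reversed(body) if (inst, terminator) in FUSABLE),
        None,
    )
    if partner is None:
        return list(body) + [terminator]
    rest = list(body)
    rest.remove(partner)
    return rest + [partner, terminator]


def fused_pairs(order):
    """Count adjacent instruction pairs that could fuse."""
    return sum((a, b) in FUSABLE for a, b in zip(order, order[1:]))


as_emitted = ["cmp", "mov", "jne"]  # the copy sits between cmp and jne
reordered = fuse_aware_order(["cmp", "mov"], "jne")
```

Because the branch has to stay last, preferring a fusable successor after
emitting the cmp is not enough in this toy; the cmp itself has to be delayed
until just before the branch.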

Thanks again,
Hal

> You say that you've
> made an arbitrary decision to select one form over another. In that
> situation, we should try hard to preserve the original expression.
> Two reasons for this:
> 
> (1) We lose information about intermediate values. This means we have
> to throw away any value-specific annotations: NSW/NUW flags, debug
> information, things like value profile if we had it. We have a
> serious problem already when the Reassociate pass drops NSW flags,
> inhibiting important optimization.
> 
> (2) We introduce arbitrary performance variations, as you just
> noticed, which take a lot of time to track down. It becomes harder to
> provide hints to the compiler to guide codegen.
> 
> A while back, I was planning to rewrite Reassociate to preserve flags
> when possible. That fell by the wayside, but I'm becoming concerned
> that it will be harder to fix now that the pass is becoming more
> sophisticated based on the old design of throwing away the original
> expression. If there's any way you can think of having Reassociate
> bias expressions toward their original form, that would be helpful.
> 
> -Andy
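
A small worked example of point (1) above, using signed 8-bit arithmetic
to keep the numbers readable. The values are chosen by hand and `add_i8`
is a hypothetical helper. It only shows that an intermediate sum can
overflow after reassociation even when no intermediate overflowed in the
original order, which is why a no-signed-wrap flag cannot simply be carried
over.

```python
def add_i8(x, y):
    """Return (wrapped result, overflowed) for signed 8-bit addition."""
    exact = x + y
    wrapped = (exact + 128) % 256 - 128
    return wrapped, wrapped != exact


a, b, c = 100, 100, -100

# Original association: (a + c) + b. No intermediate overflows.
t1, overflow_inner_orig = add_i8(a, c)
r1, overflow_outer_orig = add_i8(t1, b)

# Reassociated: (a + b) + c. The inner add now overflows.
t2, overflow_inner_new = add_i8(a, b)
r2, overflow_outer_new = add_i8(t2, c)
```

The wrapped final results agree, but only the original order is free of
signed overflow in its intermediate values.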

-- 
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory
