[llvm-commits] [llvm-testresults] lab-mini-01__O3-plain__clang_DEV__x86_64 test results

Duncan Sands baldrick at free.fr
Sat May 26 00:39:47 PDT 2012


Hi Andrew,

On 26/05/12 03:17, Andrew Trick wrote:
> On May 25, 2012, at 8:28 AM, Duncan Sands<baldrick at free.fr>  wrote:
>
>> This 44% performance regression was caused by my reassociate changes.  The
>> reason is pretty interesting though.  I could do with some suggestions.
>>
>>> Performance Regressions - Execution Time 	Δ	Previous	Current	σ 	Δ (B) 	σ (B)
>>> SingleSource/Benchmarks/BenchmarkGame/puzzle
>>> <http://llvm.org/perf/db_default/v4/nts/789/graph?test.198=2>	44.42%	0.4829
>>> 0.6974	0.0001	43.91%	0.0001
>>
>> The change to the optimized IR was:
>>
>>     %phi213.i = phi i32 [ %xor1.i, %for.body.i ], [ 0, %for.body.i.preheader ]
>>     %indvars.iv.next.i = add i64 %indvars.iv.i, 1
>>     %arrayidx.i = getelementptr inbounds i32* %call, i64 %indvars.iv.i
>>     %0 = load i32* %arrayidx.i, align 4, !tbaa !0
>>     %1 = trunc i64 %indvars.iv.next.i to i32
>> -  %xor.i = xor i32 %0, %phi213.i
>> -  %xor1.i = xor i32 %xor.i, %1
>> +  %xor.i = xor i32 %1, %phi213.i
>> +  %xor1.i = xor i32 %xor.i, %0
>>     %exitcond = icmp eq i32 %1, 500001
>>     br i1 %exitcond, label %findDuplicate.exit, label %for.body.i
>>
>> The old code computes
>>    %phi213.i ^ %0 ^ %1
>> while the new computes
>>    %phi213.i ^ %1 ^ %0
>> Here %0 is a load and %1 is a truncation.
>>
>> Since reassociate computes the same rank for %0 and %1, there is no reason to
>> prefer one to the other - it's just a matter of chance which one you get, and
>> the old code was luckier than the new.
>>
>> The reason for the big slowdown is in the different codegen:
>>
>> # phi213.i is in %ebx
>>
>> +       leaq    1(%rdx), %rsi
>> +       xorl    %esi, %ebx
>>          xorl    (%rax,%rdx,4), %ebx
>> -       incq    %rdx
>> -       xorl    %edx, %ebx
>> -       cmpl    $500001, %edx           # imm = 0x7A121
>> +       cmpl    $500001, %esi           # imm = 0x7A121
>> +       movq    %rsi, %rdx
>>
>> I'm not sure why this codegen difference arises.  Any suggestions?
>>
>> If there is a fairly generic explanation for the different codegen, maybe the
>> rank function can be tweaked to force the more effective order.
>
> Hello Duncan,
>
> This is a micro-architectural glass jaw that we should not attempt to compensate for in the optimizer. For example, moving the loaded value upward in the dependence chain is not generally a good thing. If this particular case mattered enough, we would need to specifically target the problem in codegen, ideally between coalescing and scheduling where we know how many cycles the loop will take and which resources are available in those cycles. At that point we could reassociate the xors, unfold the load to expose a coalescing opportunity. Or we could simply sink the copy across the loop back to allow fusing the cmp+jne.

thanks for the explanation, I will now try to understand it :)

> But on to my real point. I think it's important not to arbitrarily reassociate, or otherwise canonicalize, unless the canonical form is clearly superior in exposing real optimization. You say that you've made an arbitrary decision to select one form over another. In that situation, we should try hard to preserve the original expression.

The situation is completely the other way round: the old pass arbitrarily
flipped the order of the equal rank values, while my new version carefully
preserves their order in the IR.  Unfortunately, the original order is the
one which results in worse code!
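
To illustrate the tie-breaking issue, here is a minimal sketch (in Python, and emphatically not the actual LLVM Reassociate code) of how sorting operands by rank leaves the relative order of equal-rank operands entirely to whatever order the pass happened to visit them in:

```python
# Minimal sketch (not the real LLVM implementation) of rank-based operand
# ordering.  Names and ranks are illustrative only.

def reassociate(operands):
    """Sort (name, rank) operands of a commutative chain by rank.

    Python's sort is stable, so operands with equal rank keep their
    incoming order -- whichever order the traversal produced.
    """
    return sorted(operands, key=lambda op: op[1])

# The phi gets the lowest rank; the load %0 and the truncation %1 get the
# same rank, so their relative order depends only on the visit order.
old_visit = [("%0", 2), ("%phi213.i", 1), ("%1", 2)]
new_visit = [("%1", 2), ("%phi213.i", 1), ("%0", 2)]

print([name for name, _ in reassociate(old_visit)])  # phi first, then %0, %1
print([name for name, _ in reassociate(new_visit)])  # phi first, then %1, %0
```

Same ranks, same leaves, but the emitted xor chain differs, which is exactly the arbitrary difference that showed up in the codegen above.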

> Two reasons for this:
>
> (1) We lose information about intermediate values. This means we have to throw away any value-specific annotations: NSW/NUW flags, debug information, things like value profile if we had it. We have a serious problem already when the Reassociate pass drops NSW flags, inhibiting important optimization.

I didn't mention it in the commit log, but the new version is much less
destructive and tries harder to preserve nsw flags: if the rewritten
expression is the same as the original (modulo trivial changes) then all
nsw etc. flags are preserved.  The previous version would only preserve
flags if the expression involved a single operation (e.g. a+b, but not a+b+c).
The old pass would drop all uses of leaves by writing undef into the parent's
operands; the new version only writes an undef in order to force a tree
topology onto the expression, for example zapping one leg of a diamond use
pattern.  If the expression already has the form reassociate creates (a linear
sequence of operations), or indeed any form that is a tree, then there is no
destruction at this stage.  When writing out the new optimized expression
(which is written in place on top of the old), this makes it easy to tell
whether anything is actually changing: just compare the new operands with the
old ones, which are all still present at that point.  The rewriting stage
takes care to ensure that if the original expression was already in the
desired form then literally no changes are made when writing out the new one.
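
The "rewrite in place and detect no-ops" idea can be sketched like this (a toy model, assuming nothing beyond what is described above; the real pass of course works on LLVM instructions, not strings):

```python
# Hypothetical sketch of in-place rewriting with no-op detection.  Because
# the old operands are still present while the new expression is written
# over them, a direct comparison tells us whether anything changed; if
# nothing did, the original nsw/nuw flags can be kept verbatim.

def rewrite_in_place(old_ops, new_ops, had_nsw):
    """Return (final_ops, nsw_preserved) for a rewritten expression."""
    changed = old_ops != new_ops
    # Flags survive only when the rewrite is literally a no-op.
    return new_ops, (had_nsw and not changed)

ops, nsw = rewrite_in_place(["a", "b", "c"], ["a", "b", "c"], had_nsw=True)
print(nsw)  # True: expression unchanged, flags preserved

ops, nsw = rewrite_in_place(["a", "b", "c"], ["b", "a", "c"], had_nsw=True)
print(nsw)  # False: operand order changed, flags conservatively dropped
```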

> (2) We introduce arbitrary performance variations, as you just noticed, which take a lot of time to track down. It becomes harder to provide hints to the compiler to guide codegen.
>
> A while back, I was planning to rewrite Reassociate to preserve flags when possible. That fell by the wayside, but I'm becoming concerned that it will be harder to fix now that the pass is becoming more sophisticated based on the old design of throwing away the original expression. If there's any way you can think of having Reassociate bias expressions toward their original form, that would be helpful.

As mentioned above, I quietly took care of this in my commit.  The pass has
become more sophisticated, but it is now also conceptually simpler.  Previously
the linearization stage tried to deal with tricky expressions by locally
"rotating" parts of the expression, which introduced some pretty arbitrary
changes and was only effective in simple cases.  If the expression forms a
tree, the new code just does a depth-first search gathering the leaves.  If
the expression does not form a tree, it still in effect does a depth-first
search; it just also keeps track of how many different ways it found to reach
each leaf.
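
The leaf-gathering step can be sketched as follows (assumed structure only, not the actual LLVM code; op nodes are modelled as tuples):

```python
# Sketch of the linearization described above: gather the leaves of an
# expression by depth-first search; when the expression is a DAG rather
# than a tree, count how many distinct paths reach each leaf.
from collections import Counter

def gather_leaves(node, is_op):
    """Return a Counter mapping each leaf to the number of paths to it."""
    leaves = Counter()
    worklist = [(node, 1)]          # (node, number of paths to this node)
    while worklist:
        n, paths = worklist.pop()
        if is_op(n):
            for child in n[1:]:     # an op node is ("xor", lhs, rhs)
                worklist.append((child, paths))
        else:
            leaves[n] += paths
    return leaves

# %phi ^ %0 ^ %1 as a left-leaning tree: xor(xor(%phi, %0), %1)
expr = ("xor", ("xor", "%phi", "%0"), "%1")
print(gather_leaves(expr, lambda n: isinstance(n, tuple)))
# each leaf reached exactly once

# Diamond: t = a ^ b used twice, so t's leaves are reached by two paths.
t = ("xor", "a", "b")
diamond = ("xor", t, t)
print(gather_leaves(diamond, lambda n: isinstance(n, tuple)))
# a and b each reached twice
```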

If you look at the problematic IR being discussed here, you can see that the
original expression did form a tree, but the old code for some reason reordered
part of it, while the new code did not.

Ciao, Duncan.


