[llvm-dev] Is there a reason for swapping SCEVs in the call to getMinusSCEV from CollectFixupsAndInitialFormulae (all targets) ?
Neil Nelson via llvm-dev
llvm-dev at lists.llvm.org
Sat Jun 15 07:28:25 PDT 2019
Too bad this did not go anywhere. I was looking forward to seeing this
mystery resolved.
On 6/12/19 3:58 PM, Joan Lluch via llvm-dev wrote:
> Hi all,
>
> In the LSRInstance::CollectFixupsAndInitialFormulae() implementation, the
> function getMinusSCEV is called at a particular point as part of the
> code that replaces/normalises 'x == y’ ICmp instructions with 'x - y == 0’.
> However, the actual code replaces them with ‘y - x == 0’ instead.
>
> This is the excerpt of code from the
> LSRInstance::CollectFixupsAndInitialFormulae function that I am referring to:
>
>     // x == y --> x - y == 0
>     const SCEV *N = SE.getSCEV(NV);
>     if (SE.isLoopInvariant(N, L) && isSafeToExpand(N, SE)) {
>       // S is normalized, so normalize N before folding it into S
>       // to keep the result normalized.
>       N = normalizeForPostIncUse(N, TmpPostIncLoops, SE);
>       Kind = LSRUse::ICmpZero;
>       S = SE.getMinusSCEV(N, S);  // use instead S = SE.getMinusSCEV(S, N);
>     }
>
>
> [ the comment “use instead" after the getMinusSCEV call is mine ]
>
>
> This apparently unimportant difference has a remarkable effect later
> on, when the optimal solution for the loop is computed in
> LSRInstance::SolveRecurse.
>
> In the function LSRInstance::SolveRecurse, given any two loop variants
> that are found to have the same cost, the first one is always chosen.
> This implies that the initial formula that was created in
> LSRInstance::CollectFixupsAndInitialFormulae() is the one that will
> prevail by default (given the same cost). The point that I want to
> highlight is that the current implementation of
> CollectFixupsAndInitialFormulae reverses the SCEVs found in the user
> source code rather than leaving them as found. This causes the
> following problem:
> Consider the following code (please don’t mind whether it does
> anything useful; it is just to show the point):
>
> int asr_Lowert( int a, unsigned ammount )
> {
>   if ( (ammount -= 8) >= 0 )
>     a = a >> 8;
>
>   while ( ammount-- )
>     a >>= 1;
>
>   return a;
> }
>
> The interesting part for this discussion is the ‘while’ loop. It gets
> compiled by Clang like this:
>
> while.cond:                             ; preds = %while.body, %entry
>   %a.addr.1 = phi i32 [ %shr, %entry ], [ %shr1, %while.body ]
>   %ammount.addr.0 = phi i32 [ %sub, %entry ], [ %dec, %while.body ]
>   %tobool = icmp eq i32 %ammount.addr.0, 0
>   br i1 %tobool, label %while.end, label %while.body
>
> while.body:                             ; preds = %while.cond
>   %dec = add i32 %ammount.addr.0, -1
>   %shr1 = ashr i32 %a.addr.1, 1
>   br label %while.cond
>
>
> Clang mostly follows the source code and produces IR that resembles
> it. It is important to note that the variable ‘ammount’ is decremented
> by 1 on each loop iteration, as in the user source code.
>
> I am going to choose the x86 target because I really want this subject
> to gain some attention, but please note that this is a general issue
> that affects ALL targets. So, for x86, the resulting assembly code
> is this:
>
> .macosx_version_min 10, 12
> .globl _asr_Lowert
> _asr_Lowert:
> .cfi_startproc
>   pushl %ebp
> .cfi_def_cfa_offset 8
> .cfi_offset %ebp, -8
>   movl  %esp, %ebp
> .cfi_def_cfa_register %ebp
>   movl  8(%ebp), %eax
>   sarl  $8, %eax
>   pushl $8
>   popl  %ecx
>   subl  12(%ebp), %ecx
>   jmp   LBB0_1
> LBB0_2:
>   sarl  %eax
>   incl  %ecx
> LBB0_1:
>   testl %ecx, %ecx
>   jne   LBB0_2
>   popl  %ebp
>   retl
> .cfi_endproc
>
>
> What I want to show is that ‘ammount’ is now incremented instead of
> decremented. The compiler actually performs the oddity of setting the
> initial value of the ‘while’ variable to '8-ammount’ rather than
> ‘ammount-8’, and it even needs a ‘push’ and a ‘pop’ as setup for that.
> This is all because the SCEVs were swapped in the LLVM implementation
> of LSRInstance::CollectFixupsAndInitialFormulae().
>
> Now, let me show what happens if we implement the function as
> suggested in the comment, that is, by replacing the current call with
> S = SE.getMinusSCEV(S, N);
>
> This is what we get now from LLVM for the same input:
>
> .macosx_version_min 10, 12
> .globl _asr_Lowert
> _asr_Lowert:
> .cfi_startproc
>   pushl %ebp
> .cfi_def_cfa_offset 8
> .cfi_offset %ebp, -8
>   movl  %esp, %ebp
> .cfi_def_cfa_register %ebp
>   movl  8(%ebp), %eax
>   movl  12(%ebp), %ecx
>   addl  $-8, %ecx
>   sarl  $8, %eax
>   jmp   LBB0_1
> LBB0_2:
>   decl  %ecx
>   sarl  %eax
> LBB0_1:
>   testl %ecx, %ecx
>   jne   LBB0_2
>   popl  %ebp
>   retl
> .cfi_endproc
>
>
> The code now uses a decrement, and the initial ‘while’ variable value
> is ‘ammount-8’, as we would expect from the user source code. The whole
> function is also shorter, and it does not use potentially costly
> push/pop instructions.
>
> I also want to point out that this is not an isolated case at all. It
> happens all the time for small loops where the costs of several
> variants are the same. In all such cases, decrements in the user source
> code are replaced by increments, with the necessarily strange setup
> code shown in the first example above, and increments are replaced by
> decrements, which may also force extra code generation because the loop
> induction variable can no longer be used for other purposes. The
> remarkable thing is that by swapping the SCEV expressions as we
> currently do, the compiler stops taking the user source code into
> account as a hint on what to do. It may therefore produce worse code in
> circumstances where the user has already chosen a naturally better
> implementation, which the compiler destroys by artificially replacing
> it with a different solution of supposedly identical cost.
>
> Also, although this notably affects small loops with a few low-cost
> variants, it also affects more complex loops that may have two or more
> optimal solutions. Among the optimal solutions of the same cost, the
> algorithm will always choose the first one in the list, so again the
> one that is chosen may be reversed with respect to the user code,
> unless it is a genuinely different solution.
>
> I do not understand the reason behind this reversal of loop
> expressions, and that is the question I am posting. I understand that
> we may want to always favour decrements (perhaps for performance
> reasons on some targets), or that we may want to prioritise increments
> (perhaps to keep induction variables usable in addressing modes on some
> targets), or even that we may want to do different things for different
> targets. But I do not quite understand the current implementation. It
> almost seems like a typo in the LLVM source code that nobody has come
> across before. I have kept my local copy of LLVM with the code set as
> in the comment for some time, and I have yet to find a case where it
> produces worse code than the original.
>
> So, I am really interested in starting some informed discussion about
> this.
>
> John Lluch
>
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev