[llvm-dev] Is there a reason for swapping SCEVs in the call to getMinusSCEV from CollectFixupsAndInitialFormulae (all targets) ?
Neil Nelson via llvm-dev
llvm-dev at lists.llvm.org
Sat Jun 15 07:28:25 PDT 2019
Too bad this did not go anywhere. I was looking forward to seeing this
mystery resolved.
On 6/12/19 3:58 PM, Joan Lluch via llvm-dev wrote:
> Hi all,
>
> In the LSRInstance::CollectFixupsAndInitialFormulae() implementation, the
> function getMinusSCEV is called at a particular point as part of the
> code that replaces/normalises 'x == y’ ICmp instructions with 'x - y == 0’.
> However, the actual code replaces them with ‘y - x == 0’ instead.
>
> This is the excerpt of code from the
> LSRInstance::CollectFixupsAndInitialFormulae function that I am referring to:
>
>     // x == y --> x - y == 0
>     const SCEV *N = SE.getSCEV(NV);
>     if (SE.isLoopInvariant(N, L) && isSafeToExpand(N, SE)) {
>       // S is normalized, so normalize N before folding it into S
>       // to keep the result normalized.
>       N = normalizeForPostIncUse(N, TmpPostIncLoops, SE);
>       Kind = LSRUse::ICmpZero;
>       S = SE.getMinusSCEV(N, S);  // use instead S = SE.getMinusSCEV(S, N);
>     }
>
>
> [ the comment “use instead" after the getMinusSCEV call is mine ]
>
>
> This apparently unimportant difference has a remarkable effect later
> on, when the optimal solution for the loop is computed in
> LSRInstance::SolveRecurse.
>
> In the function LSRInstance::SolveRecurse, given any two loop variants
> that are found to have the same cost, the first one is always chosen.
> This implies that the initial formula that was created in
> LSRInstance::CollectFixupsAndInitialFormulae() is the one that will
> prevail by default (given the same cost). The point that I want to
> highlight is that the current implementation of
> CollectFixupsAndInitialFormulae reverses the SCEVs found in the user
> source code rather than leaving them as found. This causes the
> following problem:
> Consider the following code (please don’t mind whether it does
> anything useful; it is just to show the point):
>
> int asr_Lowert( int a, unsigned ammount )
> {
>   if ( (ammount -= 8) >= 0 )
>     a = a >> 8;
>
>   while ( ammount-- )
>     a >>= 1;
>
>   return a;
> }
>
> The interesting part for this discussion is the ‘while’ loop. It gets
> compiled by Clang like this:
>
> while.cond:                             ; preds = %while.body, %entry
>   %a.addr.1 = phi i32 [ %shr, %entry ], [ %shr1, %while.body ]
>   %ammount.addr.0 = phi i32 [ %sub, %entry ], [ %dec, %while.body ]
>   %tobool = icmp eq i32 %ammount.addr.0, 0
>   br i1 %tobool, label %while.end, label %while.body
>
> while.body:                             ; preds = %while.cond
>   %dec = add i32 %ammount.addr.0, -1
>   %shr1 = ashr i32 %a.addr.1, 1
>   br label %while.cond
>
>
> Clang mostly follows the source code and produces IR that resembles
> it. It is important to note that the variable ‘ammount’ is decremented
> by 1 on each loop iteration, as in the user source code.
>
> I am going to choose the x86 target because I really want this subject
> to gain some attention, but please note that this is a general issue
> that affects ALL targets. So, for x86, the resulting assembly code
> is this:
>
> .macosx_version_min 10, 12
> .globl _asr_Lowert
> _asr_Lowert:
> .cfi_startproc
>   pushl %ebp
> .cfi_def_cfa_offset 8
> .cfi_offset %ebp, -8
>   movl  %esp, %ebp
> .cfi_def_cfa_register %ebp
>   movl  8(%ebp), %eax
>   sarl  $8, %eax
>   pushl $8
>   popl  %ecx
>   subl  12(%ebp), %ecx
>   jmp   LBB0_1
> LBB0_2:
>   sarl  %eax
>   incl  %ecx
> LBB0_1:
>   testl %ecx, %ecx
>   jne   LBB0_2
>   popl  %ebp
>   retl
> .cfi_endproc
>
>
> What I want to show is that ‘ammount’ is now incremented instead of
> decremented. The compiler actually performs the oddity of setting the
> initial value of the ‘while’ variable to '8-ammount’ rather than
> ‘ammount-8’, and it even needs a ‘push’ and a ‘pop’ as setup for that.
> This is all because the SCEVs were swapped in the LLVM implementation
> of LSRInstance::CollectFixupsAndInitialFormulae().
>
> Now, let me show what happens if we implement the function as
> suggested in the comment, that is, by replacing the current call with
> S = SE.getMinusSCEV(S, N);
>
> This is what we get now from LLVM for the same input:
>
> .macosx_version_min 10, 12
> .globl _asr_Lowert
> _asr_Lowert:
> .cfi_startproc
>   pushl %ebp
> .cfi_def_cfa_offset 8
> .cfi_offset %ebp, -8
>   movl  %esp, %ebp
> .cfi_def_cfa_register %ebp
>   movl  8(%ebp), %eax
>   movl  12(%ebp), %ecx
>   addl  $-8, %ecx
>   sarl  $8, %eax
>   jmp   LBB0_1
> LBB0_2:
>   decl  %ecx
>   sarl  %eax
> LBB0_1:
>   testl %ecx, %ecx
>   jne   LBB0_2
>   popl  %ebp
>   retl
> .cfi_endproc
>
>
> The code now uses a decrement, and the initial ‘while’ variable value
> is ‘ammount-8’, as we would expect from the user source code. The whole
> function is also shorter, and it does not use potentially costly
> push/pop instructions.
>
> I also want to point out that this is not an isolated case at all. It
> happens all the time for small loops where the costs of several
> variants are the same. In all such cases, decrements in the user source
> code are replaced by increments, with the necessarily strange setup
> code shown in the first example above, and increments are replaced by
> decrements, which may also force extra code generation because the loop
> induction variable can no longer be used for other purposes. The
> remarkable thing is that by swapping the SCEV expressions as we
> currently do, the compiler stops taking the user source code into
> account as a hint on what to do. It may therefore produce worse code in
> circumstances where the user has already chosen a naturally better
> implementation, which the compiler destroys by artificially replacing
> it with a different solution of supposedly identical cost.
>
> Also, although this notably affects small loops with a few low-cost
> variants, it also affects more complex loops that may have two or more
> optimal solutions. Among the optimal solutions of the same cost, the
> algorithm will always choose the first one in the list, so again the
> one that is chosen may be reversed with respect to the user code,
> unless it is a genuinely different solution.
>
> I do not understand the reason behind this reversal of loop
> expressions, and that is the question I am posting. I understand that
> we may want to always favour decrements (perhaps for performance
> reasons on some targets), or that we may want to prioritise increments
> (perhaps to keep induction variables usable in addressing modes on some
> targets), or even that we may want to do different things for different
> targets. But I do not quite understand the current implementation. It
> almost seems like a typo in the LLVM source code that nobody has come
> across before. I have kept my local copy of LLVM with the code set as
> in the comment for some time, and I have yet to find a case where it
> produces worse code than the original.
>
> So, I am really interested in starting some informed discussion about
> this.
>
> John Lluch
>
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev