[llvm-commits] Patch to improve loop-simplify

Andrew Clinton ajclinto at gmail.com
Sun Nov 4 16:14:25 PST 2012


I'd just like to clarify, as stated in the initial submission of this
patch, that I'm NOT disabling the split of inner/outer loops in all
cases - only in cases where the separation results in an outer loop
that has no exit.  This is, I believe, a conclusive approach to
identifying when the original heuristic was misidentifying control
flow in a loop as a nested loop.  If there actually were two nested
loops, there would have been an exit from both the proposed inner and
outer loop blocks, and splitting into an inner/outer loop may result
in improved optimization opportunities as you describe - and my patch
does nothing to prevent this.

Andrew

On Sat, Nov 3, 2012 at 1:13 PM, Andrew Clinton <ajclinto at gmail.com> wrote:
> It's not clear to me how LICM would provide any benefit in hoisting
> redundancies out of conditional control flow - isn't the point of LICM
> to extract expressions from a loop so they can be executed once
> outside the loop?  If the contents of the loop are already expressed
> with conditionals, there wouldn't be any benefit to this optimization.
>
> If there was an invariant in the original loop that could be extracted
> by LICM, I can't see that splitting into an outer/inner loop would
> make this any easier to find.
>
> Do you have a test case that I could use to verify?
>
> Andrew
>
> On Fri, Nov 2, 2012 at 4:53 PM, Dan Gohman <dan433584 at gmail.com> wrote:
>> It's possible that I've missed something, but it's still not clear to me
>> that this patch is desirable. You proposed that conditionals are better than
>> loops, but that's not obviously true. As I mentioned before, consider
>> something like LICM; if the compiler converts multiple backedges into
>> multiple loops, there may be opportunities for hoisting invariants from the
>> inner loop. If the compiler leaves it as a conditional inside of a loop,
>> there's no easy way to eliminate similar redundancies. This is a case where
>> loops are demonstrably easier to analyze.
>>
>> Dan
>>
>> On Wed, Oct 31, 2012 at 3:42 PM, Andrew Clinton <ajclinto at gmail.com> wrote:
>>>
>>> Since I don't seem to be getting any responses, I wonder if some
>>> developers could chime in and let me know if there is something else I
>>> need to do do have my patch accepted?
>>>
>>> I believe I have followed the submission guidelines - used the correct
>>> coding style, added regression tests, and explained a few times the
>>> benefit of the patch, all of which have taken a few days of work.  It
>>> would be nice if this work could be rewarded with the patch being
>>> added to the software :)
>>>
>>> Anyone?
>>>
>>> On Tue, Oct 30, 2012 at 9:56 PM, Andrew Clinton <ajclinto at gmail.com>
>>> wrote:
>>> > I've updated the patch to work against the latest tree (see attached).
>>> >  There was also a related discussion on the llvm-dev list "SimplifyCFG
>>> > vs loops" where I've put some background on what this is needed for.
>>> >
>>> > On Fri, Jul 15, 2011 at 3:02 PM, Andrew Clinton <ajclinto at gmail.com>
>>> > wrote:
>>> >> Do you think this patch could be included in the current head?
>>> >>
>>> >> Andrew
>>> >>
>>> >> On Sun, Mar 13, 2011 at 8:50 PM, Andrew Clinton <ajclinto at gmail.com>
>>> >> wrote:
>>> >>> The updated patch (attached) eliminates the extra jmp in the generated
>>> >>> machine code.  It does this by inserting the backedge block directly
>>> >>> before the loop header rather than after the last backedge block.
>>> >>> This will result in basically the same block placement as if an outer
>>> >>> loop were extracted, but will not introduce a new loop.
>>> >>>
>>> >>> Andrew
>>> >>>
>>> >>> On Wed, Mar 9, 2011 at 9:33 AM, Andrew Clinton <ajclinto at gmail.com>
>>> >>> wrote:
>>> >>>> Here's a test case to show the unexpected introduction of nested
>>> >>>> loops:
>>> >>>>
>>> >>>> bool test_cond();
>>> >>>> int test()
>>> >>>> {
>>> >>>>  int val = 0;
>>> >>>>  while (test_cond())
>>> >>>>  {
>>> >>>>    if (test_cond())
>>> >>>>    {
>>> >>>>      val += test_cond();
>>> >>>>    }
>>> >>>>  }
>>> >>>>  return val;
>>> >>>> }
>>> >>>>
>>> >>>> 1. compile with clang: clang -S -emit-llvm test.C
>>> >>>> 2. opt -O3 -S test.s
>>> >>>>
>>> >>>> The optimized assembly has two loops, where the input program just
>>> >>>> had
>>> >>>> a single loop with a conditional inside it.
>>> >>>>
>>> >>>> On Tue, Mar 8, 2011 at 11:16 AM, Andrew Clinton <ajclinto at gmail.com>
>>> >>>> wrote:
>>> >>>>> The goal wasn't to improve the resulting code (although I didn't
>>> >>>>> want
>>> >>>>> it to degrade it either).  The purpose was to improve analysis, so
>>> >>>>> that when there is plain ordinary control flow inside a loop (that
>>> >>>>> happens to be represented as multiple backedges), it's not converted
>>> >>>>> to nested loops.  This conversion makes it very awkward to analyse
>>> >>>>> the
>>> >>>>> looping structure in my application.  I've worked around this by
>>> >>>>> writing my own "loop-merging" pass to undo the nesting
>>> >>>>> transformation
>>> >>>>> for loops that have no exit (that is not contained in another loop).
>>> >>>>> Perhaps, if this codegen problem with this patch can't be
>>> >>>>> surmounted,
>>> >>>>> I could submit the "loop-merging" pass in a separate patch.
>>> >>>>>
>>> >>>>> Looking at the IR, I see the following:
>>> >>>>>
>>> >>>>> With patch:
>>> >>>>> - "Inner loop" has 2 conditional branches
>>> >>>>> - "Outer loop" has 2 conditional branches and 1 unconditional branch
>>> >>>>>
>>> >>>>> Without patch:
>>> >>>>> - Inner loop has 2 conditional branches
>>> >>>>> - Outer loop has 2 conditional branches and 2 unconditional branches
>>> >>>>>
>>> >>>>> How is the IR more complex with the patch?
>>> >>>>>
>>> >>>>> On Tue, Mar 8, 2011 at 4:52 AM, Cameron Zwarich <zwarich at apple.com>
>>> >>>>> wrote:
>>> >>>>>> Upon closer inspection, the difference in codegen is reflected in
>>> >>>>>> the input that is fed to llc. With your patch both the inner and outer loops
>>> >>>>>> have a split backedge. Without your patch only the outer loop has a split
>>> >>>>>> backedge. It's not a coincidence that the block placement is better with the
>>> >>>>>> version that doesn't split the inner loop's backedge, and as far as I can
>>> >>>>>> tell there is no pass that is generally capable of repairing such a thing.
>>> >>>>>>
>>> >>>>>> What benefit does your patch bring? Is there some case where it
>>> >>>>>> improves the resulting code?
>>> >>>>>>
>>> >>>>>> Cameron
>>> >>>>>>
>>> >>>>>> On Mar 7, 2011, at 6:28 PM, Andrew Clinton wrote:
>>> >>>>>>
>>> >>>>>>> I've run a few more tests based on the IR listed below.  It
>>> >>>>>>> appears
>>> >>>>>>> that regardless of whether my patch is present, llc will generate
>>> >>>>>>> code
>>> >>>>>>> without the extra "jmp" for the old .ll and with the extra "jmp"
>>> >>>>>>> with
>>> >>>>>>> the new .ll.  So it appears that the patch is not responsible for
>>> >>>>>>> this
>>> >>>>>>> difference in the generated x86 assembly code.  I'm guessing that
>>> >>>>>>> the
>>> >>>>>>> blocks happen to fall in a nicer pattern for codegen in this
>>> >>>>>>> particular case, when an outer loop is split out.  Could you
>>> >>>>>>> commit
>>> >>>>>>> the patch?
>>> >>>>>>>
>>> >>>>>>> Andrew
>>> >>>>>>>
>>> >>>>>>> On Thu, Mar 3, 2011 at 5:59 PM, Cameron Zwarich
>>> >>>>>>> <zwarich at apple.com> wrote:
>>> >>>>>>>> That is a bit strange. There is another run of LoopSimplify in
>>> >>>>>>>> the backend for LoopStrengthReduce. Perhaps a problem is reintroduced there
>>> >>>>>>>> that is not cleaned up.
>>> >>>>>>>>
>>> >>>>>>>> Cameron
>>> >>>>>>>>
>>> >>>>>>>> On Mar 3, 2011, at 8:10 AM, Andrew Clinton wrote:
>>> >>>>>>>>
>>> >>>>>>>>> I'm not sure why the code generator is synthesizing an
>>> >>>>>>>>> additional
>>> >>>>>>>>> unconditional branch.  The IR is certainly simpler with this
>>> >>>>>>>>> patch.
>>> >>>>>>>>>
>>> >>>>>>>>> --------------------------------
>>> >>>>>>>>> With patch:
>>> >>>>>>>>>> opt -O3 not-nested-loop.ll -S
>>> >>>>>>>>>
>>> >>>>>>>>> ; ModuleID = 'not-nested-loop.ll'
>>> >>>>>>>>>
>>> >>>>>>>>> define i32 @_Z4testv() {
>>> >>>>>>>>> ; <label>:0
>>> >>>>>>>>>  %1 = tail call zeroext i1 @_Z9test_condv()
>>> >>>>>>>>>  br i1 %1, label %.lr.ph, label %._crit_edge
>>> >>>>>>>>>
>>> >>>>>>>>> .lr.ph:                                           ; preds = %0,
>>> >>>>>>>>> %.backedge
>>> >>>>>>>>>  %val.02 = phi i32 [ %val.0.be, %.backedge ], [ 0, %0 ]
>>> >>>>>>>>>  %i.01 = phi i32 [ %tmp, %.backedge ], [ 0, %0 ]
>>> >>>>>>>>>  %tmp = add i32 %i.01, 1
>>> >>>>>>>>>  %2 = mul nsw i32 %tmp, %tmp
>>> >>>>>>>>>  %3 = icmp sgt i32 %2, 6
>>> >>>>>>>>>  br i1 %3, label %5, label %.backedge
>>> >>>>>>>>>
>>> >>>>>>>>> .backedge:                                        ; preds =
>>> >>>>>>>>> %.lr.ph, %5
>>> >>>>>>>>>  %val.0.be = phi i32 [ %8, %5 ], [ %val.02, %.lr.ph ]
>>> >>>>>>>>>  %4 = tail call zeroext i1 @_Z9test_condv()
>>> >>>>>>>>>  br i1 %4, label %.lr.ph, label %._crit_edge
>>> >>>>>>>>>
>>> >>>>>>>>> ; <label>:5                                       ; preds =
>>> >>>>>>>>> %.lr.ph
>>> >>>>>>>>>  %6 = tail call zeroext i1 @_Z9test_condv()
>>> >>>>>>>>>  %7 = zext i1 %6 to i32
>>> >>>>>>>>>  %8 = add nsw i32 %7, %val.02
>>> >>>>>>>>>  br label %.backedge
>>> >>>>>>>>>
>>> >>>>>>>>> ._crit_edge:                                      ; preds =
>>> >>>>>>>>> %.backedge, %0
>>> >>>>>>>>>  %val.0.lcssa = phi i32 [ 0, %0 ], [ %val.0.be, %.backedge ]
>>> >>>>>>>>>  ret i32 %val.0.lcssa
>>> >>>>>>>>> }
>>> >>>>>>>>>
>>> >>>>>>>>> declare zeroext i1 @_Z9test_condv()
>>> >>>>>>>>>
>>> >>>>>>>>> --------------------------------
>>> >>>>>>>>> Without patch:
>>> >>>>>>>>>
>>> >>>>>>>>>> opt -O3 not-nested-loop.ll -S
>>> >>>>>>>>>
>>> >>>>>>>>> ; ModuleID = 'not-nested-loop.ll'
>>> >>>>>>>>>
>>> >>>>>>>>> define i32 @_Z4testv() {
>>> >>>>>>>>> ; <label>:0
>>> >>>>>>>>>  br label %.outer
>>> >>>>>>>>>
>>> >>>>>>>>> .outer:                                           ; preds = %6,
>>> >>>>>>>>> %0
>>> >>>>>>>>>  %i.0.ph = phi i32 [ 1, %0 ], [ %phitmp, %6 ]
>>> >>>>>>>>>  %val.0.ph = phi i32 [ 0, %0 ], [ %9, %6 ]
>>> >>>>>>>>>  br label %1
>>> >>>>>>>>>
>>> >>>>>>>>> ; <label>:1                                       ; preds =
>>> >>>>>>>>> %.outer, %3
>>> >>>>>>>>>  %indvar = phi i32 [ 0, %.outer ], [ %indvar.next, %3 ]
>>> >>>>>>>>>  %tmp1 = add i32 %i.0.ph, %indvar
>>> >>>>>>>>>  %2 = tail call zeroext i1 @_Z9test_condv()
>>> >>>>>>>>>  br i1 %2, label %3, label %10
>>> >>>>>>>>>
>>> >>>>>>>>> ; <label>:3                                       ; preds = %1
>>> >>>>>>>>>  %4 = mul nsw i32 %tmp1, %tmp1
>>> >>>>>>>>>  %5 = icmp sgt i32 %4, 6
>>> >>>>>>>>>  %indvar.next = add i32 %indvar, 1
>>> >>>>>>>>>  br i1 %5, label %6, label %1
>>> >>>>>>>>>
>>> >>>>>>>>> ; <label>:6                                       ; preds = %3
>>> >>>>>>>>>  %7 = tail call zeroext i1 @_Z9test_condv()
>>> >>>>>>>>>  %8 = zext i1 %7 to i32
>>> >>>>>>>>>  %9 = add nsw i32 %8, %val.0.ph
>>> >>>>>>>>>  %phitmp = add i32 %tmp1, 1
>>> >>>>>>>>>  br label %.outer
>>> >>>>>>>>>
>>> >>>>>>>>> ; <label>:10                                      ; preds = %1
>>> >>>>>>>>>  ret i32 %val.0.ph
>>> >>>>>>>>> }
>>> >>>>>>>>>
>>> >>>>>>>>> declare zeroext i1 @_Z9test_condv()
>>> >>>>>>>>>
>>> >>>>>>>>> On Thu, Mar 3, 2011 at 3:43 AM, Cameron Zwarich
>>> >>>>>>>>> <zwarich at apple.com> wrote:
>>> >>>>>>>>>> When I compile your first test case without this patch for X86,
>>> >>>>>>>>>> I see 2 conditional branches in the loop body and no unconditional branches.
>>> >>>>>>>>>> With this patch, there is an extra unconditional branch inside the loop.
>>> >>>>>>>>>> Unfortunately, I don't think this is cleaned up as well as you assume.
>>> >>>>>>>>>>
>>> >>>>>>>>>> Cameron
>>> >>>>>>>>>>
>>> >>>>>>>>>> On 2011-03-02, at 11:56 PM, Andrew Clinton wrote:
>>> >>>>>>>>>>
>>> >>>>>>>>>>> The role of the loop-simplify pass, as described in the
>>> >>>>>>>>>>> header, is to
>>> >>>>>>>>>>> transform natural loops into a simpler form.  The goal of this
>>> >>>>>>>>>>> patch
>>> >>>>>>>>>>> is not to produce the minimum number of instructions but to
>>> >>>>>>>>>>> ensure
>>> >>>>>>>>>>> that additional (unnecessary) complexity is not introduced by
>>> >>>>>>>>>>> the
>>> >>>>>>>>>>> simplification.  I think this is a good change since running
>>> >>>>>>>>>>> the
>>> >>>>>>>>>>> natural loops analysis after this pass will then produce the
>>> >>>>>>>>>>> expected
>>> >>>>>>>>>>> loop nesting.  Without this patch, I have found it necessary
>>> >>>>>>>>>>> to write
>>> >>>>>>>>>>> a new pass that merges the nested loops.
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> I believe that other passes will clean up extra branching (CFG
>>> >>>>>>>>>>> simplification), and PHI nodes will usually add runtime
>>> >>>>>>>>>>> complexity
>>> >>>>>>>>>>> only for the incoming blocks.  So I wouldn't expect runtime
>>> >>>>>>>>>>> performance to be impacted by this change.
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> Attached is the revised diff - thank you for the feedback!
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> Andrew
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> On Thu, Mar 3, 2011 at 2:10 AM, Cameron Zwarich
>>> >>>>>>>>>>> <zwarich at apple.com> wrote:
>>> >>>>>>>>>>>> Is splitting an edge to create a unique loop backedge better
>>> >>>>>>>>>>>> than creating two loops? If the inner loop's backedge is more frequently
>>> >>>>>>>>>>>> taken, then you are trading an extra unconditional branch (with phis) on
>>> >>>>>>>>>>>> loop entry for an extra unconditional branch (with phis) along the inner
>>> >>>>>>>>>>>> loop's backedge. When is this a good trade?
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>> Cameron
>>> >>>>>>>>>>> <nestedloops.diff>
>>> >>>>>>>>>>
>>> >>>>>>>>>>
>>> >>>>>>>>
>>> >>>>>>>>
>>> >>>>>>
>>> >>>>>>
>>> >>>>>
>>> >>>>
>>> >>>
>>>
>>> _______________________________________________
>>> llvm-commits mailing list
>>> llvm-commits at cs.uiuc.edu
>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
>>
>>



More information about the llvm-commits mailing list