[PATCH] Enhance loop rotation with existence of profile data in MachineBlockPlacement pass.

Tue Jul 7 16:13:52 PDT 2015

On Tue, Jul 7, 2015 at 4:00 PM, Cong Hou <congh at google.com> wrote:
> On Tue, Jul 7, 2015 at 3:35 PM, Xinliang David Li <davidxl at google.com>
> wrote:
>>
>> The already rotated inner loop needs to be treated as a single node
>> when participating in the parent loop's rotation; otherwise we may
>> waste compile time and end up with a suboptimal solution.
>>
>> Consider the loop nest:
>>
>> Entry
>> do {                 // outer loop
>>   B0
>>   if (...) {
>>     do {             // inner loop
>>       B1
>>       if (...) {
>>         B2
>>       } else {
>>         B3
>>       }
>>       B4
>>     } while (...);   // inner loop
>>   } else {
>>     B6
>>   }
>>   B5
>> } while (...);       // outer loop
>>
>> B7
>>
>>
>> The optimal inner loop layout is B3 B4 B1 B2.
>> The original outer loop layout is:  B0 B6 (B3 B4 B1 B2) B5
>>
>> The optimal rotation of the outer loop should produce
>> (B3 B4 B1 B2) B5 B0 B6, with the final layout being:
>>
>> Entry ((B3 B4 B1 B2) B5 B0 B6) B7
>>
>> However, the current algorithm may produce
>> Entry (B5 B0 B6 (B3 B4 B1 B2)) B7, because it will have the same cost
>> as the optimal one. The problem is that the cost analysis needs to
>> consider the edge from inside the inner loop to the top of the outer
>> loop chain: in the bad solution, there is an edge from B4 to B5 whose
>> cost should be considered as part-3 cost.
>
>
> Why is the layout Entry (B5 B0 B6 (B3 B4 B1 B2)) B7 worse than
> Entry ((B3 B4 B1 B2) B5 B0 B6) B7? In both cases, the inner loop chain is
> not split and is treated as a single node.
>

It matters when the outer loop is hot and the inner loop has a
relatively small trip count.
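
A rough illustration of why the trip count matters, assuming the inner
do-while body runs exactly N times per entry, so that B4 executes N times
and leaves through B4 -> B5 exactly once. The helper name
exit_edge_fraction is made up for this sketch and is not part of the patch:

```python
def exit_edge_fraction(trip_count):
    """Share of B4 executions that leave the inner loop via B4 -> B5.

    Assumes a do-while loop whose body runs exactly `trip_count` times
    per entry: B4 takes the back edge trip_count - 1 times and the exit
    edge once.
    """
    if trip_count < 1:
        raise ValueError("a do-while body runs at least once")
    return 1.0 / trip_count


for n in (2, 4, 100):
    print(n, exit_edge_fraction(n))
```

With a small trip count the exit edge carries a sizable share of B4's
outgoing frequency, so it is not negligible when the outer loop is hot.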

> The edge from B4 to B5 is relatively cold compared to the edges in the
> inner loop. It won't be a fall-through, as B4 is not the tail of the inner
> loop chain.

That is true, but treating B4 -> B5 as a 'fall through' also yields a
layout in which both paths through the outer loop are more compact. In
the suboptimal case, B6 sits between B0 and the inner loop chain, which
lowers icache utilization.
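
The compactness argument can be checked with a small sketch. It is only
an illustration under stated assumptions: each block is treated as one
layout slot (block counts stand in for code size), the layout is viewed
linearly rather than cyclically, and path_span / path_gap are hypothetical
helpers, not anything in MachineBlockPlacement.cpp.

```python
def path_span(layout, path_blocks):
    """Number of layout slots from the first to the last block of a path."""
    position = {block: i for i, block in enumerate(layout)}
    indices = [position[b] for b in path_blocks]
    return max(indices) - min(indices) + 1


def path_gap(layout, path_blocks):
    """Slots inside the path's span that hold blocks not on the path."""
    return path_span(layout, path_blocks) - len(set(path_blocks))


# Outer loop bodies of the two layouts discussed in this thread.
optimal = ["B3", "B4", "B1", "B2", "B5", "B0", "B6"]
suboptimal = ["B5", "B0", "B6", "B3", "B4", "B1", "B2"]

# Blocks touched by each of the two paths through the outer loop.
inner_path = ["B0", "B1", "B2", "B3", "B4", "B5"]
else_path = ["B0", "B6", "B5"]

for name, layout in (("optimal", optimal), ("suboptimal", suboptimal)):
    print(name, path_gap(layout, inner_path), path_gap(layout, else_path))
```

Under these assumptions both paths have no foreign block inside their
span in the optimal layout, while in the suboptimal layout the path
through the inner loop has one (B6).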

Given that the optimal outer loop layout does not actually reduce the
branch cost (it only improves icache reuse), I am fine with leaving the
implementation as is, assuming that making it optimal would require a lot
of effort. I would be more comfortable if a loop nest example were added
as a test.

David

>
> Cong
>
>>
>> David
>>
>> On Mon, Jul 6, 2015 at 10:22 AM, Cong Hou <congh at google.com> wrote:
>> > When the outer loop is rotated, the inner loop is already linked to
>> > other CFG nodes in the outer loop. So I think we won't have to adjust
>> > the current algorithm, as we have already considered all the costs
>> > that the rotation may add or remove.
>> >
>> >
>> > thanks,
>> > Cong
>> >
>> > On Mon, Jul 6, 2015 at 9:57 AM, Xinliang David Li <xinliangli at gmail.com>
>> > wrote:
>> >>
>> >> Does the cost analysis work well for loop nests? After the inner
>> >> loop chain is formed and rotated, it will later be merged into the
>> >> parent loop chain. The cost analysis for the parent loop may need to
>> >> be adjusted to consider the inner loops that are already rotated.
>> >>
>> >> David
>> >>
>> >> On Tue, Jun 30, 2015 at 2:29 PM, Cong Hou <congh at google.com> wrote:
>> >>>
>> >>> Updated the patch by adding two opt parameters that define the costs
>> >>> of a misfetch and a jump instruction, and using them when rotating
>> >>> loops.
>> >>>
>> >>>
>> >>> http://reviews.llvm.org/D10717
>> >>>
>> >>> Files:
>> >>>   lib/CodeGen/MachineBlockPlacement.cpp
>> >>>   test/CodeGen/X86/code_placement_loop_rotation.ll
>> >>>
>> >>
>> >
>
>