[PATCH] D16829: An implementation of Swing Modulo Scheduling

Wed Feb 3 14:52:04 PST 2016

bcahoon added inline comments.

================
Comment at: lib/CodeGen/MachinePipeliner.cpp:3057-3059
@@ +3056,5 @@
+
+// Create branches from each prolog basic block to the appropriate epilog
+// block.  These edges are needed if the loop ends before reaching the
+// kernel.
+void SwingSchedulerDAG::addBranches(MBBVectorTy &PrologBBs,
----------------
materi wrote:
> bcahoon wrote:
> > materi wrote:
> > > I do not understand how this works when more than one iteration starts to execute in the prolog.
> > > 
> > > For example if the runtime trip count is 1, and 2 iterations are started in the prolog. Don't you miss executing some instructions from the only loop iteration?
> > > 
> > > If this is not a bug, maybe you can add a test case that shows how this works?
> > If two iterations are started in the prolog, then we generate two prolog basic blocks, and two epilog basic blocks.  At the end of each prolog basic block, we add a compare and branch to the corresponding epilog basic block (the fall through is to the next prolog block or the kernel).  This means that the first prolog block contains instructions from stage 0 and the second prolog block contains instructions from stage 1 and the 2nd iteration of stage 0.
> > 
> > In your example, with a run-time trip count of 1, the first prolog block branches to the last epilog block, and the instructions in the last epilog block are the first iteration of instructions scheduled in stage 1 and stage 2.
> > 
> > The swp-max.ll test case shows a pipelined schedule with 2 prolog and epilog blocks. 
> Thank you! I think I understand how it works now. The prolog and epilog blocks are not the "bundles" of the SWP prolog and epilog. The jump label for my trip count = 1 case is put in the middle of the first "epilog bundle".
> 
> But what if there are loop carried 0-latency dependences in the graph? This will force a certain order within the kernel to allow correct bundling in a later step. Can this be handled?
If I'm understanding your question, then yes - we do handle the case of a loop-carried 0-latency dependence.  The order of the instructions in the prolog and epilog blocks is different from the order in the pipelined schedule.  The prolog/epilog instructions appear in the original instruction order (i.e., prior to pipelining), and they are grouped by the pipelined stage.

As an example, let's say there are 3 stages, numbered 0, 1, 2, so there will be two prolog blocks and two epilog blocks.  The first prolog contains instructions from stage 0 in the original order. The last epilog contains instructions from stages 1 and 2 in original order.  If the loop contains only 1 iteration, then the stage 0 instructions in the first prolog are executed, and control jumps to the last epilog block to execute the first iteration of instructions from stages 1 and 2.

In the second prolog, we first generate the instructions from stage 1 in the original order, and then stage 0 in original order.  In the second to last epilog, we generate instructions for stage 2 in the original order.  If the loop has 2 iterations, then the 2 prolog blocks execute instructions from stage 0 twice, and stage 1 once.  The 2 epilog blocks execute instructions from stage 2 twice and stage 1 once.
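
To make the counting concrete, here is a small standalone sketch that models the block layout described above and counts how often each stage runs for a given trip count.  It is not code from the patch: countStageExecutions and its parameters are hypothetical names, and it assumes a run-time trip count of at least 1 and that the last prolog block branches to the first epilog block when the kernel is not needed.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical model of the prolog/kernel/epilog layout; not LLVM code.
// Returns, for each stage, how many times its instructions execute.
std::vector<int> countStageExecutions(int NumStages, int TripCount) {
  assert(NumStages >= 1 && TripCount >= 1 && "model assumes a non-empty loop");
  std::vector<int> Count(NumStages, 0);
  int NumBlocks = NumStages - 1; // number of prolog blocks == epilog blocks
  int PrologsRun = std::min(TripCount, NumBlocks);

  // Prolog block I holds stages I, I-1, ..., 0 (e.g. second prolog: 1 then 0).
  for (int I = 0; I < PrologsRun; ++I)
    for (int Stage = I; Stage >= 0; --Stage)
      ++Count[Stage];

  // The kernel holds every stage and runs only once the pipeline is full.
  for (int K = 0; K < TripCount - NumBlocks; ++K)
    for (int Stage = 0; Stage < NumStages; ++Stage)
      ++Count[Stage];

  // Epilog block E holds stages NumBlocks-E .. NumStages-1, so the last
  // epilog holds stages 1 .. NumStages-1.  Leaving after PrologsRun prolog
  // blocks enters the epilog chain at block NumBlocks - PrologsRun and
  // falls through to the last epilog block.
  for (int E = NumBlocks - PrologsRun; E < NumBlocks; ++E)
    for (int Stage = NumBlocks - E; Stage < NumStages; ++Stage)
      ++Count[Stage];

  return Count;
}
```

In this model, with 3 stages and a trip count of 1, only the first prolog and the last epilog run, and each stage executes exactly once.  More generally, every stage executes exactly TripCount times, which is the property the branches from the prolog blocks to the epilog blocks are there to preserve.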

I hope this makes sense and answers your question correctly.  Let me know.

http://reviews.llvm.org/D16829