[llvm-dev] MachinePipeliner refactoring

James Molloy via llvm-dev llvm-dev at lists.llvm.org
Mon Jul 29 04:33:49 PDT 2019


Hi Brendon,

Thanks so much for this. The explanation really helped. Please do take a
look at the new version of my refactor - I think it's a load cleaner than
before and now *does* handle all of the weird phi special cases. It's great
that the Hexagon lit tests contain so many of these!

PTAL. It's not 100% ready, but if you like the direction I think this is
something we can put in under a flag while we finish out the last few
issues.

James

On Wed, 17 Jul 2019 at 20:15, Brendon Cahoon <bcahoon at quicinc.com> wrote:

> Hi James,
>
>
>
> My initial attempt at generating the prolog, kernel, and epilog was quite
> simple as well. But, as the pipeliner attempted to work on more complex
> loop structures, the logic got complicated pretty quickly.  This is most
> evident in the generateExistingPhis function, which is a mess.  I wanted
> this function to be general, but it evolved over time into a sequence of
> special cases.
>
>
>
> > To dig a little deeper, one thing I don't understand right now (and this
> is inhibiting me from refactoring effectively) is what it actually
> means to "schedule" a loop-carried PHI.
>
>
>
> This is a good question. I don’t think the Phis need to be scheduled, but
> even if the Phis aren’t scheduled, we still need to generate new Phis in
> the prolog and epilog. And, we need to refer to the original Phi to find
> the incoming values, and to maintain the map from original names to new
> names.
>
>
>
> The stage that a Phi is scheduled in is related to the stage(s) where the
> uses are scheduled, and the stage for the loop definition. The code uses
> the Phi stage to determine the total number of new Phis to generate. We
> probably could instead use the stages of the definition and uses to
> calculate the values we currently derive from the Phi stage – for example,
> the number of stages, or copies, needed between the definition and a use.
>
>
>
> If a Phi has the loop definition scheduled in stage 1, and uses of the Phi
> scheduled in stages 1 and 2, then the Phi will be scheduled in stage 1 most
> likely.  The kernel will require two new Phis to represent the values used
> and generated in stages 1 and 2. We probably could ignore the stage in
> which the Phi is scheduled, and compute the number of new Phis based upon
> the schedule stages for the loop definition and uses.
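>
> To make that concrete, here is a toy sketch (hypothetical names, not the
> actual MachinePipeliner code) of deriving the Phi count from the schedule
> stages of the loop definition and the uses, rather than from the Phi's own
> stage:
>
>   #include <algorithm>
>   #include <vector>
>
>   // One new Phi per stage from the loop definition's stage through the
>   // latest-stage use: a definition in stage 1 with uses in stages 1 and
>   // 2 needs two Phis, matching the example above. Assumes at least one
>   // use.
>   int newKernelPhis(int LoopDefStage, const std::vector<int> &UseStages) {
>     int LastUse = *std::max_element(UseStages.begin(), UseStages.end());
>     return LastUse - LoopDefStage + 1;
>   }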
>
>
>
> I don’t believe any of this is described in the SMS paper though.
>
>
>
> > The big thing I'm trying to wrap my head around is the logic in the
> generateExistingPhis() function
>
>
>
> Yeah, me too sometimes.  The role of this function is to generate new Phis
> in the kernel and epilogs for each Phi that occurred in the original loop.
> This function also updates the mapping between old and new names, VRMap,
> and updates register names used in other instructions.  The function
> generates the Phis at the start of the kernel and epilog blocks.  It needs
> to determine:
>
>    - How many new Phis to generate, based upon the schedule stages for the
>    definition and uses
>    - Which stage contains the incoming values, and which block holds the
>    value for that stage
>    - Whether previously scheduled uses need to be rewritten
>
>
>
> Determining the number of new Phis is straightforward, most of the time.
> There are some special cases when a Phi use is scheduled in an earlier
> cycle, but later stage, than the Phi loop definition.  For example,
>
>
>
>    = a            cycle 0, stage 1
>
>    a = phi(b, c)  cycle 1, stage 0
>
>    c =            cycle 2, stage 0
>
>
>
> In this case, only 1 new Phi is needed.  If the use of a is scheduled
> after the definition of c, but still in stage 1, then 2 Phis are needed.
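>
> A hedged sketch of that adjustment (again with hypothetical names),
> extending the stage-based count with the cycle comparison:
>
>   // Baseline: one new Phi per stage between the Phi and the use. A use
>   // at an earlier cycle than the Phi's loop definition reads the
>   // previous copy of the value, so one fewer Phi is needed.
>   int newPhisNeeded(int PhiStage, int UseStage,
>                     int UseCycle, int LoopDefCycle) {
>     int Count = UseStage - PhiStage + 1;
>     if (UseCycle < LoopDefCycle)
>       --Count;
>     return Count;
>   }
>
>   // From the example: PhiStage 0, UseStage 1, use at cycle 0, definition
>   // of c at cycle 2 -> 1 Phi; a use after c (cycle > 2) -> 2 Phis.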
>
>
>
> Also, there are some special cases when generating the Phis in the epilog.
> This happens because the code computes the maximum number of Phis needed,
> but that value decreases as the number of epilogs generated increases.
>
>
>
> Computing the block that contains the incoming values is most of the code
> in this function. Much of this code attempts to handle various cases
> related to Phis that refer to other Phis.  For example,
>
>
>
>    a = phi(b, c)
>
>    c = phi(d, e)
>
>       = a
>
>       = c
>
>    e =
>
>
>
> The number of dependent phis can be more than 2.
>
>
>
> The calculation for the correct block also depends on whether we’re
> generating code for the kernel or the epilog.  For example, in the kernel
> the second Phi operand always comes from the kernel, but in the epilog, the
> second Phi operand may come from the kernel, or a previous epilog block.
> An epilog block will never have a Phi that depends on a Phi from the same
> block (like the case shown above).
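>
> A sketch of that operand-source rule (toy types; the epilog indexing here
> is illustrative, not the real calculation):
>
>   enum class BlockKind { Kernel, Epilog };
>   struct Source { BlockKind Kind; int EpilogIdx; }; // EpilogIdx is -1
>                                                     // for the kernel
>
>   // Where the loop-carried (second) operand of a new Phi comes from: in
>   // the kernel it is always the kernel's back edge; in epilog E it is an
>   // earlier epilog when one holds the value, otherwise the kernel; never
>   // epilog E itself.  StagesBack is the number of stage copies between
>   // the Phi and the value it reads (assumed >= 1).
>   Source phiLoopOperandSource(BlockKind B, int EpilogIdx, int StagesBack) {
>     if (B == BlockKind::Kernel)
>       return {BlockKind::Kernel, -1};
>     int Src = EpilogIdx - StagesBack;
>     return Src < 0 ? Source{BlockKind::Kernel, -1}
>                    : Source{BlockKind::Epilog, Src};
>   }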
>
>
>
> In the simplest case, the stage with the incoming value is the difference
> between the schedule stage of the use and definition.  The extra
> computation occurs when the Phi or loop definition is not scheduled in
> stage 0. Then, adjustments are made to find the stage and block with the
> correct value.  If a Phi loop definition is scheduled in stage 1, then the
> first occurrence of the instruction appears in the kernel (for a 2 stage
> pipeline). This also means that a Phi is needed in the epilog for that
> value. When extended to schedules with multiple stages, the calculations
> get trickier.  For example, in a 3 stage pipeline, a Phi loop definition
> scheduled in stage 1 appears in the 2nd prolog, the kernel, and the 2nd
> epilog.  If scheduled in stage 2, it appears in the kernel, the 1st epilog,
> and the 2nd epilog.  When rewriting the uses of that Phi, the calculation
> of the block with the correct value depends on its schedule stage (and on
> whether the use appears before/after the Phi in the schedule).
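>
> A toy model (not LLVM code) of the placement rule just described: in an
> N-stage pipeline, an instruction scheduled in stage S appears in prolog
> blocks S..N-2, in the kernel, and in epilog blocks (N-1-S)..N-2:
>
>   #include <cstdio>
>
>   void printBlocksForStage(int S, int N) {
>     for (int P = S; P <= N - 2; ++P)
>       std::printf("prolog %d, ", P + 1);  // 1-based, as in the text
>     std::printf("kernel");
>     for (int E = N - 1 - S; E <= N - 2; ++E)
>       std::printf(", epilog %d", E + 1);
>     std::printf("\n");
>   }
>
>   // printBlocksForStage(1, 3) -> prolog 2, kernel, epilog 2
>   // printBlocksForStage(2, 3) -> kernel, epilog 1, epilog 2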
>
>
>
> Let me try to enumerate some more of the special cases in the code.
> You’re correct in that we just need to answer a fairly straightforward
> question about which stage and block contains the value we’re looking for.
>
>
>
> Thanks,
>
> Brendon
>
>
>
> *From:* James Molloy <james at jamesmolloy.co.uk>
> *Sent:* Tuesday, July 16, 2019 9:02 AM
> *To:* Brendon Cahoon <bcahoon at quicinc.com>
> *Cc:* Jinsong Ji <jji at us.ibm.com>; Hal Finkel <hfinkel at anl.gov>; LLVM Dev
> <llvm-dev at lists.llvm.org>
> *Subject:* [EXT] Re: MachinePipeliner refactoring
>
>
>
> Hi Brendon,
>
>
>
> Thanks for the in-depth reply!
>
>
>
> > Loop carried Phis are hard.
>
>
>
> I think the whole thing can be summarized like this :) The data flow
> between prolog, kernel and epilog would be fairly trivial to handle without
> these (and this is demonstrated by my prototype being ~300 lines of code).
>
>
>
> To dig a little deeper, one thing I don't understand right now (and this
> is inhibiting me from refactoring effectively) is what it actually
> means to "schedule" a loop-carried PHI. In my naive head, things would be a
> lot simpler if all phis were scheduled in stage zero. Given that they are
> free, is the idea behind them being in a non-zero stage purely for register
> allocation? (I'm aware the answer to this may well be "read the SMS
> literature more carefully"!)
>
>
>
> The big thing I'm trying to wrap my head around is the logic in the
> generateExistingPhis() function. I feel that everything else seems fairly
> easy to refactor; it's the logic here that's tied closely to the concepts
> of prologs and epilogs. It feels like we should be able to answer the
> question "Given this loop-carried PHI, what instruction in what stage (as
> an offset from the current stage) generates this value?"
>
>
>
> If we could answer that question, generating the tree of phis to plumb the
> values through seems, while not trivial, an orthogonal task.
>
>
>
> Thanks for answering my newbie questions!
>
>
>
> James
>
>
>
> On Tue, 16 Jul 2019 at 01:35, Brendon Cahoon <bcahoon at quicinc.com> wrote:
>
> Hi James,
>
>
>
> I also think that refactoring the code generation part is a great idea.
> That code is very complicated and difficult to maintain. I’ve wanted to
> rewrite that code for a long time, but just have never gotten to it.  There
> are quite a few edge cases to handle (at least in the current code). I’ll
> take a deeper look at your patch.  The abstractions that you mention, Stage
> and Block, are good ones. I think you’ll need to account for the cycle, or
> position within a Stage, as well. It is a complex problem with a lot of
> different edge cases when dealing with Phis, though I think they can be
> dealt with much better than in the existing code.
>
>
>
> Generating code for the 3 main parts, the prolog, kernel, and epilog, can
> be challenging because each has unique differences that make it hard to
> generalize the code.  For example, the prologs will not have any Phis, but
> the code generation needs to be aware of when the Phis occur in the
> sequence in order to get the correct remapped name. In the kernel, values
> may come from a previous iteration of the kernel. In the epilog, values can
> come from the same epilog block, the kernel, or a prolog block.
>
>
>
> It would be great to be able to generate the compact representation that
> you describe, so that a prolog/epilog do not need to be generated. Hexagon
> has a special hardware loop instruction, spNloop, specifically for this,
> although the compiler does not currently generate that instruction.
> Perhaps it would be easier to generate this form first, and then convert it
> to a prolog(s), kernel, epilog(s) sequence.
>
>
>
> There are a lot of subtle rules when dealing with the Phi instructions
> depending on when they are scheduled relative to uses. There are two
> different factors to account for – the stage and cycle when the Phi is
> scheduled.  For example, a use of a Phi can be scheduled at an earlier
> cycle, but later stage. Several different combinations can occur, and each
> use of the Phi can have a different stage/cycle.  Maintaining the map
> between new and old virtual registers depends on the stage, cycle, and the
> current block (prolog, kernel, or epilog).
>
>
>
> The epilogs are generated in reverse order due to the way the instructions
> need to be laid out. The first epilog block after the kernel is going to
> have instructions scheduled in the last stage only, then the next epilog
> block, if needed, will have instructions from LastStage-1 first, then the
> LastStage.
>
>
>
> Let’s say we have instructions X, Y, Z, that are scheduled in different
> iterations/stages: X in stage0, Y in stage1, and Z in stage2, and X
> generates a value used by Y, which generates a value used by Z. The
> generated code will look like:
>
> Prolog 0 (this block ends with a conditional jump to Epilog 0):
>
>   X
>
> Prolog 1 (this block ends with a conditional jump to Epilog 1):
>
>   Y
>
>   X
>
> Kernel:
>
>   Z, Y, X
>
> Epilog 0:
>
>   Z
>
> Epilog 1:
>
>   Y
>
>   Z
>
>
>
> The instructions in the prolog and epilog are generated in original program
> order within each stage, but in the last epilog block, the instruction Y
> generates a value used by the subsequent Z.
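>
> The same layout can be written as a small loop (a toy illustration, not
> the pipeliner's code), which reproduces the X/Y/Z blocks above for a
> 3-stage schedule:
>
>   #include <cstdio>
>
>   void printLayout(int NumStages) {
>     for (int P = 0; P < NumStages - 1; ++P) {
>       std::printf("Prolog %d:", P);
>       for (int S = P; S >= 0; --S) std::printf(" stage%d", S);
>       std::printf("\n");
>     }
>     std::printf("Kernel:");
>     for (int S = NumStages - 1; S >= 0; --S) std::printf(" stage%d", S);
>     std::printf("\n");
>     // Epilogs drain the in-flight iterations: epilog E runs stages
>     // (NumStages-1-E)..(NumStages-1), in increasing order.
>     for (int E = 0; E < NumStages - 1; ++E) {
>       std::printf("Epilog %d:", E);
>       for (int S = NumStages - 1 - E; S <= NumStages - 1; ++S)
>         std::printf(" stage%d", S);
>       std::printf("\n");
>     }
>   }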
>
>
>
> The reduceLoopCount function is needed for an architecture that has a
> hardware loop instruction like Hexagon’s LOOP0.  For each prolog generated,
> the loop counter in the instruction needs to be decremented. This could
> have been done only in the last prolog block, but it’s done every time a
> prolog block is generated instead (initially, I thought that would be
> easier since the prolog generation code wouldn’t need to know which block
> was the last one).  But, later, I added code to handle the case when the
> loop can be removed completely. It seems reasonable to change this so that
> it’s updated only once, especially since finding the loop instruction can
> require searching through the CFG.
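>
> The arithmetic itself is simple (a toy sketch, not the TII hook): each
> prolog block peels one iteration, so decrementing once per generated
> prolog is equivalent to a single adjustment at the end:
>
>   // A hardware loop that originally ran TripCount times must run
>   // TripCount - NumPrologs times around the kernel.
>   int adjustedLoopCount(int TripCount, int NumPrologs) {
>     return TripCount - NumPrologs;
>   }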
>
>
>
> Loop carried Phis are hard. Initially, I tried handling the cases for
> adding new Phis and existing Phis with one function. But, it turned out
> that adding new Phis, when an instruction use is generated in an earlier
> cycle, but later stage, doesn’t cause as many special cases as handling the
> different edge cases for existing Phis.  There are fewer combinations when
> adding new Phis.
>
>
>
> Yes, you’re correct about the use of “iteration” vs. “stage” (I even used
> them together earlier).  The use of the term stage comes from the paper, but I
> often use iteration when talking to others about software pipelining,
> especially if they have read the paper.  I agree that we should be more
> consistent with using the terms.
>
>
>
> Let me think some more about the questions you’ve asked. I do think it’s
> worthwhile to spend some time on creating a good abstraction and improving
> the code. I appreciate that you’ve started to think/work on this problem,
> so I’d like to help out to pursue this direction.
>
>
>
> Thanks,
>
> Brendon
>
>
>
>
>
> *From:* James Molloy <james at jamesmolloy.co.uk>
> *Sent:* Monday, July 15, 2019 11:05 AM
> *To:* Jinsong Ji <jji at us.ibm.com>
> *Cc:* Brendon Cahoon <bcahoon at quicinc.com>; Hal Finkel <hfinkel at anl.gov>;
> LLVM Dev <llvm-dev at lists.llvm.org>
> *Subject:* [EXT] Re: MachinePipeliner refactoring
>
>
>
> Hi Jingsong,
>
>
>
> Thanks for testing out the prototype! I'm not surprised there are errors
> in that version; it's not 100% ready yet (mainly a demonstrator for the
> path I'd like to take). I'm really trying to work out if it makes sense to
> try and incrementally get from the current state of the world to that
> prototype, or if it's just so far away that a full-on refactor (or two code
> paths) is required. I suspect only Brendon really has the context to give
> insight there!
>
>
>
> James
>
>
>
> On Mon, 15 Jul 2019 at 16:32, Jinsong Ji <jji at us.ibm.com> wrote:
>
> Hi James:
>
> Personally, I like the idea of refactoring and more abstraction,
> but unfortunately, I don't know enough about the edge cases either.
>
> BTW: the prototype is still causing quite a few assertions in PowerPC -
> some nodes are not generated in the correct order.
>
>
> Best,
>
> Jinsong Ji (纪金松), PhD.
>
> XL/LLVM on Power Compiler Development
> E-mail: jji at us.ibm.com
>
> From: James Molloy <james at jamesmolloy.co.uk>
> To: LLVM Dev <llvm-dev at lists.llvm.org>, jji at us.ibm.com,
> bcahoon at quicinc.com, Hal Finkel <hfinkel at anl.gov>
> Date: 07/15/2019 06:16 AM
> Subject: [EXTERNAL] MachinePipeliner refactoring
> ------------------------------
>
>
>
>
> Hi Brendon (and friends of MachinePipeliner, +llvm-dev for openness),
>
> Over the past week or so I've been attempting to extend the
> MachinePipeliner to support different idioms of code generation. To make
> this a bit more concrete, there are two areas where the currently generated
> code could be improved depending on architecture:
>
>   1) The epilog blocks peel off the final iterations in reverse order.
> This means that the overall execution of loop iterations isn't in a
> perfectly pipelined order. For architectures that have hardware constructs
> that insist on a first-in-first-out order (queues), the currently generated
> code cannot be used.
>   2) For architectures (like Hexagon) that have dedicated predicate
> register files, we can generate a compact representation of the loop by
> predicating stages of the loop kernel independently. In this case we can
> either have a prolog, epilog, or neither (wrapping the prolog and epilog
> inside the kernel by using PHIs of predicates).
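>
> A conceptual model of idiom (2) in plain C++ (hypothetical stage bodies;
> real codegen would use predicate registers, not branches): a single kernel
> where each stage is guarded by its own predicate, so the first iterations
> act as the prolog and the last ones as the epilog:
>
>   #include <cstdio>
>
>   // Stand-ins for the instructions of a 3-stage schedule.
>   void X(int I) { std::printf("X(%d)\n", I); }  // stage 0
>   void Y(int I) { std::printf("Y(%d)\n", I); }  // stage 1
>   void Z(int I) { std::printf("Z(%d)\n", I); }  // stage 2
>
>   void pipelinedLoop(int TripCount) {
>     const int NumStages = 3;
>     for (int I = 0; I < TripCount + NumStages - 1; ++I) {
>       if (I >= 2 && I - 2 < TripCount) Z(I - 2); // lags two iterations
>       if (I >= 1 && I - 1 < TripCount) Y(I - 1); // lags one iteration
>       if (I < TripCount)               X(I);     // current iteration
>     }
>   }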
>
> At the moment, a lot of the code generation helper code in
> MachinePipeliner is tightly fit to its current code generation strategy
> ("If we're in the epilog, do this, else do that"). I'm keen to try and make
> some of the complex calculations it does, such as where PHIs should come
> from, more abstract so they can be reused and composed.
>
> https://reviews.llvm.org/D64665 is my current best-effort. This generates
> perfect code for PowerPC, but causes a load of problems for Hexagon. It's
> become apparent that I don't know enough about some of the edge cases in
> the MachinePipeliner code to refactor this from scratch. I'm therefore
> looking for direction on refactoring in an incremental fashion.
>
> I think there are a couple of areas that I'd like to change, and I'd
> appreciate your ideas and opinions because I clearly don't know enough
> about the edge cases here.
>
>   a) TII->reduceLoopCount() is hard to understand. Understanding the
> intended semantics of this hook from the documentation, I've found, is
> hard. Its use appears to be strongly fit to Hexagon (there is even a
> comment about the removal of LOOP0 in the MachinePipeliner target agnostic
> code, which really shouldn't be there). Why it's called multiple times I
> don't understand (why can't we just call it once with the total number of
> iterations to peel?).
>   b) Understanding how loop-carried PHIs are linked together is really
> hard. There are two functions dedicated to this with many edge cases, which
> are linked with the prolog and epilog schedule. It'd be great to somehow
> factor these such that they are independent of the code generation
> strategy. Note that this is really important for some of the code gen
> strategies I mention at the beginning, because loop-carried PHIs in this
> case may actually end up being selects or uses of predicated instructions.
>   c) There is a slight conflation of "iteration" and "stage" in the
> documentation that makes it hard to follow what VRMap contains and the
> invariants between functions.
>
> My intent in D64665 was to create two abstractions: "Stage" and "Block".
> Instead of referring to stages by index (VRMap), each Stage would take a
> prior Stage as input. Stages are contained inside Blocks, which handle
> predecessors and successors. I feel that arranging the code generation in
> this CFG-like way will make the flow of data much easier to analyze. Of
> course, a Stage doesn't just depend on a prior Stage - its loop-carried
> inputs could come from any other Stage (and while I think I understand how
> this works, I clearly don't get all the edge cases).
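>
> To make the proposal concrete, a rough sketch of the types (hypothetical,
> per D64665; not the final design):
>
>   #include <vector>
>
>   struct Stage {
>     int Index;                              // schedule stage this models
>     Stage *Prior = nullptr;                 // stage feeding sequential inputs
>     std::vector<Stage *> LoopCarriedInputs; // any stage may feed a back edge
>   };
>
>   struct Block {
>     std::vector<Stage *> Stages;            // stages emitted into this block
>     std::vector<Block *> Preds, Succs;      // CFG edges between generated blocks
>   };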
>
> What do you think of this abstraction? Do you think it's doomed to failure
> because it's too simplistic to cover all the cases?
>
> Do you have any suggestions of areas where we can start to factor out
> without a large-scale code breakage? I'm finding this hard to get my teeth
> into as the proposed code structure is so different from its current form.
>
> Thanks for any opinions or suggestions!
>
> Cheers,
>
> James
>
>