[LLVMdev] RFC: Machine Instruction Bundle
Andrew Trick
atrick at apple.com
Wed Jan 11 14:18:36 PST 2012
On Jan 11, 2012, at 4:56 AM, Stripf, Timo (ITIV) wrote:
> Hi Evan,
>
> I just read your proposal and the following discussion for VLIW support and want to share my experience of writing a VLIW back-end for LLVM.
>
>
> I would not integrate the packetizer into the register allocator super class, since that would reduce the back-end developer's flexibility to add optimization passes after the packetizer. Instead, I would add the packetizer as a separate pass. It is true that the packetizer must then deal with PHI and COPY nodes that are later eliminated by the RA. The packetizer can simply place each PHI and COPY instruction into its own single-instruction bundle.
>
> From my experience, a simple packetizer that groups instructions into bundles (like the old IA-64 back-end did) without changing the order of the instructions produces bad code. Instead, a VLIW scheduler that directly outputs bundles produces better code. The current LLVM scheduler (at the end of the instruction selection pass) is not suitable for generating bundled instructions since it operates on scheduling units for glued instructions.
> However, the post-RA scheduler in combination with a VLIW-aware hazard recognizer can be used before RA to bundle and schedule instructions for VLIW architectures. Only small modifications within the post-RA scheduler classes to support virtual registers are necessary.
What you're describing is how we expect some packetizers to be implemented, although it's really up to the person implementing the packetizer whether or not to integrate it with scheduling. The point is that a framework will support both scheduling and bundling before RA (and after coalescing).
> I also would not include packet finalization in the register allocator super class, since the following prolog and epilog code insertion (PECI) pass also adds extra instructions to the instruction list. So I would add the packet finalization after prolog and epilog code insertion. Both the RA and PECI can add their instructions as single-instruction bundles, which can then be integrated into larger bundles during packet finalization. For packet finalization it also makes sense to perform post-RA VLIW scheduling.
I think Evan referred to a regalloc "super pass". This means nothing more than a set of passes that all require and preserve liveness information.
Once physical registers are assigned and we throw away vreg liveness, passes can view bundles as individual instructions. That's what I think of as "packet finalization". Nothing really prevents rebundling though. The late bundler just may not have as much freedom.
-Andy
>
> Timo
>
>
> From: llvmdev-bounces at cs.uiuc.edu [mailto:llvmdev-bounces at cs.uiuc.edu] On Behalf Of Evan Cheng
> Sent: Friday, December 2, 2011 21:40
> To: LLVM Dev
> Subject: [LLVMdev] RFC: Machine Instruction Bundle
>
> Machine Instruction Bundle in LLVM
>
> Hi all,
>
> There has been quite a bit of discussion about adding machine instruction bundles to support VLIW targets. I have been pondering what the right representation should be and what kind of impact it might have on the LLVM code generator. I believe I have a fairly good plan now and would like to share it with the LLVM community.
>
> Design Criteria
>
> 1. The bundle representation must be light weight. We cannot afford to add significant memory or compile time overhead.
> 2. It must be flexible enough to represent more than VLIW bundles. It should be useful for representing arbitrary sequences of instructions that must be scheduled as a unit, e.g. ARM Thumb2 IT blocks, Intel compare + branch macro-fusion, or random instruction sequences that are currently modeled as pseudo instructions that are expanded late.
> 3. Minimize the amount of changes required in the LLVM code generator, especially in target independent passes. It must minimize code duplication (i.e. we don't want code snippets that search for bundle start / end, like all the code in the backend that skips over DBG_VALUE).
> 4. The representation should make it easy for new code to be oblivious of bundles. That is, MI passes should not have to check whether something is a bundle.
>
> Given the above, we can rule out a new class (e.g. MachineInstrBundle) right away. We don't want MachineBasicBlock to keep a list of MachineInstrBundles since that would require a massive amount of code change. So what are the choices?
>
> Bundle Representation
>
> 1. A nested MachineInstr: This is the most natural (meaning it looks most like the real HW bundle) representation. It has the nice property that most passes do not have to check if a MI is a bundle. The concern here is that this can add significant memory overhead if it means adding an ilist or SmallVector field to keep bundled MIs.
> 2. Add a bit to MachineInstr: The bit means the next MI in the list is part of the same bundle. This is very light weight. However, it requires many passes to check whether a MI is part of a bundle.
>
> The solution is a combination of both #1 and #2. Conceptually we want a representation that looks like this:
>
> --------------
> |   Bundle   | -------
> --------------        \
>        |           ----------------
>        |           |      MI      |
>        |           ----------------
>        |                   |
>        |           ----------------
>        |           |      MI      |
>        |           ----------------
>        |                   |
>        |           ----------------
>        |           |      MI      |
>        |           ----------------
>        |
> --------------
> |   Bundle   | ------
> --------------       \
>        |           ----------------
>        |           |      MI      |
>        |           ----------------
>        |                   |
>        |           ----------------
>        |           |      MI      |
>        |           ----------------
>        |                   |
>        |                   …
>        |
> --------------
> |   Bundle   | ------
> --------------       \
>        |
>       ...
>
>
> This is #1, a series of nested MI's. However, we are going to store the instructions in the same way as it's done right now, i.e. a list<MachineInstr> on MachineBasicBlocks. Using #2, we will add a bit to MI that indicates whether it is part of a bundle.
>
> ----------------
> | MI * | (* bit indicates next MI is "glued" to this MI, i.e. in the same bundle)
> ----------------
> |
> ----------------
> | MI * |
> ----------------
> |
> ----------------
> | MI | (no bit, this is the end of the bundle)
> --------------
> |
> ----------------
> | MI * | (* a new bundle)
> ----------------
> |
> ----------------
> | MI |
> ----------------
> |
> ...
>
> We are going to hide the complexity in the MachineBasicBlock::iterator instead. That is, the iterator will be changed to visit only the *top level* instructions (i.e. the first instruction in each bundle). We will add another iterator that allows clients to visit all of the MIs, for those passes that want to look into bundles.
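
A minimal sketch of the iteration rule described above, assuming a plain vector in place of the real instruction list. The types and the names Instr, GluedToNext, nextTopLevel and topLevelNames are hypothetical stand-ins for illustration, not the LLVM API:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for MachineInstr; not the real LLVM class.
struct Instr {
  std::string Name;
  bool GluedToNext; // the proposed bit: the next instruction is in the same bundle
};

// Given the index of a top-level instruction (a bundle start or a lone MI),
// return the index of the next top-level instruction, or Block.size() at the end.
std::size_t nextTopLevel(const std::vector<Instr> &Block, std::size_t I) {
  assert(I < Block.size() && "index out of range");
  // Walk to the last instruction of the current bundle (the one without the bit).
  while (Block[I].GluedToNext && I + 1 < Block.size())
    ++I;
  return I + 1;
}

// Collect the names a bundle-skipping iterator would visit.
std::vector<std::string> topLevelNames(const std::vector<Instr> &Block) {
  std::vector<std::string> Names;
  for (std::size_t I = 0; I < Block.size(); I = nextTopLevel(Block, I))
    Names.push_back(Block[I].Name);
  return Names;
}
```

A pass that wants to look inside bundles would simply walk the vector by index instead, which corresponds to the second iterator mentioned above.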
>
> We can use the same representation for arbitrary sequences of instructions that cannot be broken up, e.g. Thumb2 IT blocks.
>
> ----------------
> | MI | (just a MI)
> ----------------
> |
> ----------------
> | MI * | (* Start of Thumb2 IT block)
> ----------------
> |
> ----------------
> | MI * |
> ----------------
> |
> ----------------
> | MI | (last MI in the block)
> ----------------
> |
> ----------------
> | MI |
> ----------------
> |
> ...
>
> This representation can support VLIW (where top level MI's are all starts of bundles) or non-VLIW (where there can be a mix of MIs and bundles). It is also very cheap since the "Flags" field has plenty of free bits available.
>
> Properties of Bundle
>
> If MI passes can consider each bundle as a single unit, then how are they going to examine properties (i.e. flags and operands) of a MI bundle? Conceptually the properties of a bundle are the union of the properties of all the MIs inside the bundle. So a bundle reads all the inputs that the individual MIs read and it defines all the outputs of the individual MIs. However, this is not correct when there are intra-bundle dependencies. e.g.
>
> -------------------------
> | r0 = op1 r1, r2 |
> | r3 = op2 r0<kill>, #c |
> -------------------------
>
> r0 should not be considered as a source on the bundle since it's defined inside the bundle and its live range does not extend beyond it. Instead, r0 is a clobber (i.e. dead def).
>
> -------------------------
> | r0 = op1 r1, r2 |
> | r3 = op2 r0, #c |
> -------------------------
> ...
> = op3 r0,
>
> r0 is a def, not a use.
>
> What does this mean? It means that in order for passes to operate on a bundle at a time, they must be able to visit all the defs and uses of a bundle. We have established that computing the defs and uses of a bundle is not as trivial as taking the union. This is certainly not something we want to re-compute every time! This requires a slight change to the bundle representation.
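
A rough sketch of why the bundle's operands are not a plain union, using the two r0 examples above. It assumes the instructions in a bundle are read in order (as those examples do) and uses hypothetical types (Use, Instr, BundleSummary) and a hypothetical summarize() helper, none of which are the real LLVM API:

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical operand and instruction types, for illustration only.
struct Use {
  std::string Reg;
  bool Kill; // true if this read is marked <kill>
};

struct Instr {
  std::vector<std::string> Defs;
  std::vector<Use> Uses;
};

// What a bundle would have to expose to passes that treat it as one unit.
struct BundleSummary {
  std::set<std::string> Uses;     // registers read from outside the bundle
  std::set<std::string> Defs;     // registers defined and still live afterwards
  std::set<std::string> DeadDefs; // registers defined and killed inside (clobbers)
};

BundleSummary summarize(const std::vector<Instr> &Bundle) {
  BundleSummary S;
  std::set<std::string> LocalDefs;    // defined by an earlier MI in the bundle
  std::set<std::string> KilledInside; // local defs whose value dies in the bundle
  for (const Instr &MI : Bundle) {
    // Reads happen before this MI's own writes.
    for (const Use &U : MI.Uses) {
      if (!LocalDefs.count(U.Reg))
        S.Uses.insert(U.Reg);        // value comes from outside: a real use
      else if (U.Kill)
        KilledInside.insert(U.Reg);  // intra-bundle value ends here
    }
    for (const std::string &D : MI.Defs) {
      LocalDefs.insert(D);
      KilledInside.erase(D);         // a redefinition starts a new value
    }
  }
  for (const std::string &R : LocalDefs) {
    if (KilledInside.count(R))
      S.DeadDefs.insert(R);
    else
      S.Defs.insert(R);
  }
  // Every locally defined register ends up in exactly one of the two def sets.
  assert(S.Defs.size() + S.DeadDefs.size() == LocalDefs.size());
  return S;
}
```

With the first example (r0 killed inside the bundle) this yields uses r1, r2, def r3, and r0 as a dead def; without the kill flag r0 becomes an ordinary def.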
>
> ----------------
> | Bundle * | (A MI with special opcode "Bundle")
> ----------------
> |
> ----------------
> | MI * |
> ----------------
> |
> ----------------
> | MI * |
> ----------------
> |
> ----------------
> | MI | (no bit, this is the end of the bundle)
> --------------
> |
> ----------------
> | Bundle * | (a new bundle)
> ----------------
> |
> ----------------
> | MI * |
> ----------------
> |
> ----------------
> | MI |
> ----------------
> |
> ...
>
> The pseudo bundle instructions should be used to capture properties of the bundle. When a bundle is finalized the packetizer must add source and def operands to the pseudo bundle instruction. More on this later.
>
> Other properties, such as mayLoad, mayStore, are static properties associated with opcodes. They cannot be copied. We will add APIs to examine properties of MIs which will do the *right thing* for bundles (i.e. look into MIs in bundles).
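
One possible shape for such a property query, sketched under the assumption that a bundle is a header instruction followed by glued members as in the diagram above. The names Instr, StaticFlags, IsBundleHeader and hasProperty are hypothetical, not existing LLVM API:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-opcode property bits.
constexpr unsigned MayLoad = 1u << 0;
constexpr unsigned MayStore = 1u << 1;

// Hypothetical stand-in for MachineInstr.
struct Instr {
  unsigned StaticFlags; // static properties that come from the opcode
  bool IsBundleHeader;  // true for the pseudo "Bundle" instruction
  bool GluedToNext;     // the next instruction belongs to the same bundle
};

// Returns true if the instruction at index I has property P. For a bundle
// header, the answer is true if any instruction inside the bundle has it.
bool hasProperty(const std::vector<Instr> &Block, std::size_t I, unsigned P) {
  assert(I < Block.size() && "index out of range");
  if (!Block[I].IsBundleHeader)
    return (Block[I].StaticFlags & P) != 0;
  // Look into the bundle: visit each member glued after the header.
  while (Block[I].GluedToNext && I + 1 < Block.size()) {
    ++I;
    if ((Block[I].StaticFlags & P) != 0)
      return true;
  }
  return false;
}
```

A caller that only ever sees top-level instructions can then ask hasProperty(Block, I, MayLoad) without knowing whether I names a bundle or a single MI.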
>
> Packetizing
>
> The current MI flow looks like this:
>
> 1. DAG to MI lowering (and pre-RA schedule)
> 2. MI optimizations (LICM, CSE, etc.)
> 3. Register allocation super pass
> 3a. De-ssa (2-address, phi elim)
> 3b. Coalescing
> 3c. Actual register allocation
> 4. Post-RA optimizations
> 5. PEI
> 6. Post-RA scheduling
>
> In the hopefully not very distant future it should look like this:
>
> 1. DAG to MI lowering (no scheduling!)
> 2. MI optimizations (LICM, CSE, etc.)
> 3. Register allocation super pass
> 3a. De-ssa (2-address, phi elim)
> 3b. Coalescing
> 3c. Pre-RA scheduling
> 3d. Actual register allocation
> 4. Post-RA optimizations
> 5. PEI
> 6. Re-schedule restores, copies
>
> The current proposal is for "packetization" to be done as part of the "RA super pass". Early MI optimization passes such as LICM do not benefit from operating on bundles. Furthermore, the packetizer should not have to know how to deal with copies which may later be coalesced, phi nodes, or other copy like pseudo instructions.
>
> Packetization should be done in two phases. The first part decides which MIs should be bundled together and adds the "bits" which glue MIs together. This can be done either before or after pre-RA scheduling. The second part of the packetization should only be done after register allocation is completed. There are two very important reasons for this.
>
> 1. Packet finalization *must* add source and def operands to the "Bundle" pseudo MI. This allows all later passes to handle them transparently. However, we do not want to do this before register allocation is complete. Otherwise it introduces new defs and uses of virtual registers, and that messes up MachineRegisterInfo def-use chains.
>
> e.g. Now vr0 has two defs!
> defs: vr0<dead>, vr3, uses: vr1, vr2
> ----------------------------
> | vr0 = op1 vr1, vr2 |
> | vr3 = op2 vr0<kill>, #c |
> ----------------------------
>
> 2. During register allocation, more identity copies will be eliminated while loads, stores, copies, re-materialized instructions will be introduced. It makes sense for the second part of packetization to try to fill these new instructions into empty slots (for VLIW like targets).
>
> So the overall flow should look like this:
>
> 1. DAG to MI lowering (no scheduling!)
> 2. MI optimizations (LICM, CSE, etc.)
> 3. Register allocation super pass
> 3a. De-ssa (2-address, phi elim)
> 3b. Coalescing
> 3c. Pre-scheduling packetization (optional)
> 3d. Pre-RA scheduling (or integrated packetization)
> 3e. Post-scheduling packetization (optional)
> 3f. Actual register allocation
> 3g. Packet finalization
> 4. Post-RA optimizations
> 5. PEI
> 6. Re-schedule restores, copies
>
> Lowering Bundles to MCInst
>
> There is no need to add the equivalent of MI bundle to MCInst. A MI bundle should be concatenated into a single MCInst by storing opcodes as integer operands. e.g.
>
> -------------------------
> | r0 = op1 r1, r2 |
> | r3 = op2 r0, #c |
> -------------------------
>
> =>
>
> MCInst: op1 r0, r1, r2, op2, r3, r0, #c
> or
> MCInst: op1 op2 r0, r1, r2, r3, r0, #c
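
A sketch of the first layout above (op1 r0, r1, r2, op2, r3, r0, #c), where every opcode after the first is stored as an integer operand. The types Operand, Instr, FlatInst and the flatten() function are hypothetical and only mirror the idea, they are not the MCInst API:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical operand: a register number, an immediate, or an embedded opcode.
struct Operand {
  enum Kind { Reg, Imm, SubOpcode } K;
  int Value;
};

// Hypothetical stand-in for one MI inside a bundle.
struct Instr {
  int Opcode;
  std::vector<Operand> Ops;
};

// Hypothetical stand-in for the single flattened instruction.
struct FlatInst {
  int Opcode;
  std::vector<Operand> Ops;
};

// Concatenate a bundle into one instruction: the first MI provides the opcode,
// each later MI contributes its opcode as an integer operand followed by its
// own operands.
FlatInst flatten(const std::vector<Instr> &Bundle) {
  assert(!Bundle.empty() && "cannot flatten an empty bundle");
  FlatInst Out{Bundle.front().Opcode, {}};
  for (std::size_t I = 0; I < Bundle.size(); ++I) {
    if (I != 0)
      Out.Ops.push_back({Operand::SubOpcode, Bundle[I].Opcode});
    Out.Ops.insert(Out.Ops.end(), Bundle[I].Ops.begin(), Bundle[I].Ops.end());
  }
  return Out;
}
```

The second layout (all opcodes first, then all operands) would only change the order in which the two pieces are appended.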
>
> What's Next?
>
> I am hoping to find some time to implement the following in the near future:
> 1. Add BUNDLE opcode
> 2. MachineInstr class changes: new bit, changes to methods such as eraseFromParent(), isIdenticalTo().
> 3. Change MachineBasicBlock::iterator to skip over bundled MIs. Rename old iterator.
> 4. Add MachineInstr API to check for instruction properties and switch existing code over.
> 5. Add API to form a bundle. It would compute the proper def's and use's and add MachineOperands to the bundle MI.
> 6. Switch Thumb2 IT block to using MI bundles.
> 7. Add interface for targets to register their own packetization passes.
>
> I would dearly welcome help on any of these tasks, especially on 4, 5, and 6. I also would not cry if someone beats me to #6 (or actually any of the tasks). :-)
>
> In the longer term, I would like to see a target independent packetization pass (I believe one is being reviewed). I would also like to see a target independent interface for pre-scheduling optimizations that form instruction sequences (e.g. macro-fusion). Patches welcome!
>
> Evan