[LLVMdev] RFC: Machine Instruction Bundle

Fri Dec 2 12:40:02 PST 2011

Machine Instruction Bundle in LLVM

Hi all,

There have been quite a bit of discussions about adding machine instruction bundle to support VLIW targets. I have been pondering what the right representation should be and what kind of impact it might have on the LLVM code generator. I believe I have a fairly good plan now and would like to share with the LLVM community.

Design Criteria

1. The bundle representation must be light weight. We cannot afford to add significant memory or compile time overhead.
2. It must be flexible enough to represent more than VLIW bundles. It should be useful to represent arbitrary sequence of instructions that must be scheduled as a unit. e.g. ARM Thumb2 IT block, Intel compare + branch macro-fusion, or random instruction sequences that are currently modeled as pseudo instructions that are expanded late.
3. Minimize the amount of changes required in the LLVM code generator, especially in target independent passes. It must minimize code duplication (i.e. we don't want code snippets that search for bundle start / end like all the code in the backend that skip over DBG_VALUE).
4. The representation should make it easy for new code to be oblivious of bundles. That is, MI passes should not have to check whether something is a bundle.

Given the above, we can rule out a new class (e.g. MachineInstrBundle) right away. We don't want MachineBasic block to keep a list of MachineInstrBundles since it will require massive amount of code change. So what are the choices?

Bundle Representation

1. A nested MachineInstr: This is the most natural (meaning it looks most like the real HW bundle) representation. It has the nice property that most passes do not have to check if a MI is a bundle.The concern here this can add significant memory overhead if this means adding a ilist or SmallVector field to keep bundled MIs.
2. Add a bit to MachineInstr: The bit means the next MI in the list is part of the same bundle. This is very light weight. However it requires many passes to check wether a MI is part of a bundle.

The solution is a combination of both #1 and #2. Conceptually we want a representation that looks like this:

--------------
|  Bundle    | -------
--------------        \
         |           ----------------
         |           |      MI      |
         |           ----------------
         |                   |
         |           ----------------
         |           |      MI      |
         |           ----------------
         |                   |
         |           ----------------
         |           |      MI      |
         |           ----------------
         |
--------------
|  Bundle    | ------
--------------       \
         |           ----------------
         |           |      MI      |
         |           ----------------
         |                   |
         |           ----------------
         |           |      MI      |
         |           ----------------
         |                   |
         |                  …
         |
--------------
|  Bundle    | ------
--------------       \
         |
        ...

This is #1, a series of nested MI's. However, we are going to store the instructions in the same way as it's done right now, i.e. a list<MachineInstr> on MachineBasicBlocks.  Using #2, we will add a bit to MI that indicates whether it is part of a bundle.

                     ----------------
                     |      MI    * |   (* bit indicates next MI is "glued" to this MI, i.e. in the same bundle)
                     ----------------
                             |
                     ----------------
                     |      MI    * |
                     ----------------
                             |
                     ----------------
                     |      MI      |    (no bit, this is the end of the bundle)
                     --------------
                             |
                     ----------------
                     |      MI    * |   (* a new bundle)
                     ----------------
                             |
                     ----------------
                     |      MI      |
                     ----------------
                             |
                            ...

We are going to hide the complexity in the MachineBasicBlock::iterator instead. That is, the iterator will be changed to visit only the *top level* instructions (i.e. first instruction in each bundle). We will add another iterator that allows client to visit all of the MIs for those passes that want to look into bundles.

We can use the same representation for arbitrary sequence of instructions that cannot be broken up. e.g. Thumb2 IT blocks.

                     ----------------
                     |      MI      |   (just a MI)
                     ----------------
                            |
                     ----------------
                     |      MI    * |   (* Start of Thumb2 IT block)
                     ----------------
                            |
                     ----------------
                     |      MI    * | 
                     ----------------
                            |
                     ----------------
                     |      MI      |   (last MI in the block)
                     ----------------
                            |
                     ----------------
                     |      MI      | 
                     ----------------
                            |
                           ...

This representation can support VLIW (where top level MI's are all start of bundles) or non-VLIW (where there can be mix of MIs and bundles). It is also very cheap since the "Flags" field has plenty of free bits available.

Properties of Bundle

If MI passes can consider each bundle as a single unit, then how are they going to examine properties (i.e. flags and operands) of a MI bundle? Conceptually a the properties of a bundle is the union of the properties of all the MIs inside the bundle. So a bundle reads all the inputs that the individual MIs read and it defines all the outputs of the individual MIs. However, this is not correct when there are intra-bundle dependencies. e.g.

-------------------------
| r0 = op1 r1, r2       |
| r3 = op2 r0<kill>, #c |
-------------------------

r0 should not be considered as a source on the bundle since it's defined inside the bundle and its live range does not extend beyond it. Instead, r0 is a clobber (i.e. dead def).

-------------------------
| r0 = op1 r1, r2       |
| r3 = op2 r0, #c       |
-------------------------
 ...
     = op3 r0, 

r0 is a def, not a use.

What does this mean? It means in order for passes to operate on a bundle at a time, it must be able to visit all the defs and uses of a bundle. We have established that computing the defs and uses of a bundle is not as trivial as taking the union. This is certainly not something we want to re-compute every time! This requires a slight change to the bundle representation.

                     ----------------
                     |    Bundle  * |   (A MI with special opcode "Bundle")
                     ----------------
                             |
                     ----------------
                     |      MI    * |
                     ----------------
                             |
                     ----------------
                     |      MI    * |
                     ----------------
                             |
                     ----------------
                     |      MI      |    (no bit, this is the end of the bundle)
                     --------------
                             |
                     ----------------
                     |    Bundle  * |   (a new bundle)
                     ----------------
                             |
                     ----------------
                     |      MI    * | 
                     ----------------
                             |
                     ----------------
                     |      MI      |
                     ----------------
                             |
                            ...

The pseudo bundle instructions should be used to capture properties of the bundle. When a bundle is finalized the packetizer must add source and def operands to the pseudo bundle instruction. More on this later.

Other properties, such as mayLoad, mayStore, are static properties associated with opcodes. They cannot be copied. We will add APIs to examine properties of MIs which will do the *right thing* for bundles (i.e. look into MIs in bundles).

Packetizing

The current MI flow looks like this:

1. DAG to MI lowering (and pre-RA schedule)
2. MI optimizations (LICM, CSE, etc.)
3. Register allocation super pass
   3a. De-ssa (2-address, phi slim)
   3b. Coalescing
   3c. Actual register allocation
4. Post-RA optimizations
5. PEI
6. Post-RA scheduling

In the hopefully not very distant future it should look like this:

1. DAG to MI lowering (no scheduling!)
2. MI optimizations (LICM, CSE, etc.)
3. Register allocation super pass
   3a. De-ssa (2-address, phi slim)
   3b. Coalescing
   3c. Pre-RA scheduling
   3d. Actual register allocation
4. Post-RA optimizations
5. PEI
6. Re-schedule restores, copies

The current proposal is for "packetization" to be done as part of the "RA super pass". Early MI optimization passes such as LICM do not benefit from operating on bundles. Furthermore, the packetizer should not have to know how to deal with copies which may later be coalesced, phi nodes, or other copy like pseudo instructions.

Packetization should be done in two phases. The first part decides what MIs should be bundled together and it add the "bits" which glued MIs together. This can be done either before pre-RA scheduling. The second part of the packetization should only be done after register allocation is completed. There are two very important reason for this.

1. Packet finalization *must* add source and def operands to the "Bundle" pseudo MI. This allows all later passes to handle they transparently. However, we do not want to do this before register allocation is complete. Otherwise it introduces new defs and uses of virtual registers and that mess up MachineRegisterInfo def-use chains.

e.g. Now vr0 has two defs!
defs: vr0<dead>, vr3, uses: vr1, vr2
----------------------------
| vr0 = op1 vr1, vr2       |
| vr3 = op2 vr0<kill>, #c  |
----------------------------

2. During register allocation, more identity copies will be eliminated while loads, stores, copies, re-materialized instructions will be introduced. It makes sense for the second part of packetization to try to fill these new instructions into empty slots (for VLIW like targets).

So the overall flow should look like this:

1. DAG to MI lowering (no scheduling!)
2. MI optimizations (LICM, CSE, etc.)
3. Register allocation super pass
   3a. De-ssa (2-address, phi slim)
   3b. Coalescing
   3c. Pre-scheduling packetization (optional)
   3d. Pre-RA scheduling (or integrated packetization)
   3e. Post-scheduling packetization (optional)
   3f. Actual register allocation
   3g. Packet finalization
4. Post-RA optimizations
5. PEI
6. Re-schedule restores, copies

Lowering Bundles to MCInst

There is no need to add the equivalent of MI bundle to MCInst. A MI bundle should be concatenated into a single MCInst by storing opcodes as integer operands. e.g.

-------------------------
| r0 = op1 r1, r2       |
| r3 = op2 r0, #c       |
-------------------------

=>

MCInst: op1 r0, r1, r2, op2, r3, r0, #c
or
MCInst: op1 op2 r0, r1, r2, r3, r0, #c

What's Next?

I am hoping to find some time to implement the followings in the near future:
1. Add BUNDLE opcode
2. MachineInstr class changes: new bit, changes to methods such as eraseFromParent(), isIdenticalTo().
3. Change MachineInstr::iterator to skip over bundled MIs. Rename old iterator.
4. Add MachineInstr API to check for instruction properties and switch existing code over.
5. Add API to form a bundle. It would compute the proper def's and use's and add MachineOperands to the bundle MI.
6. Switch Thumb2 IT block to using MI bundles.
7. Add interface for targets to register their own packetization passes.

I would dearly welcome help on any of these tasks especially on 4, 5, 6. I also would not cry if someone beats me to #6 (or actually any of the tasks. :-)

In the longer term, I would like to see a target independent packetization pass (I believe one is being reviewed). I would also like to see a target independent interface for pre-scheduling optimizations that form instruction sequences (e.g. macro-fusion). Patches welcome!

Evan
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20111202/89337255/attachment.html>