[LLVMdev] RFC: Machine Instruction Bundle

Sergei Larin slarin at codeaurora.org
Mon Dec 5 08:59:20 PST 2011


Yes it is. Or it could be. All I am saying - if we make this implied
assumption (that the order of instructions in the list == semantical order
in the bundle)  it needs to be explicitly stated. Otherwise a pass developer
could really treat a bundle as truly "parallel" entity potentially causing
nasty things. 

 

Sergei Larin

 

--

Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum.

 

From: Evan Cheng [mailto:evan.cheng at apple.com] 
Sent: Friday, December 02, 2011 6:07 PM
To: Sergei Larin
Cc: 'LLVM Dev'
Subject: Re: [LLVMdev] RFC: Machine Instruction Bundle

 

 

On Dec 2, 2011, at 2:41 PM, Sergei Larin wrote:





. and yes, one more thing. On some architectures it might be desirable to
know the _order_ of instructions in the packet. That is a bit trickier..

 

Isn't that just the order of the instructions in the list? I don't see
anything that prevents getting the order of instructions. It might require
iterator over MIs in the packet. But for targets like VLIW, the # of
instructions should be small. Also note, passes that require this kind of
information should aggregate and maintain the data itself. Information like
schedule cycle belongs in "schedule unit", not in MI.

 

Evan





 

--

Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum.

 

From: llvmdev-bounces at cs.uiuc.edu [mailto:llvmdev-bounces at cs.uiuc.edu] On
Behalf Of Evan Cheng
Sent: Friday, December 02, 2011 2:40 PM
To: LLVM Dev
Subject: [LLVMdev] RFC: Machine Instruction Bundle

 

Machine Instruction Bundle in LLVM

 

Hi all,

 

There have been quite a bit of discussions about adding machine instruction
bundle to support VLIW targets. I have been pondering what the right
representation should be and what kind of impact it might have on the LLVM
code generator. I believe I have a fairly good plan now and would like to
share with the LLVM community.

 

Design Criteria

 

1. The bundle representation must be light weight. We cannot afford to add
significant memory or compile time overhead.

2. It must be flexible enough to represent more than VLIW bundles. It should
be useful to represent arbitrary sequence of instructions that must be
scheduled as a unit. e.g. ARM Thumb2 IT block, Intel compare + branch
macro-fusion, or random instruction sequences that are currently modeled as
pseudo instructions that are expanded late.

3. Minimize the amount of changes required in the LLVM code generator,
especially in target independent passes. It must minimize code duplication
(i.e. we don't want code snippets that search for bundle start / end like
all the code in the backend that skip over DBG_VALUE).

4. The representation should make it easy for new code to be oblivious of
bundles. That is, MI passes should not have to check whether something is a
bundle.

 

Given the above, we can rule out a new class (e.g. MachineInstrBundle) right
away. We don't want MachineBasic block to keep a list of MachineInstrBundles
since it will require massive amount of code change. So what are the
choices?

 

Bundle Representation

 

1. A nested MachineInstr: This is the most natural (meaning it looks most
like the real HW bundle) representation. It has the nice property that most
passes do not have to check if a MI is a bundle.The concern here this can
add significant memory overhead if this means adding a ilist or SmallVector
field to keep bundled MIs.

2. Add a bit to MachineInstr: The bit means the next MI in the list is part
of the same bundle. This is very light weight. However it requires many
passes to check wether a MI is part of a bundle.

 

The solution is a combination of both #1 and #2. Conceptually we want a
representation that looks like this:

 

--------------

|  Bundle    | -------

--------------        \

         |           ----------------

         |           |      MI      |

         |           ----------------

         |                   |

         |           ----------------

         |           |      MI      |

         |           ----------------

         |                   |

         |           ----------------

         |           |      MI      |

         |           ----------------

         |

--------------

|  Bundle    | ------

--------------       \

         |           ----------------

         |           |      MI      |

         |           ----------------

         |                   |

         |           ----------------

         |           |      MI      |

         |           ----------------

         |                   |

         |                  .

         |

--------------

|  Bundle    | ------

--------------       \

         |

        ...

 

 

This is #1, a series of nested MI's. However, we are going to store the
instructions in the same way as it's done right now, i.e. a
list<MachineInstr> on MachineBasicBlocks.  Using #2, we will add a bit to MI
that indicates whether it is part of a bundle.

 

                     ----------------

                     |      MI    * |   (* bit indicates next MI is "glued"
to this MI, i.e. in the same bundle)

                     ----------------

                             |

                     ----------------

                     |      MI    * |

                     ----------------

                             |

                     ----------------

                     |      MI      |    (no bit, this is the end of the
bundle)

                     --------------

                             |

                     ----------------

                     |      MI    * |   (* a new bundle)

                     ----------------

                             |

                     ----------------

                     |      MI      |

                     ----------------

                             |

                            ...

 

We are going to hide the complexity in the MachineBasicBlock::iterator
instead. That is, the iterator will be changed to visit only the *top level*
instructions (i.e. first instruction in each bundle). We will add another
iterator that allows client to visit all of the MIs for those passes that
want to look into bundles.

 

We can use the same representation for arbitrary sequence of instructions
that cannot be broken up. e.g. Thumb2 IT blocks.

 

                     ----------------

                     |      MI      |   (just a MI)

                     ----------------

                            |

                     ----------------

                     |      MI    * |   (* Start of Thumb2 IT block)

                     ----------------

                            |

                     ----------------

                     |      MI    * | 

                     ----------------

                            |

                     ----------------

                     |      MI      |   (last MI in the block)

                     ----------------

                            |

                     ----------------

                     |      MI      | 

                     ----------------

                            |

                           ...

 

This representation can support VLIW (where top level MI's are all start of
bundles) or non-VLIW (where there can be mix of MIs and bundles). It is also
very cheap since the "Flags" field has plenty of free bits available.

 

Properties of Bundle

 

If MI passes can consider each bundle as a single unit, then how are they
going to examine properties (i.e. flags and operands) of a MI bundle?
Conceptually a the properties of a bundle is the union of the properties of
all the MIs inside the bundle. So a bundle reads all the inputs that the
individual MIs read and it defines all the outputs of the individual MIs.
However, this is not correct when there are intra-bundle dependencies. e.g.

 

-------------------------

| r0 = op1 r1, r2       |

| r3 = op2 r0<kill>, #c |

-------------------------

 

r0 should not be considered as a source on the bundle since it's defined
inside the bundle and its live range does not extend beyond it. Instead, r0
is a clobber (i.e. dead def).

 

-------------------------

| r0 = op1 r1, r2       |

| r3 = op2 r0, #c       |

-------------------------

 ...

     = op3 r0, 

 

r0 is a def, not a use.

 

What does this mean? It means in order for passes to operate on a bundle at
a time, it must be able to visit all the defs and uses of a bundle. We have
established that computing the defs and uses of a bundle is not as trivial
as taking the union. This is certainly not something we want to re-compute
every time! This requires a slight change to the bundle representation.

 

                     ----------------

                     |    Bundle  * |   (A MI with special opcode "Bundle")

                     ----------------

                             |

                     ----------------

                     |      MI    * |

                     ----------------

                             |

                     ----------------

                     |      MI    * |

                     ----------------

                             |

                     ----------------

                     |      MI      |    (no bit, this is the end of the
bundle)

                     --------------

                             |

                     ----------------

                     |    Bundle  * |   (a new bundle)

                     ----------------

                             |

                     ----------------

                     |      MI    * | 

                     ----------------

                             |

                     ----------------

                     |      MI      |

                     ----------------

                             |

                            ...

 

The pseudo bundle instructions should be used to capture properties of the
bundle. When a bundle is finalized the packetizer must add source and def
operands to the pseudo bundle instruction. More on this later.

 

Other properties, such as mayLoad, mayStore, are static properties
associated with opcodes. They cannot be copied. We will add APIs to examine
properties of MIs which will do the *right thing* for bundles (i.e. look
into MIs in bundles).

 

Packetizing

 

The current MI flow looks like this:

 

1. DAG to MI lowering (and pre-RA schedule)

2. MI optimizations (LICM, CSE, etc.)

3. Register allocation super pass

   3a. De-ssa (2-address, phi slim)

   3b. Coalescing

   3c. Actual register allocation

4. Post-RA optimizations

5. PEI

6. Post-RA scheduling

 

In the hopefully not very distant future it should look like this:

 

1. DAG to MI lowering (no scheduling!)

2. MI optimizations (LICM, CSE, etc.)

3. Register allocation super pass

   3a. De-ssa (2-address, phi slim)

   3b. Coalescing

   3c. Pre-RA scheduling

   3d. Actual register allocation

4. Post-RA optimizations

5. PEI

6. Re-schedule restores, copies

 

The current proposal is for "packetization" to be done as part of the "RA
super pass". Early MI optimization passes such as LICM do not benefit from
operating on bundles. Furthermore, the packetizer should not have to know
how to deal with copies which may later be coalesced, phi nodes, or other
copy like pseudo instructions.

 

Packetization should be done in two phases. The first part decides what MIs
should be bundled together and it add the "bits" which glued MIs together.
This can be done either before pre-RA scheduling. The second part of the
packetization should only be done after register allocation is completed.
There are two very important reason for this.

 

1. Packet finalization *must* add source and def operands to the "Bundle"
pseudo MI. This allows all later passes to handle they transparently.
However, we do not want to do this before register allocation is complete.
Otherwise it introduces new defs and uses of virtual registers and that mess
up MachineRegisterInfo def-use chains.

 

e.g. Now vr0 has two defs!

defs: vr0<dead>, vr3, uses: vr1, vr2

----------------------------

| vr0 = op1 vr1, vr2       |

| vr3 = op2 vr0<kill>, #c  |

----------------------------

 

2. During register allocation, more identity copies will be eliminated while
loads, stores, copies, re-materialized instructions will be introduced. It
makes sense for the second part of packetization to try to fill these new
instructions into empty slots (for VLIW like targets).

 

So the overall flow should look like this:

 

1. DAG to MI lowering (no scheduling!)

2. MI optimizations (LICM, CSE, etc.)

3. Register allocation super pass

   3a. De-ssa (2-address, phi slim)

   3b. Coalescing

   3c. Pre-scheduling packetization (optional)

   3d. Pre-RA scheduling (or integrated packetization)

   3e. Post-scheduling packetization (optional)

   3f. Actual register allocation

   3g. Packet finalization

4. Post-RA optimizations

5. PEI

6. Re-schedule restores, copies

 

Lowering Bundles to MCInst

 

There is no need to add the equivalent of MI bundle to MCInst. A MI bundle
should be concatenated into a single MCInst by storing opcodes as integer
operands. e.g.

 

-------------------------

| r0 = op1 r1, r2       |

| r3 = op2 r0, #c       |

-------------------------

 

=>

 

MCInst: op1 r0, r1, r2, op2, r3, r0, #c

or

MCInst: op1 op2 r0, r1, r2, r3, r0, #c

 

What's Next?

 

I am hoping to find some time to implement the followings in the near
future:

1. Add BUNDLE opcode

2. MachineInstr class changes: new bit, changes to methods such as
eraseFromParent(), isIdenticalTo().

3. Change MachineInstr::iterator to skip over bundled MIs. Rename old
iterator.

4. Add MachineInstr API to check for instruction properties and switch
existing code over.

5. Add API to form a bundle. It would compute the proper def's and use's and
add MachineOperands to the bundle MI.

6. Switch Thumb2 IT block to using MI bundles.

7. Add interface for targets to register their own packetization passes.

 

I would dearly welcome help on any of these tasks especially on 4, 5, 6. I
also would not cry if someone beats me to #6 (or actually any of the tasks.
:-)

 

In the longer term, I would like to see a target independent packetization
pass (I believe one is being reviewed). I would also like to see a target
independent interface for pre-scheduling optimizations that form instruction
sequences (e.g. macro-fusion). Patches welcome!

 

Evan

 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20111205/af5cdd8f/attachment.html>


More information about the llvm-dev mailing list