[LLVMdev] RFC: Machine Instruction Bundle

Sat Dec 3 07:12:20 PST 2011

Hi,

I'm glad to see some action with regard to static instruction
scheduling and VLIW support in LLVM. I have some questions and
remarks which might not be relevant, as I'm not totally familiar
with LLVM's current code generation framework or with your plan.

On 12/02/2011 10:40 PM, Evan Cheng wrote:
> 2. It must be flexible enough to represent more than VLIW bundles. It should be
> useful to represent arbitrary sequence of instructions that must be scheduled as
> a unit. e.g. ARM Thumb2 IT block, Intel compare + branch macro-fusion, or random
> instruction sequences that are currently modeled as pseudo instructions that are
> expanded late.

The concept of a "VLIW bundle" is to mark a set of instructions that
should/could be executed in *parallel*. A static parallel instruction
schedule for a single instruction cycle, that is.

In other words, with a VLIW target a bundle is not necessarily just "an
atomic, possibly sequentially executed chunk of instructions" or "a set of
instructions that can be executed in parallel but also sequentially".
In some architectures, sequential execution might break the schedule
due to visible function unit pipeline latencies and no hardware interlocking.

Is it wise to mix the two concepts of "parallel instructions" and the looser
"instructions that should be executed together"? The "parallel semantics"
implies changes to how the scheduling is done (the earliest/latest cycle in
which an instruction can be scheduled) and also to, e.g., the register
allocator's live ranges (if registers are allocated on "packetized", i.e.
parallel, code).

Moreover, the definition of a VLIW parallel bundle implies that there cannot
be any "intra-bundle dependencies"; otherwise those instructions could not be
executed in parallel in reality.
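
As a rough sketch of what that property means, here is a small Python check
for such dependencies. The Instr representation (one destination register
plus its source registers) and the function name are made up for the
illustration; they are not LLVM data structures.

```python
from typing import List, NamedTuple, Tuple


class Instr(NamedTuple):
    # Hypothetical representation: one destination register and its sources.
    dst: str
    srcs: Tuple[str, ...]


def intra_bundle_deps(bundle: List[Instr]) -> List[Tuple[int, int, str]]:
    """Return (producer index, consumer index, register) for every register
    that one instruction of the bundle writes and another one reads."""
    deps = []
    for i, producer in enumerate(bundle):
        for j, consumer in enumerate(bundle):
            if i != j and producer.dst in consumer.srcs:
                deps.append((i, j, producer.dst))
    return deps


# The second instruction reads the register the first one writes.
dependent = [Instr("r0", ("r1", "r2")), Instr("r3", ("r0",))]
# No instruction reads a register written inside the same bundle.
independent = [Instr("r0", ("r1", "r2")), Instr("r3", ("r4",))]

print(intra_bundle_deps(dependent))    # [(0, 1, 'r0')]
print(intra_bundle_deps(independent))  # []
```

Under strict parallel semantics only a bundle for which this returns an
empty list can keep the meaning the sequential reading suggests.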

For example, looking at your example of a bundle with "intra-bundle
dependencies":

-------------------------
| r0 = op1 r1, r2       |
| r3 = op2 r0<kill>, #c |
-------------------------

In the case of a static VLIW target, the semantics of this (VLIW) instruction
is that these two "RISC instructions" are executed in parallel, period. Thus,
neither instruction can depend on the other: op2 reads the old value of r0,
not the one written in the same bundle.

It depends on the architecture's data hazard detection support, register file 
bypasses, etc. whether the r0 update of the 1st instruction is available to
the second instruction in the bundle or whether the new r0 value can be read
only by the succeeding instruction bundles. If it is available, the execution
is sequential in reality as op1 must produce the value before op2 can
execute.
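
To make the difference concrete, here is a small Python sketch that runs the
two-instruction bundle above under both interpretations. The operations are
assumptions made only for the illustration (op1 is taken to be an add, op2 a
multiply by the immediate #c = 10), as are the initial register values.

```python
def run_parallel(bundle, regs):
    """Parallel semantics: every instruction reads the register state from
    before the bundle; all results are written back afterwards."""
    results = [(dst, op(*(regs[s] for s in srcs))) for dst, op, srcs in bundle]
    new_regs = dict(regs)
    for dst, value in results:
        new_regs[dst] = value
    return new_regs


def run_sequential(bundle, regs):
    """Sequential semantics: each instruction sees the results of the
    instructions that precede it in the bundle."""
    new_regs = dict(regs)
    for dst, op, srcs in bundle:
        new_regs[dst] = op(*(new_regs[s] for s in srcs))
    return new_regs


C = 10  # stand-in for the immediate #c

bundle = [
    ("r0", lambda a, b: a + b, ("r1", "r2")),  # r0 = op1 r1, r2  (assumed add)
    ("r3", lambda a: a * C, ("r0",)),          # r3 = op2 r0, #c  (assumed mul)
]
regs = {"r0": 100, "r1": 1, "r2": 2, "r3": 0}

print(run_parallel(bundle, regs)["r3"])    # 1000: op2 saw the old r0 (100)
print(run_sequential(bundle, regs)["r3"])  # 30: op2 saw the new r0 (3)
```

Both runs leave r0 = 3; only the value op2 observes differs, which is exactly
the semantic choice the bundle representation has to pin down.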

Itanium machines are an example of "parallel bundle architectures"
(as are, of course, other "more traditional" VLIWs such as the TI C64x [2]):

"EPIC allows compilers to define independent instruction sequences, which allows 
hardware to ignore dependency checks between these instructions.  This same 
hardware functionality in OOO RISC designs is very costly and complex."
[1]

As an example of the "not truly parallel instruction bundles", on the other
hand, we have played a bit with the Cell SPU, which is quite a static
architecture but still has hardware data hazard detection and hardware
interlocking. It would differentiate between your case and the one where the
order is different, because it follows the sequential instruction order in its
hardware data dependence resolution logic and stalls the pipeline (and thus
does not really execute the instructions in parallel) if the sequential order
has data hazards.
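
The order sensitivity can be sketched with a toy dual-issue model in Python.
This is only an illustration of interlocking in general, not the actual SPU
issue rules: it looks at read-after-write hazards only and ignores latencies,
write-after-write hazards and issue-slot restrictions. All names are made up.

```python
def issue_cycles(pair):
    """Toy interlocked dual-issue: the second instruction of a pair issues in
    the same cycle as the first only if it does not read the register the
    first one writes (read-after-write hazard in sequential order)."""
    (dst0, _srcs0), (_dst1, srcs1) = pair
    return 2 if dst0 in srcs1 else 1


op1 = ("r0", ("r1", "r2"))  # r0 = op1 r1, r2
op2 = ("r3", ("r0",))       # r3 = op2 r0, #c

print(issue_cycles((op1, op2)))  # 2: op2 must wait for op1's r0
print(issue_cycles((op2, op1)))  # 1: op2 reads the old r0, no stall
```

The same two instructions thus cost a stall in one order and none in the
other, whereas a strictly parallel bundle has no notion of order at all.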

As for how to actually represent the (parallel) instruction bundles, I do not
have a strong opinion, as long as the semantic difference between a "parallel
bundle" and "just a chunk of instructions that should be executed together" is
made clear and adhered to everywhere in the code generation.

[1] http://www.dig64.org/about/Itanium2_white_paper_public.pdf
[2] http://www.ti.com/lit/ug/spru395b/spru395b.pdf

Best regards,
-- 
Pekka from the TCE project
http://tce.cs.tut.fi