[LLVMdev] RFC: Machine Instruction Bundle

Tue Dec 6 14:34:22 PST 2011

Indeed, there are strict VLIW architectures out there and VLIW 
architectures that leverage some aspects of conventional architectures, 
sometimes taking advantage of how the microarchitecture was implemented.

-------------------------
| r0 = op1 r1, r2       |
| r3 = op2 r0<kill>, #c |
-------------------------

For instance, some such VLIW architectures let the programmer specify 
whether "r0" in "op2" takes the value from t-1 (before the bundle) or 
from t+1 (the result produced by "op1").  This is possible, if nothing 
else, because result forwarding happens close enough in the pipeline for 
the value to be used within the same bundle.
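
To make the two readings concrete, here is a minimal sketch in Python. 
It is not LLVM code and it models no particular target: the bundle 
encoding, the ".new" operand suffix, the register values and the choice 
of add/multiply for "op1"/"op2" are all assumptions made up for 
illustration.

```python
# Hypothetical model: a bundle is a list of (dest, op, sources) tuples.
# A source is an int immediate, a register name (read as of t-1, i.e.
# before the bundle), or a register name with a made-up ".new" suffix
# (read the value produced earlier in the same bundle, i.e. t+1).

def read_operand(src, before, after):
    """Return the value of one source operand."""
    if isinstance(src, int):
        return src                 # immediate
    if src.endswith(".new"):
        return after[src[:-4]]     # forwarded result from this bundle
    return before[src]             # value from before the bundle


def run_bundle(bundle, regs):
    """Execute one bundle and return the resulting register file."""
    before = dict(regs)            # state at t-1, never modified
    after = dict(regs)             # state being built up by the bundle
    for dest, op, srcs in bundle:
        args = [read_operand(s, before, after) for s in srcs]
        after[dest] = op(*args)
    return after


def make_bundle(r0_operand):
    """Build the example bundle; r0_operand is "r0" or "r0.new"."""
    return [
        ("r0", lambda a, b: a + b, ["r1", "r2"]),      # r0 = op1 r1, r2
        ("r3", lambda a, c: a * c, [r0_operand, 10]),  # r3 = op2 r0, #c
    ]


regs = {"r0": 1, "r1": 2, "r2": 3, "r3": 0}
old_value = run_bundle(make_bundle("r0"), regs)      # op2 reads r0 from t-1
new_value = run_bundle(make_bundle("r0.new"), regs)  # op2 reads op1's result
```

With these made-up values, "op2" computes r3 from the stale r0 in the 
first case and from the result of "op1" in the second, so the same 
bundle yields two different register files depending on how the target 
interprets the operand.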

It seems to me that such subtleties are better left to the 
target-dependent code, and that the target-independent code should 
refrain from making strict interpretations of what a bundle should look 
like.

When it gets to the MC layer, it seems to me that overloading the MCI 
operands with opcodes is a bit unusual, but, at first glance, it might 
not be too hard to handle.

-- 
Evandro Menezes        Austin, TX        emenezes at codeaurora.org
Qualcomm Innovation Center, Inc is a member of Code Aurora Forum

On 12/03/11 09:12, Pekka Jääskeläinen wrote:
> Hi,
>
> I'm glad to see some action with regard to static instruction
> scheduling and VLIW support in LLVM. I have some questions and
> remarks which might not be relevant as I'm not totally familiar
> with the current code generation framework of LLVM nor your plan.
>
> On 12/02/2011 10:40 PM, Evan Cheng wrote:
>> 2. It must be flexible enough to represent more than VLIW bundles. It should be
>> useful to represent an arbitrary sequence of instructions that must be scheduled as
>> a unit. e.g. ARM Thumb2 IT block, Intel compare + branch macro-fusion, or random
>> instruction sequences that are currently modeled as pseudo instructions that are
>> expanded late.
>
> The concept of a "VLIW bundle" is to mark a set of instructions that
> should/could be executed in *parallel*. A static parallel instruction
> schedule for a single instruction cycle, that is.
>
> In other words, with a VLIW target a bundle might not be just "an atomic,
> possibly sequentially executed chunk of instructions" or "a set of
> instructions that can be executed in parallel but also sequentially".
> In some architectures, the sequential execution might break the schedule
> due to visible function unit pipeline latencies and no hardware interlocking.
>
> Is it wise to mix the two concepts of "parallel instructions" and the looser
> "instructions that should be executed together"? The "parallel semantics"
> implies changes to how the scheduling is done (the earliest/latest cycle where
> an instruction can be scheduled) and also, e.g., the register allocation's live
> ranges (if allocating registers on "packetized", i.e. parallel, code).
>
> Moreover, the definition of a VLIW parallel bundle implies that there can be
> no "intra-bundle dependencies"; otherwise those instructions could not be
> executed in parallel in reality.
>
> For example, looking at your example of a bundle with "intra-bundle
> dependencies":
>
> -------------------------
> | r0 = op1 r1, r2       |
> | r3 = op2 r0<kill>, #c |
> -------------------------
>
> In the case of a static VLIW target, the semantics of this bundle is that these
> two RISC instructions "are executed in parallel, period". Thus, the first
> instruction cannot depend on the latter (or the other way around), and op2 reads
> the old value of r0, not the one written in the same bundle.
>
> It depends on the architecture's data hazard detection support, register file
> bypasses, etc. whether the r0 update of the 1st instruction is available to
> the second instruction in the bundle or whether the new r0 value can be read
> only by the succeeding instruction bundles. If it is available, the execution
> is sequential in reality as op1 must produce the value before op2 can
> execute.
>
> Itanium machines are an example of "parallel bundle architectures"
> (as are, of course, other "more traditional" VLIWs, like the TI C64x [2]):
>
> "EPIC allows compilers to define independent instruction sequences, which allows
> hardware to ignore dependency checks between these instructions.  This same
> hardware functionality in OOO RISC designs is very costly and complex."
> [1]
>
> As an example of the "not truly parallel instruction bundles", on the other
> hand, we have played a bit with the Cell SPU, which is a quite static architecture
> but still has hardware data hazard detection and hardware interlocking. It
> would differentiate between your case and the one where the order is different
> because it follows the sequential instruction order in its hardware data
> dependence resolving logic and stalls the pipeline (thus does not really
> execute the instructions in parallel) if the sequential order has data hazards.
>
> For how to actually represent the (parallel) instruction bundles I do not have
> a strong opinion, as long as the semantic difference between a "parallel
> bundle" and "just a chunk of instructions that should be executed together" is
> made clear and adhered to everywhere in the code generation.
>
> [1] http://www.dig64.org/about/Itanium2_white_paper_public.pdf
> [2] http://www.ti.com/lit/ug/spru395b/spru395b.pdf
>
> Best regards,