[llvm-dev] [RFC] A new CodeEmitterGen infrastructure for variable-length instructions

Tue Dec 7 02:07:37 PST 2021

>From an M68k point-of-view, I like this. I'm a little uneasy about the
`operand` node though. Is it possible that there will ever be an operand
type which needs multiple encodings?

I also wonder whether $dec mode and slicing inputs might be better handled
by separate nodes. Would something like `(seq (flip 0b0101010) (slice
my_operand.Base, 4, 8))`, be possible?

As you say, I suspect that implementing an interpreter-style disassembler
generator like the fixed length one would be fairly straight-forward (and
much better than the M68k disassembler implementation I provided).

Ricky,

On Mon, 6 Dec 2021 at 03:34, Min-Yih Hsu <minyihh at uci.edu> wrote:

> (This is a long proposal. If you prefer, here is the web version:
> https://gist.github.com/mshockwave/66e98d099256deefc062633909bb7b5b)
>
> ## Background
> CodeEmitterGen is a TableGen backend that generates instruction encoder
> functions for `MCCodeEmitter` from a concise TableGen syntax. It is,
> however, almost exclusively designed for targets that use fixed-length
> instructions. It's nearly impossible to use this infrastructure to describe
> instruction encoding scheme for ISAs with variable-length instructions,
> like X86 and M68k.
>
> To have a better understanding on this problem, let's look at an example.
> For a fixed-length instruction ISA, developers write the following TableGen
> syntax to describe an instruction encoding:
> ```
> class MyInst<(outs GR64:$dst), (ins GR64, i16imm:$imm)> : Instruction {
>     bits<32> Inst;
>
>     bits<4> dst;
>     bits<16> imm;
>     let Inst{31-28} = 0b1011;
>     ...
>     let Inst{19-16} = dst;
>     let Inst{15-0} = imm;
> }
> ```
> The `Inst` field tells us the length of an instruction -- 32 bits in this
> case. Each bit in this field describes the encoded value, which is either a
> concrete value or symbolic one like `dst` and `imm` in the example above.
> The `dst` and `imm` variables correspond to the output operand (`$dst`) and
> the second input operand (`$imm`), respectively. Meaning, the encoder
> function (generated by CodeEmitterGen) will eventually insert the encoding
> for these two operands into the right bit ranges (bit 19\~16 for `dst` and
> 15\~0 for `imm`).
>
> Though this TableGen syntax fits well for fixed-length instructions, it
> imposes some difficulties to instructions with variable length and memory
> poerands with complex addressing modes:
>   1. The bit width of the `Inst` field is fixed. Though we can declare the
> field with maximum instruction size in the ISA, it requires extra code to
> adjust the final instruction size.
>   2. Operand encoding can only be placed at fixed bit positions. However,
> the size of an operand in a variable-length instruction might vary.
>   3. In the situation where a single logical operand is consisting of
> multiple MachineOperand-s / MCOperand-s, the current syntax cannot
> reference a sub-operand. Which means we can only reference the entire
> logical operand at places where we actually should put sub-operands. Making
> the TG code less readable and bring more burden to the operand encoding
> functions (because they don't know which sub-operand to encode).
>
> In short, we need a more flexible CodeEmitterGen infrastructure for
> variable-length instructions: describe the instruction encoding in a
> "position independent" fashion and be able to reference sub-operands with
> ease.
>
> ## Proposal
> We propose a new infrastructure, called VarLenCodeEmitterGen, to solve the
> aforementioned shortcomings. It is consisting of new TableGen syntax and
> some modifications to the existing CodeEmitterGen TableGen backend.
>
> Suppose we are dealing with an instruction `MyVarInst`:
> ```
> class MyMemOperand<dag sub_ops> : Operand<iPTR> {
>     let MIOperandInfo = sub_ops;
> }
>
> class MyVarInst<MyMemOperand memory_op> : Instruction {
>     let OutOperandList = (outs GR64:$dst);
>     let InOperandList  = (ins memory_operand:$src);
> }
> ```
> It has the following encoding format:
> ```
> 15 8 0
> ----------------------------------------------------
> | Fixed bits | Sub-operand 0 in source operand |
> ----------------------------------------------------
> X 16
> ----------------------------------------------------
> | Sub-operand 1 in source operand |
> ----------------------------------------------------
> X + 4 X + 1
> ------------------------------------
> | Destination register |
> ------------------------------------
> ```
> We have two different kinds of memory operands:
> ```
> def MemOp16 : MyMemOperand<(ops GR64:$reg, i16imm:$offset)>;
> def MemOp16 : MyMemOperand<(ops GR64:$reg, i32imm:$offset)>;
>
> def FOO16 : MyVarInst<MemOp16>;
> def FOO32 : MyVarInst<MemOp32>;
> ```
> So the size of `FOO16` and `FOO32` will be 36 and 52 bits, respectively.
>
> To express the encoding, first, we modify `MyVarInst` and `MyMemOperand`:
> ```
> class MyMemOperand<dag sub_ops> : Operand<iPTR> {
>     let MIOperandInfo = sub_ops;
>     dag Base;
>     dag Extension;
> }
>
> class MyVarInst<MyMemOperand memory_op> : Instruction {
>     dag Inst;
>
>     let OutOperandList = (outs GR64:$dst);
>     let InOperandList  = (ins memory_op:$src);
>
>     let Inst = (seq
>         (seq:$dec /*Fixed bits*/0b10110111, memory_op.Base),
>         memory_op.Extension,
>         // Destination register
>         (operand "$dst", 4)
>     );
> }
> ```
> Then, we use a slightly different representation
> for `MemOp16` and `MemOp32`:
> ```
> class MemOp16<string op_name> : MyMemOperand<(ops GR64:$reg,
> i16imm:$offset)> {
>     let Base = (operand "$"#op_name#".reg", 8);
>     let Extension = (operand "$"#op_name#".offset", 16);
> }
>
> class MemOp32<string op_name> : MyMemOperand<(ops GR64:$reg,
> i32imm:$offset)> {
>     let Base = (operand "$"#op_name#".reg", 8);
>     let Extension = (operand "$"#op_name#".offset", 32);
> }
>
> def FOO16 : MyVarInst<MemOp16<"src">>;
> def FOO32 : MyVarInst<MemOp32<"src">>;
> ```
>
> This new TableGen syntax uses `dag` rather than `bits<N>` for
> the `Inst` field. Allowing instructions to place their operand (and
> sub-operand) encodings without worrying about the actual bit positions. The
> new syntax is underpinned by two new DAG operators: `seq` and `operand`.
>
> The `seq` operator sequentially places its arguments -- fragments of
> encoding -- from LSB to MSB. If the operator is "tagged" by `$dec`, it goes
> from MSB to LSB instead. The `operand` operator references the encoding of
> an operand. Its first DAG argument is a string referencing the name of an
> operand in either `InOperandList` or `OutOperandList` of an instruction. We
> can also reference an sub-operand using syntax like `$<operand
> name>.<sub-operand name>`. The second DAG argument for `operand` is the bit
> width of the encoded operand. The other variant of `operand` is having two
> arguments instead of one that follow the operand referencing string. More
> specifically:
> ```
> (operand "$src.reg", 8, 4)
> ```
> In this case, 8 and 4 represents a bit range -- high bit and low bit,
> respectively -- to the encoded `$src.reg` operand.
>
> Finally, a new sub-component added to the existing CodeEmitterGen TableGen
> backend, VarLenCodeEmitterGen, will turn the above syntax into a C++
> encoder function -- `MCCodeEmitter::getBinaryCodeForInstr` -- that uses the
> same mechanism as the fixed-length instruction version (except few details,
> like it always uses APInt to store the result).
>
> We think the proposed solution has the following advantages:
>   - Flexible and versatile in terms of expressing instruction encodings.
>   - The TableGen syntax is easy to read, write and understand.
>   - Only adds a few new TableGen syntax.
>   - Tightly integrated with the existing CodeEmitterGen.
>
> ### Previous approaches
> Both X86 and M68k -- the only two LLVM targets with variable-length
> instructions -- are using custom instruction encoders. X86 leverages
> TSFlags in `MCInst` to carry encoding info. Simply speaking, X86 enumerates
> and numbers every possible combinations of operands and stores the
> corresponding index into a segment of TSFlags for an instruction. This
> approach, of course, requires none trivial amount of workforce to maintain.
>
> M68k, on the other hand, uses an obscured infrastructure called code
> beads. It is conceptually similar to the VarLenCodeEmitterGen we're
> proposing here -- concatenating encoding fragments. Except that the syntax
> is bulky and it uses too many specialized TableGen infrastructures,
> including a separate TableGen backend, that make the maintainence really
> really hard.
>
> ## Patches
> TableGen modifications: https://reviews.llvm.org/D115128
>
> ## FAQ
>   - Do I need to toggle some flags -- either a command line flag or a
> TableGen bit field -- to use the new code emitter scheme?
>     - No, having a `dag` type `Inst` field will automatically opt-in this
> new code emitter scheme.
>   - Can I adopt this for fixed-length instructions?
>     - Absolutely yes. But it's not recommended because CodeEmitterGen can
> generate more optimal encoder functions for fixed-length instructions. The
> TableGen syntax of CodeEmitterGen makes more sense for fixed-length
> instructions, too.
>   - Can X86 adopt this infrastructure?
>     - Theoritically, yes (In practice? I dunno).
>   - What about the disassembler? Can we TableGen-enerate the corresponding
> disassembling functions?
>     - Since we have a structural description of the encoded instruction,
> it's probably easier to create a disassembler from the new TableGen syntax.
> But I haven't worked on that yet.
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20211207/5ee00468/attachment.html>