[llvm-dev] [RFC] A new CodeEmitterGen infrastructure for variable-length instructions

Min-Yih Hsu via llvm-dev llvm-dev at lists.llvm.org
Tue Dec 7 03:54:55 PST 2021


Thanks for the feedbacks!

> On Dec 7, 2021, at 6:07 PM, Ricky Taylor <rickytaylor26 at gmail.com> wrote:
> 
> From an M68k point-of-view, I like this. I'm a little uneasy about the `operand` node though. Is it possible that there will ever be an operand type which needs multiple encodings?

Do you mean an operand or sub-operand might have different encodings depend on its context / bit positions? I don’t know about other (future) architectures, M68k doesn’t have this issue: Each memory operand is consisting of (multiple) primitive construction(s) like registers and immediate, that is, sub-operands. While a sub-operand might be placed in arbitrary positions, each sub-operand has a fixed encoding.

I can think of a solution for operands / sub-operands whose encodings depend on its context: passing custom encoder function name into `operand`. So if you write:
```
(operand “$foo”, 4, “encodeFoo”)
```
The foo operand at this particular bit position will be encoded by `encodeFoo`.
(The current CodeEmitterGen has a similar stuff — via the `EncoderMethod` field in each operand. But that one is bit-position-invariant. So an operand can only use a single custom encoder regardless of its context / bit-position)
BTW, this again shows the advantage of using DAG for carrying encoding info: It’s easy to augment with new parameters / features.

> 
> I also wonder whether $dec mode and slicing inputs might be better handled by separate nodes. Would something like `(seq (flip 0b0101010) (slice my_operand.Base, 4, 8))`, be possible?

IIUC, you’re proposing to replace`(seq (seq:$dec 0b0101010), (operand “…”, 4, 8))` with the new syntax.

I think it’s a good idea because then we’re not overloading a single DAG operator. It’s trivial to change, too.

Nevertheless, I haven’t implemented `seq:$dec` for fixed bits value so actually you can’t write `(seq:$dec 0b0101010)` (This is not hard to implement though).

> 
> As you say, I suspect that implementing an interpreter-style disassembler generator like the fixed length one would be fairly straight-forward (and much better than the M68k disassembler implementation I provided).

Good to hear!

Cheers,
-Min
> 
> Ricky,
> 
> On Mon, 6 Dec 2021 at 03:34, Min-Yih Hsu <minyihh at uci.edu <mailto:minyihh at uci.edu>> wrote:
> (This is a long proposal. If you prefer, here is the web version: https://gist.github.com/mshockwave/66e98d099256deefc062633909bb7b5b <https://gist.github.com/mshockwave/66e98d099256deefc062633909bb7b5b>)
> 
> ## Background
> CodeEmitterGen is a TableGen backend that generates instruction encoder functions for `MCCodeEmitter` from a concise TableGen syntax. It is, however, almost exclusively designed for targets that use fixed-length instructions. It's nearly impossible to use this infrastructure to describe instruction encoding scheme for ISAs with variable-length instructions, like X86 and M68k.
> 
> To have a better understanding on this problem, let's look at an example. For a fixed-length instruction ISA, developers write the following TableGen syntax to describe an instruction encoding:
> ```
> class MyInst<(outs GR64:$dst), (ins GR64, i16imm:$imm)> : Instruction {
>     bits<32> Inst;
> 
>     bits<4> dst;
>     bits<16> imm;
>     let Inst{31-28} = 0b1011;
>     ...
>     let Inst{19-16} = dst;
>     let Inst{15-0} = imm;
> }
> ```
> The `Inst` field tells us the length of an instruction -- 32 bits in this case. Each bit in this field describes the encoded value, which is either a concrete value or symbolic one like `dst` and `imm` in the example above. The `dst` and `imm` variables correspond to the output operand (`$dst`) and the second input operand (`$imm`), respectively. Meaning, the encoder function (generated by CodeEmitterGen) will eventually insert the encoding for these two operands into the right bit ranges (bit 19\~16 for `dst` and 15\~0 for `imm`).
> 
> Though this TableGen syntax fits well for fixed-length instructions, it imposes some difficulties to instructions with variable length and memory poerands with complex addressing modes:
>   1. The bit width of the `Inst` field is fixed. Though we can declare the field with maximum instruction size in the ISA, it requires extra code to adjust the final instruction size.
>   2. Operand encoding can only be placed at fixed bit positions. However, the size of an operand in a variable-length instruction might vary.
>   3. In the situation where a single logical operand is consisting of multiple MachineOperand-s / MCOperand-s, the current syntax cannot reference a sub-operand. Which means we can only reference the entire logical operand at places where we actually should put sub-operands. Making the TG code less readable and bring more burden to the operand encoding functions (because they don't know which sub-operand to encode).
> 
> In short, we need a more flexible CodeEmitterGen infrastructure for variable-length instructions: describe the instruction encoding in a "position independent" fashion and be able to reference sub-operands with ease.
> 
> ## Proposal
> We propose a new infrastructure, called VarLenCodeEmitterGen, to solve the aforementioned shortcomings. It is consisting of new TableGen syntax and some modifications to the existing CodeEmitterGen TableGen backend.
> 
> Suppose we are dealing with an instruction `MyVarInst`:
> ```
> class MyMemOperand<dag sub_ops> : Operand<iPTR> {
>     let MIOperandInfo = sub_ops;
> }
> 
> class MyVarInst<MyMemOperand memory_op> : Instruction {
>     let OutOperandList = (outs GR64:$dst);
>     let InOperandList  = (ins memory_operand:$src);
> }
> ```
> It has the following encoding format:
> ```
> 15             8                                   0
> ----------------------------------------------------
> |  Fixed bits  |  Sub-operand 0 in source operand  |
> ----------------------------------------------------
> X                                                 16
> ----------------------------------------------------
> |         Sub-operand 1 in source operand          |
> ----------------------------------------------------
>                 X + 4                          X + 1
>                 ------------------------------------
>                 |       Destination register       |
>                 ------------------------------------
> ```
> We have two different kinds of memory operands:
> ```
> def MemOp16 : MyMemOperand<(ops GR64:$reg, i16imm:$offset)>;
> def MemOp16 : MyMemOperand<(ops GR64:$reg, i32imm:$offset)>;
> 
> def FOO16 : MyVarInst<MemOp16>;
> def FOO32 : MyVarInst<MemOp32>;
> ```
> So the size of `FOO16` and `FOO32` will be 36 and 52 bits, respectively.
> 
> To express the encoding, first, we modify `MyVarInst` and `MyMemOperand`:
> ```
> class MyMemOperand<dag sub_ops> : Operand<iPTR> {
>     let MIOperandInfo = sub_ops;
>     dag Base;
>     dag Extension;
> }
> 
> class MyVarInst<MyMemOperand memory_op> : Instruction {
>     dag Inst;
> 
>     let OutOperandList = (outs GR64:$dst);
>     let InOperandList  = (ins memory_op:$src);
> 
>     let Inst = (seq
>         (seq:$dec /*Fixed bits*/0b10110111, memory_op.Base),
>         memory_op.Extension,
>         // Destination register
>         (operand "$dst", 4)
>     );
> }
> ```
> Then, we use a slightly different representation for `MemOp16` and `MemOp32`:
> ```
> class MemOp16<string op_name> : MyMemOperand<(ops GR64:$reg, i16imm:$offset)> {
>     let Base = (operand "$"#op_name#".reg", 8);
>     let Extension = (operand "$"#op_name#".offset", 16);
> }
> 
> class MemOp32<string op_name> : MyMemOperand<(ops GR64:$reg, i32imm:$offset)> {
>     let Base = (operand "$"#op_name#".reg", 8);
>     let Extension = (operand "$"#op_name#".offset", 32);
> }
> 
> def FOO16 : MyVarInst<MemOp16<"src">>;
> def FOO32 : MyVarInst<MemOp32<"src">>;
> ```
> 
> This new TableGen syntax uses `dag` rather than `bits<N>` for the `Inst` field. Allowing instructions to place their operand (and sub-operand) encodings without worrying about the actual bit positions. The new syntax is underpinned by two new DAG operators: `seq` and `operand`.
> 
> The `seq` operator sequentially places its arguments -- fragments of encoding -- from LSB to MSB. If the operator is "tagged" by `$dec`, it goes from MSB to LSB instead. The `operand` operator references the encoding of an operand. Its first DAG argument is a string referencing the name of an operand in either `InOperandList` or `OutOperandList` of an instruction. We can also reference an sub-operand using syntax like `$<operand name>.<sub-operand name>`. The second DAG argument for `operand` is the bit width of the encoded operand. The other variant of `operand` is having two arguments instead of one that follow the operand referencing string. More specifically:
> ```
> (operand "$src.reg", 8, 4)
> ```
> In this case, 8 and 4 represents a bit range -- high bit and low bit, respectively -- to the encoded `$src.reg` operand.
> 
> Finally, a new sub-component added to the existing CodeEmitterGen TableGen backend, VarLenCodeEmitterGen, will turn the above syntax into a C++ encoder function -- `MCCodeEmitter::getBinaryCodeForInstr` -- that uses the same mechanism as the fixed-length instruction version (except few details, like it always uses APInt to store the result).
> 
> We think the proposed solution has the following advantages:
>   - Flexible and versatile in terms of expressing instruction encodings.
>   - The TableGen syntax is easy to read, write and understand.
>   - Only adds a few new TableGen syntax.
>   - Tightly integrated with the existing CodeEmitterGen.
> 
> ### Previous approaches
> Both X86 and M68k -- the only two LLVM targets with variable-length instructions -- are using custom instruction encoders. X86 leverages TSFlags in `MCInst` to carry encoding info. Simply speaking, X86 enumerates and numbers every possible combinations of operands and stores the corresponding index into a segment of TSFlags for an instruction. This approach, of course, requires none trivial amount of workforce to maintain.
> 
> M68k, on the other hand, uses an obscured infrastructure called code beads. It is conceptually similar to the VarLenCodeEmitterGen we're proposing here -- concatenating encoding fragments. Except that the syntax is bulky and it uses too many specialized TableGen infrastructures, including a separate TableGen backend, that make the maintainence really really hard.
> 
> ## Patches
> TableGen modifications: https://reviews.llvm.org/D115128 <https://reviews.llvm.org/D115128>
> 
> ## FAQ
>   - Do I need to toggle some flags -- either a command line flag or a TableGen bit field -- to use the new code emitter scheme?
>     - No, having a `dag` type `Inst` field will automatically opt-in this new code emitter scheme.
>   - Can I adopt this for fixed-length instructions?
>     - Absolutely yes. But it's not recommended because CodeEmitterGen can generate more optimal encoder functions for fixed-length instructions. The TableGen syntax of CodeEmitterGen makes more sense for fixed-length instructions, too.
>   - Can X86 adopt this infrastructure?
>     - Theoritically, yes (In practice? I dunno).
>   - What about the disassembler? Can we TableGen-enerate the corresponding disassembling functions?
>     - Since we have a structural description of the encoded instruction, it's probably easier to create a disassembler from the new TableGen syntax. But I haven't worked on that yet.
> 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20211207/19272946/attachment.html>


More information about the llvm-dev mailing list