[llvm-dev] [RFC] A new CodeEmitterGen infrastructure for variable-length instructions

Sun Dec 5 19:34:19 PST 2021

(This is a long proposal. If you prefer, here is the web version: https://gist.github.com/mshockwave/66e98d099256deefc062633909bb7b5b <https://gist.github.com/mshockwave/66e98d099256deefc062633909bb7b5b>)

## Background
CodeEmitterGen is a TableGen backend that generates instruction encoder functions for `MCCodeEmitter` from a concise TableGen syntax. It is, however, almost exclusively designed for targets that use fixed-length instructions. It's nearly impossible to use this infrastructure to describe instruction encoding scheme for ISAs with variable-length instructions, like X86 and M68k.

To have a better understanding on this problem, let's look at an example. For a fixed-length instruction ISA, developers write the following TableGen syntax to describe an instruction encoding:
```
class MyInst<(outs GR64:$dst), (ins GR64, i16imm:$imm)> : Instruction {
    bits<32> Inst;

    bits<4> dst;
    bits<16> imm;
    let Inst{31-28} = 0b1011;
    ...
    let Inst{19-16} = dst;
    let Inst{15-0} = imm;
}
```
The `Inst` field tells us the length of an instruction -- 32 bits in this case. Each bit in this field describes the encoded value, which is either a concrete value or symbolic one like `dst` and `imm` in the example above. The `dst` and `imm` variables correspond to the output operand (`$dst`) and the second input operand (`$imm`), respectively. Meaning, the encoder function (generated by CodeEmitterGen) will eventually insert the encoding for these two operands into the right bit ranges (bit 19\~16 for `dst` and 15\~0 for `imm`).

Though this TableGen syntax fits well for fixed-length instructions, it imposes some difficulties to instructions with variable length and memory poerands with complex addressing modes:
  1. The bit width of the `Inst` field is fixed. Though we can declare the field with maximum instruction size in the ISA, it requires extra code to adjust the final instruction size.
  2. Operand encoding can only be placed at fixed bit positions. However, the size of an operand in a variable-length instruction might vary.
  3. In the situation where a single logical operand is consisting of multiple MachineOperand-s / MCOperand-s, the current syntax cannot reference a sub-operand. Which means we can only reference the entire logical operand at places where we actually should put sub-operands. Making the TG code less readable and bring more burden to the operand encoding functions (because they don't know which sub-operand to encode).

In short, we need a more flexible CodeEmitterGen infrastructure for variable-length instructions: describe the instruction encoding in a "position independent" fashion and be able to reference sub-operands with ease.

## Proposal
We propose a new infrastructure, called VarLenCodeEmitterGen, to solve the aforementioned shortcomings. It is consisting of new TableGen syntax and some modifications to the existing CodeEmitterGen TableGen backend.

Suppose we are dealing with an instruction `MyVarInst`:
```
class MyMemOperand<dag sub_ops> : Operand<iPTR> {
    let MIOperandInfo = sub_ops;
}

class MyVarInst<MyMemOperand memory_op> : Instruction {
    let OutOperandList = (outs GR64:$dst);
    let InOperandList  = (ins memory_operand:$src);
}
```
It has the following encoding format:
```
15             8                                   0
----------------------------------------------------
|  Fixed bits  |  Sub-operand 0 in source operand  |
----------------------------------------------------
X                                                 16
----------------------------------------------------
|         Sub-operand 1 in source operand          |
----------------------------------------------------
                X + 4                          X + 1
                ------------------------------------
                |       Destination register       |
                ------------------------------------
```
We have two different kinds of memory operands:
```
def MemOp16 : MyMemOperand<(ops GR64:$reg, i16imm:$offset)>;
def MemOp16 : MyMemOperand<(ops GR64:$reg, i32imm:$offset)>;

def FOO16 : MyVarInst<MemOp16>;
def FOO32 : MyVarInst<MemOp32>;
```
So the size of `FOO16` and `FOO32` will be 36 and 52 bits, respectively.

To express the encoding, first, we modify `MyVarInst` and `MyMemOperand`:
```
class MyMemOperand<dag sub_ops> : Operand<iPTR> {
    let MIOperandInfo = sub_ops;
    dag Base;
    dag Extension;
}

class MyVarInst<MyMemOperand memory_op> : Instruction {
    dag Inst;

    let OutOperandList = (outs GR64:$dst);
    let InOperandList  = (ins memory_op:$src);

    let Inst = (seq
        (seq:$dec /*Fixed bits*/0b10110111, memory_op.Base),
        memory_op.Extension,
        // Destination register
        (operand "$dst", 4)
    );
}
```
Then, we use a slightly different representation for `MemOp16` and `MemOp32`:
```
class MemOp16<string op_name> : MyMemOperand<(ops GR64:$reg, i16imm:$offset)> {
    let Base = (operand "$"#op_name#".reg", 8);
    let Extension = (operand "$"#op_name#".offset", 16);
}

class MemOp32<string op_name> : MyMemOperand<(ops GR64:$reg, i32imm:$offset)> {
    let Base = (operand "$"#op_name#".reg", 8);
    let Extension = (operand "$"#op_name#".offset", 32);
}

def FOO16 : MyVarInst<MemOp16<"src">>;
def FOO32 : MyVarInst<MemOp32<"src">>;
```

This new TableGen syntax uses `dag` rather than `bits<N>` for the `Inst` field. Allowing instructions to place their operand (and sub-operand) encodings without worrying about the actual bit positions. The new syntax is underpinned by two new DAG operators: `seq` and `operand`.

The `seq` operator sequentially places its arguments -- fragments of encoding -- from LSB to MSB. If the operator is "tagged" by `$dec`, it goes from MSB to LSB instead. The `operand` operator references the encoding of an operand. Its first DAG argument is a string referencing the name of an operand in either `InOperandList` or `OutOperandList` of an instruction. We can also reference an sub-operand using syntax like `$<operand name>.<sub-operand name>`. The second DAG argument for `operand` is the bit width of the encoded operand. The other variant of `operand` is having two arguments instead of one that follow the operand referencing string. More specifically:
```
(operand "$src.reg", 8, 4)
```
In this case, 8 and 4 represents a bit range -- high bit and low bit, respectively -- to the encoded `$src.reg` operand.

Finally, a new sub-component added to the existing CodeEmitterGen TableGen backend, VarLenCodeEmitterGen, will turn the above syntax into a C++ encoder function -- `MCCodeEmitter::getBinaryCodeForInstr` -- that uses the same mechanism as the fixed-length instruction version (except few details, like it always uses APInt to store the result).

We think the proposed solution has the following advantages:
  - Flexible and versatile in terms of expressing instruction encodings.
  - The TableGen syntax is easy to read, write and understand.
  - Only adds a few new TableGen syntax.
  - Tightly integrated with the existing CodeEmitterGen.

### Previous approaches
Both X86 and M68k -- the only two LLVM targets with variable-length instructions -- are using custom instruction encoders. X86 leverages TSFlags in `MCInst` to carry encoding info. Simply speaking, X86 enumerates and numbers every possible combinations of operands and stores the corresponding index into a segment of TSFlags for an instruction. This approach, of course, requires none trivial amount of workforce to maintain.

M68k, on the other hand, uses an obscured infrastructure called code beads. It is conceptually similar to the VarLenCodeEmitterGen we're proposing here -- concatenating encoding fragments. Except that the syntax is bulky and it uses too many specialized TableGen infrastructures, including a separate TableGen backend, that make the maintainence really really hard.

## Patches
TableGen modifications: https://reviews.llvm.org/D115128

## FAQ
  - Do I need to toggle some flags -- either a command line flag or a TableGen bit field -- to use the new code emitter scheme?
    - No, having a `dag` type `Inst` field will automatically opt-in this new code emitter scheme.
  - Can I adopt this for fixed-length instructions?
    - Absolutely yes. But it's not recommended because CodeEmitterGen can generate more optimal encoder functions for fixed-length instructions. The TableGen syntax of CodeEmitterGen makes more sense for fixed-length instructions, too.
  - Can X86 adopt this infrastructure?
    - Theoritically, yes (In practice? I dunno).
  - What about the disassembler? Can we TableGen-enerate the corresponding disassembling functions?
    - Since we have a structural description of the encoded instruction, it's probably easier to create a disassembler from the new TableGen syntax. But I haven't worked on that yet.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20211206/6e86c47d/attachment.html>