[llvm-dev] [RFC] A new CodeEmitterGen infrastructure for variable-length instructions

Tue Dec 7 06:14:40 PST 2021

On Tue, 7 Dec 2021 at 11:54, Min-Yih Hsu <minyihh at uci.edu> wrote:

>
> Thanks for the feedbacks!
>
> On Dec 7, 2021, at 6:07 PM, Ricky Taylor <rickytaylor26 at gmail.com> wrote:
>
> From an M68k point-of-view, I like this. I'm a little uneasy about the
> `operand` node though. Is it possible that there will ever be an operand
> type which needs multiple encodings?
>
>
> Do you mean an operand or sub-operand might have different encodings
> depend on its context / bit positions? I don’t know about other (future)
> architectures, M68k doesn’t have this issue: Each memory operand is
> consisting of (multiple) primitive construction(s) like registers and
> immediate, that is, sub-operands. While a sub-operand might be placed in
> arbitrary positions, each sub-operand has a fixed encoding.
>
Yes, I was thinking about the solution in a general sense. For M68k we're
good either way. :)

>
> I can think of a solution for operands / sub-operands whose encodings
> depend on its context: passing custom encoder function name into `operand`.
> So if you write:
> ```
> (operand “$foo”, 4, “encodeFoo”)
> ```
> The foo operand at this particular bit position will be encoded by
> `encodeFoo`.
> (The current CodeEmitterGen has a similar stuff — via the `EncoderMethod`
> field in each operand. But that one is bit-position-invariant. So an
> operand can only use a single custom encoder regardless of its context /
> bit-position)
> BTW, this again shows the advantage of using DAG for carrying encoding
> info: It’s easy to augment with new parameters / features.
>
This sounds good to me. (And I very much agree, using a DAG for this stuff
is very convenient.)

> I also wonder whether $dec mode and slicing inputs might be better handled
> by separate nodes. Would something like `(seq (flip 0b0101010) (slice
> my_operand.Base, 4, 8))`, be possible?
>
>
> IIUC, you’re proposing to replace`(seq (seq:$dec 0b0101010), (operand “…”,
> 4, 8))` with the new syntax.
>
> I think it’s a good idea because then we’re not overloading a single DAG
> operator. It’s trivial to change, too.
>
I guess you could allow custom encoding with something like `(encode
"SpecialMode", "$foo")` (which could call `encodeSpecialMode`, or similar).

Either way, I'm pretty satisfied. :)

>
> Nevertheless, I haven’t implemented `seq:$dec` for fixed bits value so
> actually you can’t write `(seq:$dec 0b0101010)` (This is not hard to
> implement though).
>
>
> As you say, I suspect that implementing an interpreter-style disassembler
> generator like the fixed length one would be fairly straight-forward (and
> much better than the M68k disassembler implementation I provided).
>
>
> Good to hear!
>
> Cheers,
> -Min
>
>
> Ricky,
>
> On Mon, 6 Dec 2021 at 03:34, Min-Yih Hsu <minyihh at uci.edu> wrote:
>
>> (This is a long proposal. If you prefer, here is the web version:
>> https://gist.github.com/mshockwave/66e98d099256deefc062633909bb7b5b)
>>
>> ## Background
>> CodeEmitterGen is a TableGen backend that generates instruction encoder
>> functions for `MCCodeEmitter` from a concise TableGen syntax. It is,
>> however, almost exclusively designed for targets that use fixed-length
>> instructions. It's nearly impossible to use this infrastructure to describe
>> instruction encoding scheme for ISAs with variable-length instructions,
>> like X86 and M68k.
>>
>> To have a better understanding on this problem, let's look at an example.
>> For a fixed-length instruction ISA, developers write the following TableGen
>> syntax to describe an instruction encoding:
>> ```
>> class MyInst<(outs GR64:$dst), (ins GR64, i16imm:$imm)> : Instruction {
>>     bits<32> Inst;
>>
>>     bits<4> dst;
>>     bits<16> imm;
>>     let Inst{31-28} = 0b1011;
>>     ...
>>     let Inst{19-16} = dst;
>>     let Inst{15-0} = imm;
>> }
>> ```
>> The `Inst` field tells us the length of an instruction -- 32 bits in this
>> case. Each bit in this field describes the encoded value, which is either a
>> concrete value or symbolic one like `dst` and `imm` in the example above.
>> The `dst` and `imm` variables correspond to the output operand (`$dst`) and
>> the second input operand (`$imm`), respectively. Meaning, the encoder
>> function (generated by CodeEmitterGen) will eventually insert the encoding
>> for these two operands into the right bit ranges (bit 19\~16 for `dst` and
>> 15\~0 for `imm`).
>>
>> Though this TableGen syntax fits well for fixed-length instructions, it
>> imposes some difficulties to instructions with variable length and memory
>> poerands with complex addressing modes:
>>   1. The bit width of the `Inst` field is fixed. Though we can declare
>> the field with maximum instruction size in the ISA, it requires extra code
>> to adjust the final instruction size.
>>   2. Operand encoding can only be placed at fixed bit positions. However,
>> the size of an operand in a variable-length instruction might vary.
>>   3. In the situation where a single logical operand is consisting of
>> multiple MachineOperand-s / MCOperand-s, the current syntax cannot
>> reference a sub-operand. Which means we can only reference the entire
>> logical operand at places where we actually should put sub-operands. Making
>> the TG code less readable and bring more burden to the operand encoding
>> functions (because they don't know which sub-operand to encode).
>>
>> In short, we need a more flexible CodeEmitterGen infrastructure for
>> variable-length instructions: describe the instruction encoding in a
>> "position independent" fashion and be able to reference sub-operands with
>> ease.
>>
>> ## Proposal
>> We propose a new infrastructure, called VarLenCodeEmitterGen, to solve
>> the aforementioned shortcomings. It is consisting of new TableGen syntax
>> and some modifications to the existing CodeEmitterGen TableGen backend.
>>
>> Suppose we are dealing with an instruction `MyVarInst`:
>> ```
>> class MyMemOperand<dag sub_ops> : Operand<iPTR> {
>>     let MIOperandInfo = sub_ops;
>> }
>>
>> class MyVarInst<MyMemOperand memory_op> : Instruction {
>>     let OutOperandList = (outs GR64:$dst);
>>     let InOperandList  = (ins memory_operand:$src);
>> }
>> ```
>> It has the following encoding format:
>> ```
>> 15 8 0
>> ----------------------------------------------------
>> | Fixed bits | Sub-operand 0 in source operand |
>> ----------------------------------------------------
>> X 16
>> ----------------------------------------------------
>> | Sub-operand 1 in source operand |
>> ----------------------------------------------------
>> X + 4 X + 1
>> ------------------------------------
>> | Destination register |
>> ------------------------------------
>> ```
>> We have two different kinds of memory operands:
>> ```
>> def MemOp16 : MyMemOperand<(ops GR64:$reg, i16imm:$offset)>;
>> def MemOp16 : MyMemOperand<(ops GR64:$reg, i32imm:$offset)>;
>>
>> def FOO16 : MyVarInst<MemOp16>;
>> def FOO32 : MyVarInst<MemOp32>;
>> ```
>> So the size of `FOO16` and `FOO32` will be 36 and 52 bits, respectively.
>>
>> To express the encoding, first, we modify `MyVarInst` and `MyMemOperand`:
>> ```
>> class MyMemOperand<dag sub_ops> : Operand<iPTR> {
>>     let MIOperandInfo = sub_ops;
>>     dag Base;
>>     dag Extension;
>> }
>>
>> class MyVarInst<MyMemOperand memory_op> : Instruction {
>>     dag Inst;
>>
>>     let OutOperandList = (outs GR64:$dst);
>>     let InOperandList  = (ins memory_op:$src);
>>
>>     let Inst = (seq
>>         (seq:$dec /*Fixed bits*/0b10110111, memory_op.Base),
>>         memory_op.Extension,
>>         // Destination register
>>         (operand "$dst", 4)
>>     );
>> }
>> ```
>> Then, we use a slightly different representation
>> for `MemOp16` and `MemOp32`:
>> ```
>> class MemOp16<string op_name> : MyMemOperand<(ops GR64:$reg,
>> i16imm:$offset)> {
>>     let Base = (operand "$"#op_name#".reg", 8);
>>     let Extension = (operand "$"#op_name#".offset", 16);
>> }
>>
>> class MemOp32<string op_name> : MyMemOperand<(ops GR64:$reg,
>> i32imm:$offset)> {
>>     let Base = (operand "$"#op_name#".reg", 8);
>>     let Extension = (operand "$"#op_name#".offset", 32);
>> }
>>
>> def FOO16 : MyVarInst<MemOp16<"src">>;
>> def FOO32 : MyVarInst<MemOp32<"src">>;
>> ```
>>
>> This new TableGen syntax uses `dag` rather than `bits<N>` for
>> the `Inst` field. Allowing instructions to place their operand (and
>> sub-operand) encodings without worrying about the actual bit positions. The
>> new syntax is underpinned by two new DAG operators: `seq` and `operand`.
>>
>> The `seq` operator sequentially places its arguments -- fragments of
>> encoding -- from LSB to MSB. If the operator is "tagged" by `$dec`, it goes
>> from MSB to LSB instead. The `operand` operator references the encoding of
>> an operand. Its first DAG argument is a string referencing the name of an
>> operand in either `InOperandList` or `OutOperandList` of an instruction. We
>> can also reference an sub-operand using syntax like `$<operand
>> name>.<sub-operand name>`. The second DAG argument for `operand` is the bit
>> width of the encoded operand. The other variant of `operand` is having two
>> arguments instead of one that follow the operand referencing string. More
>> specifically:
>> ```
>> (operand "$src.reg", 8, 4)
>> ```
>> In this case, 8 and 4 represents a bit range -- high bit and low bit,
>> respectively -- to the encoded `$src.reg` operand.
>>
>> Finally, a new sub-component added to the existing CodeEmitterGen
>> TableGen backend, VarLenCodeEmitterGen, will turn the above syntax into a
>> C++ encoder function -- `MCCodeEmitter::getBinaryCodeForInstr` -- that uses
>> the same mechanism as the fixed-length instruction version (except few
>> details, like it always uses APInt to store the result).
>>
>> We think the proposed solution has the following advantages:
>>   - Flexible and versatile in terms of expressing instruction encodings.
>>   - The TableGen syntax is easy to read, write and understand.
>>   - Only adds a few new TableGen syntax.
>>   - Tightly integrated with the existing CodeEmitterGen.
>>
>> ### Previous approaches
>> Both X86 and M68k -- the only two LLVM targets with variable-length
>> instructions -- are using custom instruction encoders. X86 leverages
>> TSFlags in `MCInst` to carry encoding info. Simply speaking, X86 enumerates
>> and numbers every possible combinations of operands and stores the
>> corresponding index into a segment of TSFlags for an instruction. This
>> approach, of course, requires none trivial amount of workforce to maintain.
>>
>> M68k, on the other hand, uses an obscured infrastructure called code
>> beads. It is conceptually similar to the VarLenCodeEmitterGen we're
>> proposing here -- concatenating encoding fragments. Except that the syntax
>> is bulky and it uses too many specialized TableGen infrastructures,
>> including a separate TableGen backend, that make the maintainence really
>> really hard.
>>
>> ## Patches
>> TableGen modifications: https://reviews.llvm.org/D115128
>>
>> ## FAQ
>>   - Do I need to toggle some flags -- either a command line flag or a
>> TableGen bit field -- to use the new code emitter scheme?
>>     - No, having a `dag` type `Inst` field will automatically opt-in this
>> new code emitter scheme.
>>   - Can I adopt this for fixed-length instructions?
>>     - Absolutely yes. But it's not recommended because CodeEmitterGen can
>> generate more optimal encoder functions for fixed-length instructions. The
>> TableGen syntax of CodeEmitterGen makes more sense for fixed-length
>> instructions, too.
>>   - Can X86 adopt this infrastructure?
>>     - Theoritically, yes (In practice? I dunno).
>>   - What about the disassembler? Can we TableGen-enerate the
>> corresponding disassembling functions?
>>     - Since we have a structural description of the encoded instruction,
>> it's probably easier to create a disassembler from the new TableGen syntax.
>> But I haven't worked on that yet.
>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20211207/515d912c/attachment-0001.html>