<div dir="ltr"><div dir="ltr">From an M68k point of view, I like this. I'm a little uneasy about the `operand` node though. Is it possible that there will ever be an operand type which needs multiple encodings?</div><div dir="ltr"><br></div><div dir="ltr">I also wonder whether $dec mode and slicing inputs might be better handled by separate nodes. Would something like `(seq (flip 0b0101010) (slice my_operand.Base, 4, 8))` be possible?</div><div><br></div><div>As you say, I suspect that implementing an interpreter-style disassembler generator like the fixed-length one would be fairly straightforward (and much better than the M68k disassembler implementation I provided).</div><div><br></div><div>Ricky,</div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, 6 Dec 2021 at 03:34, Min-Yih Hsu <<a href="mailto:minyihh@uci.edu">minyihh@uci.edu</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div style="overflow-wrap: break-word;">(This is a long proposal. If you prefer, here is the web version: <a href="https://gist.github.com/mshockwave/66e98d099256deefc062633909bb7b5b" target="_blank">https://gist.github.com/mshockwave/66e98d099256deefc062633909bb7b5b</a>)<div><br></div><div>## Background<br>CodeEmitterGen is a TableGen backend that generates instruction encoder functions for `MCCodeEmitter` from a concise TableGen syntax. It is, however, almost exclusively designed for targets that use fixed-length instructions. It's nearly impossible to use this infrastructure to describe the instruction encoding scheme for ISAs with variable-length instructions, like X86 and M68k.<br><br>To better understand this problem, let's look at an example. 
For a fixed-length instruction ISA, developers write the following TableGen syntax to describe an instruction encoding:<br>```<br>class MyInst<(outs GR64:$dst), (ins GR64, i16imm:$imm)> : Instruction {<br>    bits<32> Inst;<br><br>    bits<4> dst;<br>    bits<16> imm;<br>    let Inst{31-28} = 0b1011;<br>    ...<br>    let Inst{19-16} = dst;<br>    let Inst{15-0} = imm;<br>}<br>```<br>The `Inst` field tells us the length of an instruction -- 32 bits in this case. Each bit in this field describes the encoded value, which is either a concrete value or a symbolic one like `dst` and `imm` in the example above. The `dst` and `imm` variables correspond to the output operand (`$dst`) and the second input operand (`$imm`), respectively. That is, the encoder function (generated by CodeEmitterGen) will eventually insert the encoding for these two operands into the right bit ranges (bits 19-16 for `dst` and 15-0 for `imm`).<br><br>Though this TableGen syntax fits fixed-length instructions well, it poses some difficulties for instructions with variable length and for memory operands with complex addressing modes:<br>  1. The bit width of the `Inst` field is fixed. Though we can declare the field with the maximum instruction size in the ISA, it requires extra code to adjust the final instruction size.<br>  2. Operand encoding can only be placed at fixed bit positions. However, the size of an operand in a variable-length instruction might vary.<br>  3. When a single logical operand consists of multiple MachineOperand-s / MCOperand-s, the current syntax cannot reference a sub-operand, which means we can only reference the entire logical operand at places where we actually should put sub-operands. 
This makes the TG code less readable and puts more burden on the operand encoding functions (because they don't know which sub-operand to encode).<br><br>In short, we need a more flexible CodeEmitterGen infrastructure for variable-length instructions: one that describes the instruction encoding in a "position independent" fashion and can reference sub-operands with ease.<br><br>## Proposal<br>We propose a new infrastructure, called VarLenCodeEmitterGen, to solve the aforementioned shortcomings. It consists of new TableGen syntax and some modifications to the existing CodeEmitterGen TableGen backend.<br><br>Suppose we are dealing with an instruction `MyVarInst`:<br>```<br>class MyMemOperand<dag sub_ops> : Operand<iPTR> {<br>    let MIOperandInfo = sub_ops;<br>}<br><br>class MyVarInst<MyMemOperand memory_op> : Instruction {<br>    let OutOperandList = (outs GR64:$dst);<br>    let InOperandList  = (ins memory_op:$src);<br>}<br>```<br>It has the following encoding format:<br>```<br><div style="color:rgb(54,54,54);background-color:rgb(255,255,255);font-family:Menlo,Monaco,"Courier New",monospace;font-size:20px;line-height:30px;white-space:pre-wrap"><div>15             8                                   0</div><div>----------------------------------------------------</div><div>|  Fixed bits  |  Sub-operand 0 in source operand  |</div><div>----------------------------------------------------</div><div>X                                                 16</div><div>----------------------------------------------------</div><div>|         Sub-operand 1 in source operand          |</div><div>----------------------------------------------------</div><div>                X + 4                          X + 1</div><div>                ------------------------------------</div><div>                |       Destination register       |</div><div>                ------------------------------------</div></div>```<br>We have two different kinds of memory operands:<br>```<br>def 
MemOp16 : MyMemOperand<(ops GR64:$reg, i16imm:$offset)>;<br>def MemOp32 : MyMemOperand<(ops GR64:$reg, i32imm:$offset)>;<br><br>def FOO16 : MyVarInst<MemOp16>;<br>def FOO32 : MyVarInst<MemOp32>;<br>```<br>So the sizes of `FOO16` and `FOO32` will be 36 and 52 bits, respectively.<br><br>To express the encoding, we first modify `MyVarInst` and `MyMemOperand`:<br>```<br>class MyMemOperand<dag sub_ops> : Operand<iPTR> {<br>    let MIOperandInfo = sub_ops;<br>    dag Base;<br>    dag Extension;<br>}<br><br>class MyVarInst<MyMemOperand memory_op> : Instruction {<br>    dag Inst;<br><br>    let OutOperandList = (outs GR64:$dst);<br>    let InOperandList  = (ins memory_op:$src);<br><br>    let Inst = (seq<br>        (seq:$dec /*Fixed bits*/0b10110111, memory_op.Base),<br>        memory_op.Extension,<br>        // Destination register<br>        (operand "$dst", 4)<br>    );<br>}<br>```<br>Then, we use a slightly different representation for `MemOp16` and `MemOp32`:<br>```<br>class MemOp16<string op_name> : MyMemOperand<(ops GR64:$reg, i16imm:$offset)> {<br>    let Base = (operand "$"#op_name#".reg", 8);<br>    let Extension = (operand "$"#op_name#".offset", 16);<br>}<br><br>class MemOp32<string op_name> : MyMemOperand<(ops GR64:$reg, i32imm:$offset)> {<br>    let Base = (operand "$"#op_name#".reg", 8);<br>    let Extension = (operand "$"#op_name#".offset", 32);<br>}<br><br>def FOO16 : MyVarInst<MemOp16<"src">>;<br>def FOO32 : MyVarInst<MemOp32<"src">>;<br>```<br><br>This new TableGen syntax uses `dag` rather than `bits<N>` for the `Inst` field, allowing instructions to place their operand (and sub-operand) encodings without worrying about the actual bit positions. The new syntax is underpinned by two new DAG operators: `seq` and `operand`.<br><br>The `seq` operator sequentially places its arguments -- fragments of encoding -- from LSB to MSB. If the operator is "tagged" by `$dec`, it goes from MSB to LSB instead. The `operand` operator references the encoding of an operand. 
Its first DAG argument is a string referencing the name of an operand in either `InOperandList` or `OutOperandList` of an instruction. We can also reference a sub-operand using syntax like `$<operand name>.<sub-operand name>`. The second DAG argument for `operand` is the bit width of the encoded operand. Another variant of `operand` takes two arguments, instead of one, after the operand-referencing string. More specifically:<br>```<br>(operand "$src.reg", 8, 4)<br>```<br>In this case, 8 and 4 represent a bit range -- high bit and low bit, respectively -- of the encoded `$src.reg` operand.<br><br>Finally, a new sub-component added to the existing CodeEmitterGen TableGen backend, VarLenCodeEmitterGen, will turn the above syntax into a C++ encoder function -- `MCCodeEmitter::getBinaryCodeForInstr` -- that uses the same mechanism as the fixed-length instruction version (except for a few details, such as always using APInt to store the result).<br><br>We think the proposed solution has the following advantages:<br>  - Flexible and versatile in terms of expressing instruction encodings.<br>  - The TableGen syntax is easy to read, write and understand.<br>  - Only adds a small amount of new TableGen syntax.<br>  - Tightly integrated with the existing CodeEmitterGen.<br><br>### Previous approaches<br>Both X86 and M68k -- the only two LLVM targets with variable-length instructions -- use custom instruction encoders. X86 leverages TSFlags in `MCInstrDesc` to carry encoding info. Simply speaking, X86 enumerates and numbers every possible combination of operands and stores the corresponding index into a segment of TSFlags for an instruction. This approach, of course, requires a non-trivial amount of effort to maintain.<br><br>M68k, on the other hand, uses an obscure infrastructure called code beads. It is conceptually similar to the VarLenCodeEmitterGen we're proposing here -- concatenating encoding fragments. 
However, its syntax is bulky and it relies on too many specialized TableGen facilities, including a separate TableGen backend, which makes maintenance very hard.<br><br>## Patches<br>TableGen modifications: <a href="https://reviews.llvm.org/D115128" target="_blank">https://reviews.llvm.org/D115128</a><br><br>## FAQ<br>  - Do I need to toggle some flags -- either a command line flag or a TableGen bit field -- to use the new code emitter scheme?<br>    - No, having a `dag` type `Inst` field automatically opts in to this new code emitter scheme.<br>  - Can I adopt this for fixed-length instructions?<br>    - Absolutely yes. But it's not recommended because CodeEmitterGen can generate more optimal encoder functions for fixed-length instructions. The TableGen syntax of CodeEmitterGen makes more sense for fixed-length instructions, too.<br>  - Can X86 adopt this infrastructure?<br>    - Theoretically, yes (In practice? I dunno).<br>  - What about the disassembler? Can we TableGen-generate the corresponding disassembling functions?<br>    - Since we have a structural description of the encoded instruction, it's probably easier to create a disassembler from the new TableGen syntax. But I haven't worked on that yet.<br><br></div></div></blockquote></div></div>
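<div><br></div><div>To make the `seq` / `operand` layout rules in the quoted proposal concrete, here is a minimal Python sketch of how the fragments might be concatenated. It is not LLVM code and not what VarLenCodeEmitterGen generates: the helper names (`fixed`, `operand`, `seq`, `encode`, `foo`) are hypothetical, and treating `$dec` as "the first argument lands in the highest bits" is only my reading of the description.</div>
<pre>
```python
# Hypothetical model of the proposal's `seq` and `operand` nodes (assumption,
# not LLVM code). Fragments are plain tuples; integers stand in for APInt.

def fixed(value, width):
    """A literal fragment of `width` bits."""
    return ("fixed", value, width)

def operand(name, width):
    """A symbolic fragment: `width` bits of the encoded operand `name`."""
    return ("operand", name, width)

def seq(*fragments, dec=False):
    """Place fragments from LSB to MSB, or from MSB to LSB when dec=True."""
    ordered = list(reversed(fragments)) if dec else list(fragments)
    return ("seq", ordered)

def encode(node, values):
    """Return (bits, width) for a fragment tree, given encoded operand values."""
    kind = node[0]
    if kind == "fixed":
        return node[1], node[2]
    if kind == "operand":
        _, name, width = node
        # Keep only the low `width` bits of the operand's encoding.
        return values[name] % (2 ** width), width
    # kind == "seq": each child lands just above the bits placed so far.
    bits, width = 0, 0
    for child in node[1]:
        child_bits, child_width = encode(child, values)
        bits += child_bits * (2 ** width)
        width += child_width
    return bits, width

def foo(offset_width):
    """Layout of the FOO16 / FOO32 example, parameterised by offset width."""
    return seq(
        seq(fixed(0b10110111, 8), operand("$src.reg", 8), dec=True),
        operand("$src.offset", offset_width),
        operand("$dst", 4),
    )

values = {"$src.reg": 0x12, "$src.offset": 0xBEEF, "$dst": 0x5}
bits16, width16 = encode(foo(16), values)
bits32, width32 = encode(foo(32), values)
```
</pre>
<div>Under these assumptions the computed widths are 36 and 52 bits, matching the sizes quoted for `FOO16` and `FOO32`, and the fixed byte ends up in bits 15-8 with `$src.reg` in bits 7-0, as in the encoding diagram.</div>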