[llvm] [TableGen][DecoderEmitter] Add option to emit type-specialized code (PR #146593)
Rahul Joshi via llvm-commits
llvm-commits at lists.llvm.org
Wed Aug 20 13:08:39 PDT 2025
jurahul wrote:
Ok, I finally have some perf data. I tested two configurations: in the first, a single `decodeInstructionImpl` function operates on the highest bitwidth and all the entry points call it (so less code size); in the second, `decodeInstructionImpl` is specialized for each bitwidth (more code size). I ran both against the AMDGPU and RISCV mc-disassembler tests (~250 tests). Note that in either case the final call to `decodeToMCInst` still uses the actual bitwidth, so the difference is whether the `OPC_ExtractField` and `OPC_CheckField` ops in the decoder tables operate at the highest bitwidth or are specialized for the actual bitwidth.
Ignoring any outliers, I see up to a 10% perf regression if we always use the highest bitwidth. Here's an example:
```
for i in {0..10}; do ./build/bin/llvm-mc -triple=amdgcn -mcpu=gfx950 -disassemble -show-encoding --runs=3000 --time-trace --time-trace-file=/tmp/xyz -o /dev/null < /extra_drive/upstream-llvm/llvm-project/llvm/test/MC/Disassembler/AMDGPU/gfx950_dasm_xdlops.txt && python3 -m json.tool /tmp/xyz | grep -A 3 "Total getInstruction" | grep "avg us"; done
"avg us": 527
"avg us": 516
"avg us": 520
"avg us": 517
"avg us": 521
"avg us": 525
"avg us": 518
"avg us": 508
"avg us": 509
"avg us": 509
"avg us": 509
```
vs
```
for i in {0..10}; do ./build/bin/llvm-mc -triple=amdgcn -mcpu=gfx950 -disassemble -show-encoding --runs=3000 --time-trace --time-trace-file=/tmp/xyz -o /dev/null < /extra_drive/upstream-llvm/llvm-project/llvm/test/MC/Disassembler/AMDGPU/gfx950_dasm_xdlops.txt && python3 -m json.tool /tmp/xyz | grep -A 3 "Total getInstruction" | grep "avg us"; done
"avg us": 479
"avg us": 481
"avg us": 478
"avg us": 479
"avg us": 479
"avg us": 478
"avg us": 476
"avg us": 470
"avg us": 456
"avg us": 470
"avg us": 464
```
These numbers were obtained by changing the TimeProfiler to report microseconds instead of milliseconds. Based on this data, I am concluding that we should prefer decode speed over code size and go with a specialized `decodeInstruction` per bitwidth. @topperc and @s-barannikov, please let me know if this makes sense. I can then work on addressing the outstanding comments and resurrect this for another round of reviews.
https://github.com/llvm/llvm-project/pull/146593