[Lldb-commits] [lldb] [lldb] Add a compiler/interpreter of LLDB data formatter bytecode to lldb/examples (PR #113398)

Thu Oct 24 09:59:40 PDT 2024

================
@@ -0,0 +1,165 @@
+# A bytecode for (LLDB) data formatters
+
+## Background
+
+LLDB provides very rich customization options to display data types (see https://lldb.llvm.org/use/variable.html ). To use custom data formatters, developers typically need to edit the global `~/.lldbinit` file to make sure they are found and loaded. An example for this workflow is the `llvm/utils/lldbDataFormatters.py` script. Because of the manual configuration that is involved, this workflow doesn't scale very well. What would be nice is if developers or library authors could ship ship data formatters with their code and LLDB automatically finds them.
+
+In Swift we added the `DebugDescription` macro (see https://www.swift.org/blog/announcing-swift-6/#debugging ) that translates Swift string interpolation into LLDB summary strings, and puts them into a `.lldbsummaries` section, where LLDB can find them. This works well for simple summaries, but doesn't scale to synthetic child providers or summaries that need to perform some kind of conditional logic or computation. The logical next step would be to store full Python formatters instead of summary strings, but Python code is larger and more importantly it is potentially dangerous to just load an execute untrusted Python code in LLDB.
+
+This document describes a minimal bytecode tailored to running LLDB formatters. It defines a human-readable assembler representation for the language, an efficient binary encoding, a virtual machine for evaluating it, and format for embedding formatters into binary containers.
+
+### Goals
+
+Provide an efficient and secure encoding for data formatters that can be used as a compilation target from user-friendly representations (such as DIL, Swift DebugDescription, or NatVis).
+
+### Non-goals
+
+While humans could write the assembler syntax, making it user-friendly is not a goal.
+
+## Design of the virtual machine
+
+The LLDB formatter virtual machine uses a stack-based bytecode, comparable with DWARF expressions, but with higher-level data types and functions.
+
+The virtual machine has two stacks, a data and a control stack. The control stack is kept separate to make it easier to reason about the security aspects of the VM.
+
+### Data types
+These data types are "host" data types, in LLDB parlance.
+- _String_ (UTF-8)
+- _Int_ (64 bit)
+- _UInt_ (64 bit)
+- _Object_ (Basically an `SBValue`)
+- _Type_ (Basically an `SBType`)
+- _Selector_ (One of the predefine functions)
+
+_Object_ and _Type_ are opaque, they can only be used as a parameters of `call`.
+
+## Instruction set
+
+### Stack operations
+
+These manipulate the data stack directly.
+
+- `dup  (x -> x x)`
+- `drop (x y -> x)`
+- `pick (x ... UInt -> x ... x)`
+- `over (x y -> y)`
+- `swap (x y -> y x)`
+- `rot (x y z -> z x y)`
+
+### Control flow
+
+- `{` pushes a code block address onto the control stack
+- `}` (technically not an opcode) denotes the end of a code block
+- `if` pops a block from the control stack, if the top of the data stack is nonzero, executes it
+- `ifelse` pops two blocks from the control stack, if the top of the data stack is nonzero, executes the first, otherwise the second.
+
+### Literals for basic types
+
+- `123u ( -> UInt)` an unsigned 64-bit host integer.
+- `123 ( -> Int)` a signed 64-bit host integer.
+- `"abc" ( -> String)` a UTF-8 host string.
----------------
DavidSpickett wrote:

I don't know much about string encodings but I know someone asked the other day about printing strings in I think a non-utf8 form. Does UTF-8 work well enough as a container for the other possible encodings?

Or in other words, are there situations where an `ASCII host string` would be needed?

https://github.com/llvm/llvm-project/pull/113398