[Lldb-commits] [lldb] [lldb] Add a compiler/interpreter of LLDB data formatter bytecode to lldb/examples (PR #113398)

Tue Oct 29 16:38:24 PDT 2024

https://github.com/adrian-prantl updated https://github.com/llvm/llvm-project/pull/113398

>From 0b88de8fe7f5e4ea7f4089be403b271cd7b086d0 Mon Sep 17 00:00:00 2001
From: Adrian Prantl <aprantl at apple.com>
Date: Tue, 22 Oct 2024 16:29:50 -0700
Subject: [PATCH 1/2] Add a compiler/interpreter of LLDB data formatter
 bytecode to examples

---
 lldb/docs/index.rst                           |   1 +
 lldb/docs/resources/formatterbytecode.rst     | 221 ++++++++
 lldb/examples/formatter-bytecode/Makefile     |   8 +
 lldb/examples/formatter-bytecode/compiler.py  | 486 ++++++++++++++++++
 .../formatter-bytecode/test/MyOptional.cpp    |  23 +
 .../formatter-bytecode/test/formatter.py      | 131 +++++
 6 files changed, 870 insertions(+)
 create mode 100644 lldb/docs/resources/formatterbytecode.rst
 create mode 100644 lldb/examples/formatter-bytecode/Makefile
 create mode 100644 lldb/examples/formatter-bytecode/compiler.py
 create mode 100644 lldb/examples/formatter-bytecode/test/MyOptional.cpp
 create mode 100644 lldb/examples/formatter-bytecode/test/formatter.py

diff --git a/lldb/docs/index.rst b/lldb/docs/index.rst
index e2c15d872b4be2..c3c2fb36eb541a 100644
--- a/lldb/docs/index.rst
+++ b/lldb/docs/index.rst
@@ -164,6 +164,7 @@ interesting areas to contribute to lldb.
    resources/fuzzing
    resources/sbapi
    resources/dataformatters
+   resources/formatterbytecode
    resources/extensions
    resources/lldbgdbremote
    resources/lldbplatformpackets
diff --git a/lldb/docs/resources/formatterbytecode.rst b/lldb/docs/resources/formatterbytecode.rst
new file mode 100644
index 00000000000000..e6a52a2bdddcdb
--- /dev/null
+++ b/lldb/docs/resources/formatterbytecode.rst
@@ -0,0 +1,221 @@
+Formatter Bytecode
+==================
+
+Background
+----------
+
+LLDB provides very rich customization options to display data types (see :doc:`/use/variable/`). To use custom data formatters, developers need to edit the global `~/.lldbinit` file to make sure they are found and loaded. In addition to this rather manual workflow, developers or library authors can ship data formatters with their code in a format that allows LLDB to automatically find them and run them securely.
+
+An end-to-end example of such a workflow is the Swift `DebugDescription` macro (see https://www.swift.org/blog/announcing-swift-6/#debugging ) that translates Swift string interpolation into LLDB summary strings, and puts them into a `.lldbsummaries` section, where LLDB can find them.
+
+This document describes a minimal bytecode tailored to running LLDB formatters. It defines a human-readable assembler representation for the language, an efficient binary encoding, a virtual machine for evaluating it, and a format for embedding formatters into binary containers.
+
+Goals
+~~~~~
+
+Provide an efficient and secure encoding for data formatters that can be used as a compilation target from user-friendly representations (such as DIL, Swift DebugDescription, or NatVis).
+
+Non-goals
+~~~~~~~~~
+
+While humans could write the assembler syntax, making it user-friendly is not a goal.
+
+Design of the virtual machine
+-----------------------------
+
+The LLDB formatter virtual machine uses a stack-based bytecode, comparable with DWARF expressions, but with higher-level data types and functions.
+
+The virtual machine has two stacks, a data and a control stack. The control stack is kept separate to make it easier to reason about the security aspects of the virtual machine.
+
+Data types
+~~~~~~~~~~
+
+All objects on the data stack must have one of the following data types. These data types are "host" data types, in LLDB parlance.
+
+* *String* (UTF-8)
+* *Int* (64 bit)
+* *UInt* (64 bit)
+* *Object* (Basically an `SBValue`)
+* *Type* (Basically an `SBType`)
+* *Selector* (One of the predefined functions)
+
+*Object* and *Type* are opaque; they can only be used as parameters of `call`.
+
+Instruction set
+---------------
+
+Stack operations
+~~~~~~~~~~~~~~~~
+
+These instructions manipulate the data stack directly.
+
+========  ==========  ===========================
+ Opcode    Mnemonic    Stack effect              
+--------  ----------  ---------------------------
+ 0x01      `dup`       `(x -> x x)`
+ 0x02      `drop`      `(x y -> x)`
+ 0x03      `pick`      `(x ... UInt -> x ... x)`
+ 0x04      `over`      `(x y -> x y x)`
+ 0x05      `swap`      `(x y -> y x)`
+ 0x06      `rot`       `(x y z -> z x y)`
+========  ==========  ===========================
+
+Control flow
+~~~~~~~~~~~~
+
+These manipulate the control stack and program counter.
+
+========  ==========  ============================================================
+ Opcode    Mnemonic    Description              
+--------  ----------  ------------------------------------------------------------
+ 0x10       `{`        push a code block address onto the control stack
+  --        `}`        (technically not an opcode) syntax for end of code block
+ 0x11      `if`        pop a block from the control stack,
+                       if the top of the data stack is nonzero, execute it
+ 0x12      `ifelse`    pop two blocks from the control stack, if the top of
+                       the data stack is nonzero, execute the first,
+                       otherwise the second.
+========  ==========  ============================================================
+
+Literals for basic types
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+========  ===========  ============================================================
+ Opcode    Mnemonic    Description              
+--------  -----------  ------------------------------------------------------------
+ 0x20      `123u`      `( -> UInt)` push an unsigned 64-bit host integer
+ 0x21      `123`       `( -> Int)` push a signed 64-bit host integer
+ 0x22      `"abc"`     `( -> String)` push a UTF-8 host string
+ 0x23      `@strlen`   `( -> Selector)` push one of the predefined function
+                       selectors. See `call`.
+========  ===========  ============================================================
+
+Arithmetic, logic, and comparison operations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+========  ==========  ===========================
+ Opcode    Mnemonic    Stack effect              
+--------  ----------  ---------------------------
+ 0x30      `+`         `(x y -> [x+y])`
+ 0x31      `-`          etc ...
+ 0x32      `*`
+ 0x33      `/`
+ 0x34      `%`
+ 0x35      `<<`
+ 0x36      `>>`
+ 0x37      `shra`      (arithmetic shift right)
+ 0x40      `&`
+ 0x41      `|`
+ 0x42      `^`
+ 0x43      `~`
+ 0x50      `=`
+ 0x51      `!=`
+ 0x52      `<`
+ 0x53      `>`
+ 0x54      `=<`
+ 0x55      `>=`
+========  ==========  ===========================
+
+Function calls
+~~~~~~~~~~~~~~
+
+For security reasons the list of functions callable with `call` is predefined. The supported functions are either existing methods on `SBValue`, or string formatting operations.
+
+========  ==========  ============================================
+ Opcode    Mnemonic    Stack effect              
+--------  ----------  --------------------------------------------
+ 0x60      `call`      `(Object argN ... arg0 Selector -> retval)`
+========  ==========  ============================================
+
+The *Selector* operand of `call` must be one of the following predefined values.
+
+====  ============================  ===================================================  ==================================
+Sel.  Mnemonic                      Stack Effect                                         Description
+----  ----------------------------  ---------------------------------------------------  ----------------------------------
+0x00  `summary`                     `(Object @summary -> String)`                        `SBValue::GetSummary`
+0x01  `type_summary`                `(Object @type_summary -> String)`                   `SBValue::GetTypeSummary`
+0x10  `get_num_children`            `(Object @get_num_children -> UInt)`                 `SBValue::GetNumChildren`
+0x11  `get_child_at_index`          `(Object UInt @get_child_at_index -> Object)`        `SBValue::GetChildAtIndex`
+0x12  `get_child_with_name`         `(Object String @get_child_with_name -> Object)`     `SBValue::GetChildMemberWithName`
+0x13  `get_child_index`             `(Object String @get_child_index -> UInt)`           `SBValue::GetIndexOfChildWithName`
+0x15  `get_type`                    `(Object @get_type -> Type)`                         `SBValue::GetType`
+0x16  `get_template_argument_type`  `(Type UInt @get_template_argument_type -> Type)`    `SBType::GetTemplateArgumentType`
+0x17  `cast`                        `(Object Type @cast -> Object)`                      `SBValue::Cast`
+0x20  `get_value`                   `(Object @get_value -> Object)`                      `SBValue::GetValue`
+0x21  `get_value_as_unsigned`       `(Object @get_value_as_unsigned -> UInt)`            `SBValue::GetValueAsUnsigned`
+0x22  `get_value_as_signed`         `(Object @get_value_as_signed -> Int)`               `SBValue::GetValueAsSigned`
+0x23  `get_value_as_address`        `(Object @get_value_as_address -> UInt)`             `SBValue::GetValueAsAddress`
+0x40  `read_memory_byte`            `(UInt @read_memory_byte -> UInt)`                   `Target::ReadMemory`
+0x41  `read_memory_uint32`          `(UInt @read_memory_uint32 -> UInt)`                 `Target::ReadMemory`
+0x42  `read_memory_int32`           `(UInt @read_memory_int32 -> Int)`                   `Target::ReadMemory`
+0x43  `read_memory_uint64`          `(UInt @read_memory_uint64 -> UInt)`                 `Target::ReadMemory`
+0x44  `read_memory_int64`           `(UInt @read_memory_int64 -> Int)`                   `Target::ReadMemory`
+0x45  `read_memory_address`         `(UInt @read_memory_address -> UInt)`                `Target::ReadMemory`
+0x46  `read_memory`                 `(UInt Type @read_memory -> Object)`                 `Target::ReadMemory`
+0x50  `fmt`                         `(String arg0 ... @fmt -> String)`                   `llvm::format`
+0x51  `sprintf`                     `(String arg0 ... @sprintf -> String)`               `sprintf`
+0x52  `strlen`                      `(String @strlen -> UInt)`                           `strlen` in bytes
+====  ============================  ===================================================  ==================================
+
+Byte Code
+~~~~~~~~~
+
+Most instructions are just a single byte opcode. The only exceptions are the literals:
+
+* *String*: Length in bytes encoded as ULEB128, followed by that many bytes
+* *Int*: LEB128
+* *UInt*: ULEB128
+* *Selector*: ULEB128
+
+Embedding
+~~~~~~~~~
+
+Expression programs are embedded into an `.lldbformatters` section (an evolution of the Swift `.lldbsummaries` section) that is a dictionary of type names/regexes and descriptions. It consists of a list of records. Each record starts with the following header:
+
+* Version number (ULEB128)
+* Remaining size of the record (minus the header) (ULEB128)
+
+The version number is increased whenever an incompatible change is made. Adding new opcodes is not an incompatible change since consumers can unambiguously detect this and report an error.
+
+Space between two records may be padded with NULL bytes.
+
+In version 1, a record consists of a dictionary key, which is a type name or regex.
+
+* Length of the key in bytes (ULEB128)
+* The key (UTF-8)
+
+A regex has to start with `^`, which is part of the regular expression.
+
+This is followed by one or more dictionary values that immediately follow each other and entirely fill out the record size from the header. Each expression program has the following layout:
+
+* Function signature (1 byte)
+* Length of the program (ULEB128)
+* The program bytecode
+
+The possible function signatures are:
+
+=========  ====================== ==========================
+Signature    Mnemonic             Stack Effect
+---------  ---------------------- --------------------------
+  0x00     `@summary`             `(Object -> String)`
+  0x01     `@init`                `(Object -> Object+)`
+  0x02     `@get_num_children`    `(Object+ -> UInt)`
+  0x03     `@get_child_index`     `(Object+ String -> UInt)`
+  0x04     `@get_child_at_index`  `(Object+ UInt -> Object)`
+  0x05     `@get_value`           `(Object+ -> Object)`
+=========  ====================== ==========================
+
+If not specified, the init function defaults to an empty function that just passes the Object along. Its result may be cached, which lets common preparation work for an Object be done once and reused by subsequent calls to the other methods. This way subsequent calls to `@get_child_at_index` can avoid recomputing shared information, for example.
+
+While it is more efficient to store multiple programs per type key, this is not a requirement. LLDB will merge all entries. If there are conflicts the result is undefined.
+
+Execution model
+~~~~~~~~~~~~~~~
+
+Execution begins at the first byte in the program. The program counter of the virtual machine starts at offset 0 of the bytecode and may never move outside the range of the program as defined in the header. The data stack starts with one Object or the result of the `@init` function (`Object+` in the table above).
+
+Error handling
+~~~~~~~~~~~~~~
+
+In version 1, errors are unrecoverable: the entire expression will fail if any kind of error is encountered.
+
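A note on the Embedding section of formatterbytecode.rst above: the record layout (ULEB128 header, length-prefixed key, then signature/length/program triples) can be sketched in a few lines of Python. This is only an illustration under stated assumptions. The helper names `uleb128` and `encode_record` and the constant `SIG_SUMMARY` are hypothetical and are not part of this patch or of LLDB; the type key is borrowed from test/formatter.py and the program bytes are a hand-assembled `"None"` string literal.

```python
from __future__ import annotations

# Hypothetical helpers sketching the version-1 record layout; none of these
# names exist in LLDB or in compiler.py.

SIG_SUMMARY = 0x00  # signature byte for @summary, taken from the table above


def uleb128(value: int) -> bytes:
    """Encode a non-negative integer as ULEB128 (7 payload bits per byte)."""
    if value < 0:
        raise ValueError("ULEB128 cannot encode negative values")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_record(
    key: str, programs: list[tuple[int, bytes]], version: int = 1
) -> bytes:
    """Build one record: header, key, then (signature, program) entries."""
    key_bytes = key.encode("utf-8")
    body = bytearray(uleb128(len(key_bytes)) + key_bytes)
    for signature, program in programs:
        body.append(signature)         # function signature (1 byte)
        body += uleb128(len(program))  # length of the program
        body += program                # the program bytecode
    # Header: version number, then the size of everything after the header.
    return uleb128(version) + uleb128(len(body)) + bytes(body)


# Program: opcode 0x22 (string literal), length 4, then the bytes of "None".
program = bytes([0x22, 4]) + b"None"
record = encode_record("^MyOptional<.+>$", [(SIG_SUMMARY, program)])
print(record.hex())
```

Because the header carries the remaining size, a consumer that does not understand a record's version can still skip over it, which appears to be the point of putting the size directly after the version number.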
diff --git a/lldb/examples/formatter-bytecode/Makefile b/lldb/examples/formatter-bytecode/Makefile
new file mode 100644
index 00000000000000..f544fea9d3f28d
--- /dev/null
+++ b/lldb/examples/formatter-bytecode/Makefile
@@ -0,0 +1,8 @@
+all: test
+
+.PHONY: test
+test:
+	python3 compiler.py
+	mkdir -p _test
+	clang++ -std=c++17 test/MyOptional.cpp -g -o _test/MyOptional
+	lldb _test/MyOptional -o "command script import test/formatter.py" -o "b -p here" -o "r" -o "v x" -o "v y" -o q
diff --git a/lldb/examples/formatter-bytecode/compiler.py b/lldb/examples/formatter-bytecode/compiler.py
new file mode 100644
index 00000000000000..2780d1129f933a
--- /dev/null
+++ b/lldb/examples/formatter-bytecode/compiler.py
@@ -0,0 +1,486 @@
+"""
+Specification, compiler, disassembler, and interpreter
+for LLDB dataformatter bytecode.
+
+See https://lldb.llvm.org/resources/formatterbytecode.html for more details.
+"""
+from __future__ import annotations
+
+# Types
+type_String = 1
+type_Int = 2
+type_UInt = 3
+type_Object = 4
+type_Type = 5
+
+# Opcodes
+opcode = dict()
+
+
+def define_opcode(n, mnemonic, name):
+    globals()["op_" + name] = n
+    if mnemonic:
+        opcode[mnemonic] = n
+    opcode[n] = mnemonic
+
+
+define_opcode(1, "dup", "dup")
+define_opcode(2, "drop", "drop")
+define_opcode(3, "pick", "pick")
+define_opcode(4, "over", "over")
+define_opcode(5, "swap", "swap")
+define_opcode(6, "rot", "rot")
+
+define_opcode(0x10, "{", "begin")
+define_opcode(0x11, "if", "if")
+define_opcode(0x12, "ifelse", "ifelse")
+
+define_opcode(0x20, None, "lit_uint")
+define_opcode(0x21, None, "lit_int")
+define_opcode(0x22, None, "lit_string")
+define_opcode(0x23, None, "lit_selector")
+
+define_opcode(0x30, "+", "plus")
+define_opcode(0x31, "-", "minus")
+define_opcode(0x32, "*", "mul")
+define_opcode(0x33, "/", "div")
+define_opcode(0x34, "%", "mod")
+define_opcode(0x35, "<<", "shl")
+define_opcode(0x36, ">>", "shr")
+define_opcode(0x37, "shra", "shra")
+
+define_opcode(0x40, "&", "and")
+define_opcode(0x41, "|", "or")
+define_opcode(0x42, "^", "xor")
+define_opcode(0x43, "~", "not")
+
+define_opcode(0x50, "=", "eq")
+define_opcode(0x51, "!=", "neq")
+define_opcode(0x52, "<", "lt")
+define_opcode(0x53, ">", "gt")
+define_opcode(0x54, "=<", "le")
+define_opcode(0x55, ">=", "ge")
+
+define_opcode(0x60, "call", "call")
+
+# Function signatures
+sig_summary = 0
+sig_init = 1
+sig_get_num_children = 2
+sig_get_child_index = 3
+sig_get_child_at_index = 4
+
+# Selectors
+selector = dict()
+
+
+def define_selector(n, name):
+    globals()["sel_" + name] = n
+    selector["@" + name] = n
+    selector[n] = "@" + name
+
+
+define_selector(0, "summary")
+define_selector(1, "type_summary")
+
+define_selector(0x10, "get_num_children")
+define_selector(0x11, "get_child_at_index")
+define_selector(0x12, "get_child_with_name")
+define_selector(0x13, "get_child_index")
+define_selector(0x15, "get_type")
+define_selector(0x16, "get_template_argument_type")
+define_selector(0x17, "cast")
+define_selector(0x20, "get_value")
+define_selector(0x21, "get_value_as_unsigned")
+define_selector(0x22, "get_value_as_signed")
+define_selector(0x23, "get_value_as_address")
+
+define_selector(0x40, "read_memory_byte")
+define_selector(0x41, "read_memory_uint32")
+define_selector(0x42, "read_memory_int32")
+define_selector(0x43, "read_memory_uint64")
+define_selector(0x44, "read_memory_int64")
+define_selector(0x45, "read_memory_address")
+define_selector(0x46, "read_memory")
+
+define_selector(0x50, "fmt")
+define_selector(0x51, "sprintf")
+define_selector(0x52, "strlen")
+
+
+################################################################################
+# Compiler.
+################################################################################
+
+
+def compile(assembler: str) -> bytearray:
+    """Compile assembler into bytecode"""
+    # This is a stack of all in-flight/unterminated blocks.
+    bytecode = [bytearray()]
+
+    def emit(byte):
+        bytecode[-1].append(byte)
+
+    tokens = list(assembler.split(" "))
+    tokens.reverse()
+    while tokens:
+        tok = tokens.pop()
+        if tok == "":
+            pass
+        elif tok == "{":
+            bytecode.append(bytearray())
+        elif tok == "}":
+            block = bytecode.pop()
+            emit(op_begin)
+            emit(len(block))  # FIXME: uleb
+            bytecode[-1].extend(block)
+        elif tok[0].isdigit():
+            if tok[-1] == "u":
+                emit(op_lit_uint)
+                emit(int(tok[:-1]))  # FIXME
+            else:
+                emit(op_lit_int)
+                emit(int(tok))  # FIXME
+        elif tok[0] == "@":
+            emit(op_lit_selector)
+            emit(selector[tok])
+        elif tok[0] == '"':
+            s = bytearray()
+            done = False
+            chrs = tok[1:]
+            while not done:
+                quoted = False
+                for c in chrs:
+                    if quoted:
+                        s.append(ord(c))  # FIXME
+                        quoted = False
+                    elif c == "\\":
+                        quoted = True
+                    elif c == '"':
+                        done = True
+                        break
+                        # FIXME assert this is last in token
+                    else:
+                        s.append(ord(c))
+                if not done:
+                    s.append(ord(" "))
+                    chrs = tokens.pop()
+
+            emit(op_lit_string)
+            emit(len(s))
+            bytecode[-1].extend(s)
+        else:
+            emit(opcode[tok])
+    assert len(bytecode) == 1  # unterminated {
+    return bytecode[0]
+
+
+################################################################################
+# Disassembler.
+################################################################################
+
+
+def disassemble(bytecode: bytearray) -> tuple[str, list[int]]:
+    """Disassemble bytecode into (assembler, token starts)"""
+    asm = ""
+    all_bytes = list(bytecode)
+    all_bytes.reverse()
+    blocks = []
+    tokens = [0]
+
+    def next_byte():
+        """Fetch the next byte in the bytecode and keep track of all
+        in-flight blocks"""
+        for i in range(len(blocks)):
+            blocks[i] -= 1
+        tokens.append(len(asm))
+        return all_bytes.pop()
+
+    while all_bytes:
+        b = next_byte()
+        if b == op_begin:
+            asm += "{"
+            length = next_byte()
+            blocks.append(length)
+        elif b == op_lit_uint:
+            b = next_byte()
+            asm += str(b)  # FIXME uleb
+            asm += "u"
+        elif b == op_lit_int:
+            b = next_byte()
+            asm += str(b)
+        elif b == op_lit_selector:
+            b = next_byte()
+            asm += selector[b]
+        elif b == op_lit_string:
+            length = next_byte()
+            s = "'"
+            while length:
+                s += chr(next_byte())
+                length -= 1
+            asm += '"' + repr(s)[2:]
+        else:
+            asm += opcode[b]
+
+        while blocks and blocks[-1] == 0:
+            asm += " }"
+            blocks.pop()
+
+        if all_bytes:
+            asm += " "
+
+    if blocks:
+        asm += "ERROR"
+    return asm, tokens
+
+
+################################################################################
+# Interpreter.
+################################################################################
+
+
+def count_fmt_params(fmt: str) -> int:
+    """Count the number of parameters in a format string"""
+    from string import Formatter
+
+    f = Formatter()
+    n = 0
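+    for _, name, _, _ in f.parse(fmt):
+        # name is None for literal text and a string such as "0" for a field.
+        if name and name.isdigit():
+            n = max(n, int(name) + 1)
+    return n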
+
+
+def interpret(bytecode: bytearray, control: list, data: list, tracing: bool = False):
+    """Interpret bytecode"""
+    frame = []
+    frame.append((0, len(bytecode)))
+
+    def trace():
+        """print a trace of the execution for debugging purposes"""
+
+        def fmt(d):
+            if isinstance(d, int):
+                return str(d)
+            if isinstance(d, str):
+                return d
+            return repr(type(d))
+
+        pc, end = frame[-1]
+        asm, tokens = disassemble(bytecode)
+        print(
+            "=== frame = {1}, data = {2}, opcode = {0}".format(
+                opcode[b], frame, [fmt(d) for d in data]
+            )
+        )
+        print(asm)
+        print(" " * (tokens[pc]) + "^")
+
+    def next_byte():
+        """Fetch the next byte and update the PC"""
+        pc, end = frame[-1]
+        assert pc < len(bytecode)
+        b = bytecode[pc]
+        frame[-1] = pc + 1, end
+        # At the end of a block?
+        while pc >= end:
+            frame.pop()
+            if not frame:
+                return None
+            pc, end = frame[-1]
+            if pc >= end:
+                return None
+            b = bytecode[pc]
+            frame[-1] = pc + 1, end
+        return b
+
+    while frame[-1][0] < len(bytecode):
+        b = next_byte()
+        if b is None:
+            break
+        if tracing:
+            trace()
+        # Data stack manipulation.
+        if b == op_dup:
+            data.append(data[-1])
+        elif b == op_drop:
+            data.pop()
+        elif b == op_pick:
+            data.append(data[data.pop()])
+        elif b == op_over:
+            data.append(data[-2])
+        elif b == op_swap:
+            x = data.pop()
+            y = data.pop()
+            data.append(x)
+            data.append(y)
+        elif b == op_rot:
+            z = data.pop()
+            y = data.pop()
+            x = data.pop()
+            data.append(z)
+            data.append(x)
+            data.append(y)
+
+        # Control stack manipulation.
+        elif b == op_begin:
+            length = next_byte()
+            pc, end = frame[-1]
+            control.append((pc, pc + length))
+            frame[-1] = pc + length, end
+        elif b == op_if:
+            # Always pop the block so it does not leak when the test fails.
+            block = control.pop()
+            if data.pop():
+                frame.append(block)
+        elif b == op_ifelse:
+            if data.pop():
+                control.pop()
+                frame.append(control.pop())
+            else:
+                frame.append(control.pop())
+                control.pop()
+
+        # Literals.
+        elif b == op_lit_uint:
+            b = next_byte()  # FIXME uleb
+            data.append(int(b))
+        elif b == op_lit_int:
+            b = next_byte()  # FIXME uleb
+            data.append(int(b))
+        elif b == op_lit_selector:
+            b = next_byte()
+            data.append(b)
+        elif b == op_lit_string:
+            length = next_byte()
+            s = ""
+            while length:
+                s += chr(next_byte())
+                length -= 1
+            data.append(s)
+
+        # Arithmetic, logic, etc.
+        elif b == op_plus:
+            data.append(data.pop() + data.pop())
+        elif b == op_minus:
+            data.append(-data.pop() + data.pop())
+        elif b == op_mul:
+            data.append(data.pop() * data.pop())
+        elif b == op_div:
+            y = data.pop()
+            data.append(data.pop() // y)
+        elif b == op_mod:
+            y = data.pop()
+            data.append(data.pop() % y)
+        elif b == op_shl:
+            y = data.pop()
+            data.append(data.pop() << y)
+        elif b == op_shr:
+            y = data.pop()
+            data.append(data.pop() >> y)
+        elif b == op_shra:
+            y = data.pop()
+            data.append(data.pop() >> y)  # FIXME
+        elif b == op_and:
+            data.append(data.pop() & data.pop())
+        elif b == op_or:
+            data.append(data.pop() | data.pop())
+        elif b == op_xor:
+            data.append(data.pop() ^ data.pop())
+        elif b == op_not:
+            data.append(not data.pop())
+        elif b == op_eq:
+            data.append(data.pop() == data.pop())
+        elif b == op_neq:
+            data.append(data.pop() != data.pop())
+        elif b == op_lt:
+            data.append(data.pop() > data.pop())
+        elif b == op_gt:
+            data.append(data.pop() < data.pop())
+        elif b == op_le:
+            data.append(data.pop() >= data.pop())
+        elif b == op_ge:
+            data.append(data.pop() <= data.pop())
+
+        # Function calls.
+        elif b == op_call:
+            sel = data.pop()
+            if sel == sel_summary:
+                data.append(data.pop().GetSummary())
+            elif sel == sel_get_num_children:
+                data.append(data.pop().GetNumChildren())
+            elif sel == sel_get_child_at_index:
+                index = data.pop()
+                valobj = data.pop()
+                data.append(valobj.GetChildAtIndex(index))
+            elif sel == sel_get_child_with_name:
+                name = data.pop()
+                valobj = data.pop()
+                data.append(valobj.GetChildMemberWithName(name))
+            elif sel == sel_get_child_index:
+                name = data.pop()
+                valobj = data.pop()
+                data.append(valobj.GetIndexOfChildWithName(name))
+            elif sel == sel_get_type:
+                data.append(data.pop().GetType())
+            elif sel == sel_get_template_argument_type:
+                n = data.pop()
+                valobj = data.pop()
+                data.append(valobj.GetTemplateArgumentType(n))
+            elif sel == sel_get_value:
+                data.append(data.pop().GetValue())
+            elif sel == sel_get_value_as_unsigned:
+                data.append(data.pop().GetValueAsUnsigned())
+            elif sel == sel_get_value_as_signed:
+                data.append(data.pop().GetValueAsSigned())
+            elif sel == sel_get_value_as_address:
+                data.append(data.pop().GetValueAsAddress())
+            elif sel == sel_cast:
+                sbtype = data.pop()
+                valobj = data.pop()
+                data.append(valobj.Cast(sbtype))
+            elif sel == sel_strlen:
+                data.append(len(data.pop()))
+            elif sel == sel_fmt:
+                fmt = data.pop()
+                n = count_fmt_params(fmt)
+                args = []
+                for i in range(n):
+                    args.append(data.pop())
+                data.append(fmt.format(*args))
+            else:
+                raise NotImplementedError("not implemented: " + selector[sel])
+    return data[-1]
+
+
+################################################################################
+# Tests.
+################################################################################
+
+import unittest
+
+
+class TestCompiler(unittest.TestCase):
+    def test(self):
+        self.assertEqual(compile("1u dup").hex(), "200101")
+        self.assertEqual(compile('"1u dup"').hex(), "2206317520647570")
+        self.assertEqual(compile("16 < { dup } if").hex(), "21105210010111")
+        self.assertEqual(compile('{ { " } " } }').hex(), "100710052203207d20")
+
+        def roundtrip(asm):
+            self.assertEqual(disassemble(compile(asm))[0], asm)
+
+        roundtrip("1u dup")
+        roundtrip('1u dup "1u dup"')
+        roundtrip("16 < { dup } if")
+        roundtrip('{ { " } " } }')
+
+        self.assertEqual(interpret(compile("1 1 +"), [], []), 2)
+        self.assertEqual(interpret(compile("2 1 1 + *"), [], []), 4)
+        self.assertEqual(
+            interpret(compile('2 1 > { "yes" } { "no" } ifelse'), [], []), "yes"
+        )
+
+
+if __name__ == "__main__":
+    unittest.main()
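A note on the `# FIXME` next to `op_shra` in compiler.py above: Python integers are unbounded, so `>>` alone cannot distinguish the two right shifts listed in the instruction table. One way to model the difference is sketched below. It rests on an assumption the document does not spell out, namely that `>>` is a logical shift on the 64-bit pattern and `shra` replicates the sign bit. The names `MASK64`, `to_signed64`, `shr_logical` and `shr_arithmetic` are hypothetical and not part of the patch.

```python
MASK64 = (1 << 64) - 1  # all-ones mask for a 64-bit value


def to_signed64(value: int) -> int:
    """Reinterpret the low 64 bits of value as a two's-complement Int."""
    value &= MASK64
    return value - (1 << 64) if value & (1 << 63) else value


def shr_logical(value: int, amount: int) -> int:
    """Assumed `>>` semantics: vacated high bits are filled with zeros."""
    return (value & MASK64) >> amount


def shr_arithmetic(value: int, amount: int) -> int:
    """Assumed `shra` semantics: the sign bit fills the vacated high bits."""
    return to_signed64(value) >> amount


print(hex(shr_logical(-8, 1)))
print(shr_arithmetic(-8, 1))
```

For non-negative inputs below 2**63 both helpers agree; they only diverge once bit 63 is set.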
diff --git a/lldb/examples/formatter-bytecode/test/MyOptional.cpp b/lldb/examples/formatter-bytecode/test/MyOptional.cpp
new file mode 100644
index 00000000000000..abba34439d88f4
--- /dev/null
+++ b/lldb/examples/formatter-bytecode/test/MyOptional.cpp
@@ -0,0 +1,23 @@
+// A bare-bones llvm::Optional reimplementation.
+
+template <typename T> struct MyOptionalStorage {
+  MyOptionalStorage(T val) : value(val), hasVal(true) {}
+  MyOptionalStorage() {}
+  T value;
+  bool hasVal = false;
+};
+
+template <typename T> struct MyOptional {
+  MyOptionalStorage<T> Storage;
+  MyOptional(T val) : Storage(val) {}
+  MyOptional() {}
+  T &operator*() { return Storage.value; }
+};
+
+void stop() {}
+
+int main(int argc, char **argv) {
+  MyOptional<int> x, y = 42;
+  stop(); // break here
+  return *y;
+}
diff --git a/lldb/examples/formatter-bytecode/test/formatter.py b/lldb/examples/formatter-bytecode/test/formatter.py
new file mode 100644
index 00000000000000..8ac58cbc29f2ed
--- /dev/null
+++ b/lldb/examples/formatter-bytecode/test/formatter.py
@@ -0,0 +1,131 @@
+"""
+This is the llvm::Optional data formatter from llvm/utils/lldbDataFormatters.py
+with the implementation replaced by bytecode.
+"""
+from __future__ import annotations
+from compiler import *
+import lldb
+
+
+def __lldb_init_module(debugger, internal_dict):
+    debugger.HandleCommand(
+        "type synthetic add -w llvm "
+        f"-l {__name__}.MyOptionalSynthProvider "
+        '-x "^MyOptional<.+>$"'
+    )
+    debugger.HandleCommand(
+        "type summary add -w llvm "
+        f"-e -F {__name__}.MyOptionalSummaryProvider "
+        '-x "^MyOptional<.+>$"'
+    )
+
+
+def evaluate(assembler: str, data: list):
+    bytecode = compile(assembler)
+    trace = True
+    if trace:
+        print(
+            "Compiled to {0} bytes of bytecode:\n0x{1}".format(
+                len(bytecode), bytecode.hex()
+            )
+        )
+    result = interpret(bytecode, [], data, False)  # trace)
+    if trace:
+        print("--> {0}".format(result))
+    return result
+
+
+# def GetOptionalValue(valobj):
+#    storage = valobj.GetChildMemberWithName("Storage")
+#    if not storage:
+#        storage = valobj
+#
+#    failure = 2
+#    hasVal = storage.GetChildMemberWithName("hasVal").GetValueAsUnsigned(failure)
+#    if hasVal == failure:
+#        return "<could not read MyOptional>"
+#
+#    if hasVal == 0:
+#        return None
+#
+#    underlying_type = storage.GetType().GetTemplateArgumentType(0)
+#    storage = storage.GetChildMemberWithName("value")
+#    return storage.Cast(underlying_type)
+
+
+def MyOptionalSummaryProvider(valobj, internal_dict):
+    #    val = GetOptionalValue(valobj)
+    #    if val is None:
+    #        return "None"
+    #    if val.summary:
+    #        return val.summary
+    #    return val.GetValue()
+    summary = ""
+    summary += ' dup "Storage" @get_child_with_name call'  # valobj storage
+    summary += " dup { swap } if drop"  # storage
+    summary += ' dup "hasVal" @get_child_with_name call'  # storage
+    summary += " @get_value_as_unsigned call"  # storage int(hasVal)
+    summary += ' dup 2 = { drop "<could not read MyOptional>" } {'
+    summary += '   0 = { "None" } {'
+    summary += (
+        "     dup @get_type call 0 @get_template_argument_type call"  # storage type
+    )
+    summary += "     swap"  # type storage
+    summary += '     "value" @get_child_with_name call'  # type value
+    summary += "     swap @cast call"  # type(value)
+    summary += '     dup 0 = { "None" } {'
+    summary += "       dup @summary call { @summary call } { @get_value call } ifelse"
+    summary += "     } ifelse"
+    summary += "   } ifelse"
+    summary += " } ifelse"
+    return evaluate(summary, [valobj])
+
+
+class MyOptionalSynthProvider:
+    """Provides deref support to llvm::Optional<T>"""
+
+    def __init__(self, valobj, internal_dict):
+        self.valobj = valobj
+
+    def num_children(self):
+        # return self.valobj.num_children
+        num_children = " @get_num_children call"
+        return evaluate(num_children, [self.valobj])
+
+    def get_child_index(self, name):
+        # if name == "$$dereference$$":
+        #    return self.valobj.num_children
+        # return self.valobj.GetIndexOfChildWithName(name)
+        get_child_index = ' dup "$$dereference$$" ='
+        get_child_index += " { drop @get_num_children call } {"  # obj name
+        get_child_index += "   @get_child_index call"  # index
+        get_child_index += " } ifelse"
+        return evaluate(get_child_index, [self.valobj, name])
+
+    def get_child_at_index(self, index):
+        # if index < self.valobj.num_children:
+        #    return self.valobj.GetChildAtIndex(index)
+        # return GetOptionalValue(self.valobj) or lldb.SBValue()
+        get_child_at_index = " over over swap"  # obj index index obj
+        get_child_at_index += " @get_num_children call"  # obj index index n
+        get_child_at_index += " < { @get_child_at_index call } {"  # obj index
+
+        get_opt_val = ' dup "Storage" @get_child_with_name call'  # valobj storage
+        get_opt_val += " dup { swap } if drop"  # storage
+        get_opt_val += ' dup "hasVal" @get_child_with_name call'  # storage
+        get_opt_val += " @get_value_as_unsigned call"  # storage int(hasVal)
+        get_opt_val += ' dup 2 = { drop "<could not read MyOptional>" } {'
+        get_opt_val += '   0 = { "None" } {'
+        get_opt_val += (
+            "     dup @get_type call 0 @get_template_argument_type call"  # storage type
+        )
+        get_opt_val += "     swap"  # type storage
+        get_opt_val += '     "value" @get_child_with_name call'  # type value
+        get_opt_val += "     swap @cast call"  # type(value)
+        get_opt_val += "   } ifelse"
+        get_opt_val += " } ifelse"
+
+        get_child_at_index += get_opt_val
+        get_child_at_index += " } ifelse"
+
+        return evaluate(get_child_at_index, [self.valobj, index])
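
Reviewer note on the stack comments in `get_child_at_index` above: the prefix ` over over swap @get_num_children call <` is easier to follow with a trace. The sketch below is a toy model of the data stack only, not the interpreter from `compiler.py`. `FakeValue`, `run`, and the `num_children` word (standing in for `@get_num_children call`) are hypothetical names made up for illustration. It also assumes that `<` compares the second-from-top element with the top, following the `(x y -> [x+y])` convention in the documentation.

```python
class FakeValue:
    """Hypothetical stand-in for an SBValue with a fixed number of children."""

    def __init__(self, num_children):
        self.num_children = num_children

    def __repr__(self):
        return "obj"


def run(words, stack):
    """Apply a few stack words to `stack`; the top of the stack is stack[-1]."""
    trace = []
    for w in words:
        if w == "over":  # (x y -> x y x)
            stack.append(stack[-2])
        elif w == "swap":  # (x y -> y x)
            stack[-1], stack[-2] = stack[-2], stack[-1]
        elif w == "num_children":  # assumed stand-in for "@get_num_children call"
            stack.append(stack.pop().num_children)
        elif w == "<":  # assumed (x y -> [x<y])
            y = stack.pop()
            x = stack.pop()
            stack.append(1 if x < y else 0)
        else:
            raise ValueError(f"unknown word: {w}")
        trace.append((w, list(stack)))  # snapshot after each word
    return trace


obj = FakeValue(num_children=2)
trace = run(["over", "over", "swap", "num_children", "<"], [obj, 5])
for word, snapshot in trace:
    print(f"{word:>12}: {snapshot}")
```

With index 5 and two children the final stack is `[obj, 5, 0]`, so `ifelse` would take the second block (the optional-value path). With index 1 the condition is 1 and the first block, `@get_child_at_index call`, would run on `obj 1`.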

>From cf1b608ed251a4c93296dbf3ea1a05edff4646c8 Mon Sep 17 00:00:00 2001
From: Adrian Prantl <aprantl at apple.com>
Date: Tue, 29 Oct 2024 09:55:18 -0700
Subject: [PATCH 2/2] minor corrections from C++ implementations

---
 lldb/docs/resources/formatterbytecode.rst     | 48 ++++++++++++-------
 lldb/examples/formatter-bytecode/compiler.py  | 13 +++--
 .../formatter-bytecode/test/formatter.py      | 48 ++++++++++++-------
 3 files changed, 69 insertions(+), 40 deletions(-)

diff --git a/lldb/docs/resources/formatterbytecode.rst b/lldb/docs/resources/formatterbytecode.rst
index e6a52a2bdddcdb..aab391113bd44b 100644
--- a/lldb/docs/resources/formatterbytecode.rst
+++ b/lldb/docs/resources/formatterbytecode.rst
@@ -50,30 +50,30 @@ Stack operations
 These instructions manipulate the data stack directly.
 
 ========  ==========  ===========================
- Opcode    Mnemonic    Stack effect              
+ Opcode    Mnemonic    Stack effect
 --------  ----------  ---------------------------
- 0x00      `dup`       `(x -> x x)`              
- 0x01      `drop`      `(x y -> x)`               
- 0x02      `pick`      `(x ... UInt -> x ... x)`  
- 0x03      `over`      `(x y -> x y x)`           
- 0x04      `swap`      `(x y -> y x)`             
- 0x05      `rot`       `(x y z -> z x y)`         
+ 0x00      `dup`       `(x -> x x)`
+ 0x01      `drop`      `(x y -> x)`
+ 0x02      `pick`      `(x ... UInt -> x ... x)`
+ 0x03      `over`      `(x y -> x y x)`
+ 0x04      `swap`      `(x y -> y x)`
+ 0x05      `rot`       `(x y z -> z x y)`
 =======  ==========  ===========================
 
 Control flow
 ~~~~~~~~~~~~
 
-These manipulate the control stack and program counter.
+These manipulate the control stack and program counter. Both `if` and `ifelse` expect a `UInt` at the top of the data stack to represent the condition.
 
 ========  ==========  ============================================================
- Opcode    Mnemonic    Description              
+ Opcode    Mnemonic    Description
 --------  ----------  ------------------------------------------------------------
  0x10       `{`        push a code block address onto the control stack
   --        `}`        (technically not an opcode) syntax for end of code block
- 0x11      `if`        pop a block from the control stack,
+ 0x11      `if`        `(UInt -> )` pop a block from the control stack,
                        if the top of the data stack is nonzero, execute it
- 0x12      `ifelse`    pop two blocks from the control stack, if the top of
-                       the data stack is nonzero, execute the first,
+ 0x12      `ifelse`    `(UInt -> )` pop two blocks from the control stack, if
+                       the top of the data stack is nonzero, execute the first,
                        otherwise the second.
 ========  ==========  ============================================================
 
@@ -81,7 +81,7 @@ Literals for basic types
 ~~~~~~~~~~~~~~~~~~~~~~~~
 
 ========  ===========  ============================================================
- Opcode    Mnemonic    Description              
+ Opcode    Mnemonic    Description
 --------  -----------  ------------------------------------------------------------
  0x20      `123u`      `( -> UInt)` push an unsigned 64-bit host integer
  0x21      `123`       `( -> Int)` push a signed 64-bit host integer
@@ -90,11 +90,25 @@ Literals for basic types
                        selectors. See `call`.
 ========  ===========  ============================================================
 
+Conversion operations
+~~~~~~~~~~~~~~~~~~~~~
+
+========  ===========  ================================================================
+ Opcode    Mnemonic    Description
+--------  -----------  ----------------------------------------------------------------
+ 0x2a      `as_int`    `( UInt -> Int)` reinterpret a UInt as an Int
+ 0x2b      `as_uint`   `( Int -> UInt)` reinterpret an Int as a UInt
+ 0x2c      `is_null`   `( Object -> UInt )` check an object for null `(object ? 0 : 1)`
+========  ===========  ================================================================
+
+
 Arithmetic, logic, and comparison operations
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
+All of these operations are defined only for `Int` and `UInt`, and both operands must be of the same type. The `>>` operator is an arithmetic shift if its operands are of type `Int`; otherwise it is a logical shift to the right.
+
 ========  ==========  ===========================
- Opcode    Mnemonic    Stack effect              
+ Opcode    Mnemonic    Stack effect
 --------  ----------  ---------------------------
  0x30      `+`         `(x y -> [x+y])`
  0x31      `-`          etc ...
@@ -103,7 +117,6 @@ Arithmetic, logic, and comparison operations
  0x34      `%`
  0x35      `<<`
  0x36      `>>`
- 0x37      `shra`      (arithmetic shift right)
  0x40      `~`
  0x41      `|`
  0x42      `^`
@@ -121,7 +134,7 @@ Function calls
 For security reasons the list of functions callable with `call` is predefined. The supported functions are either existing methods on `SBValue`, or string formatting operations.
 
 ========  ==========  ============================================
- Opcode    Mnemonic    Stack effect              
+ Opcode    Mnemonic    Stack effect
 --------  ----------  --------------------------------------------
  0x60      `call`      `(Object argN ... arg0 Selector -> retval)`
 ========  ==========  ============================================
@@ -144,7 +157,6 @@ Sel.  Mnemonic                      Stack Effect
 0x21  `get_value_as_unsigned`       `(Object @get_value_as_unsigned -> UInt)`            `SBValue::GetValueAsUnsigned`
 0x22  `get_value_as_signed`         `(Object @get_value_as_signed -> Int)`               `SBValue::GetValueAsSigned`
 0x23  `get_value_as_address`        `(Object @get_value_as_address -> UInt)`             `SBValue::GetValueAsAddress`
-0x24  `get_value_as_address`        `(Object @get_value_as_address -> UInt)`             `SBValue::GetValueAsAddress`
 0x40  `read_memory_byte`            `(UInt @read_memory_byte -> UInt)`                   `Target::ReadMemory`
 0x41  `read_memory_uint32`          `(UInt @read_memory_uint32 -> UInt)`                 `Target::ReadMemory`
 0x42  `read_memory_int32`           `(UInt @read_memory_int32 -> Int)`                   `Target::ReadMemory`
@@ -202,7 +214,7 @@ Signature    Mnemonic             Stack Effect
   0x02     `@get_num_children`    `(Object+ -> UInt)`
   0x03     `@get_child_index`     `(Object+ String -> UInt)`
   0x04     `@get_child_at_index`  `(Object+ UInt -> Object)`
-  0x05     `@get_value`           `(Object+ -> Object)`
+  0x05     `@get_value`           `(Object+ -> String)`
 =========  ====================== ==========================
 
 If not specified, the init function defaults to an empty function that just passes the Object along. Its results may be cached and allow common prep work to be done for an Object that can be reused by subsequent calls to the other methods. This way subsequent calls to `@get_child_at_index` can avoid recomputing shared information, for example.
diff --git a/lldb/examples/formatter-bytecode/compiler.py b/lldb/examples/formatter-bytecode/compiler.py
index 2780d1129f933a..001d5e14f33b18 100644
--- a/lldb/examples/formatter-bytecode/compiler.py
+++ b/lldb/examples/formatter-bytecode/compiler.py
@@ -40,6 +40,10 @@ def define_opcode(n, mnemonic, name):
 define_opcode(0x22, None, "lit_string")
 define_opcode(0x23, None, "lit_selector")
 
+define_opcode(0x2a, "as_int", "as_int")
+define_opcode(0x2b, "as_uint", "as_uint")
+define_opcode(0x2c, "is_null", "is_null")
+
 define_opcode(0x30, "+", "plus")
 define_opcode(0x31, "-", "minus")
 define_opcode(0x32, "*", "mul")
@@ -47,7 +51,6 @@ def define_opcode(n, mnemonic, name):
 define_opcode(0x34, "%", "mod")
 define_opcode(0x35, "<<", "shl")
 define_opcode(0x36, ">>", "shr")
-define_opcode(0x37, "shra", "shra")
 
 define_opcode(0x40, "&", "and")
 define_opcode(0x41, "|", "or")
@@ -357,6 +360,11 @@ def next_byte():
                 length -= 1
             data.append(s)
 
+        elif b == op_as_uint: pass
+        elif b == op_as_int: pass
+        elif b == op_is_null:
+            data.append(1 if data.pop() is None else 0)
+
         # Arithmetic, logic, etc.
         elif b == op_plus:
             data.append(data.pop() + data.pop())
@@ -376,9 +384,6 @@ def next_byte():
         elif b == op_shr:
             y = data.pop()
             data.append(data.pop() >> y)
-        elif b == op_shra:
-            y = data.pop()
-            data.append(data.pop() >> y)  # FIXME
         elif b == op_and:
             data.append(data.pop() & data.pop())
         elif b == op_or:
diff --git a/lldb/examples/formatter-bytecode/test/formatter.py b/lldb/examples/formatter-bytecode/test/formatter.py
index 8ac58cbc29f2ed..9b5e2cab7846b3 100644
--- a/lldb/examples/formatter-bytecode/test/formatter.py
+++ b/lldb/examples/formatter-bytecode/test/formatter.py
@@ -19,14 +19,27 @@ def __lldb_init_module(debugger, internal_dict):
         '-x "^MyOptional<.+>$"'
     )
 
-
+def stringify(bytecode: bytearray) -> str:
+    s = ""
+    in_hex = False
+    for b in bytecode:
+        # A hex digit right after a \x escape is escaped too, so it is not read as part of that escape.
+        after_escape = in_hex and chr(b).lower() in "0123456789abcdef"
+        if b < 32 or b > 127 or chr(b) in ['"', "`", "'"] or after_escape:
+            s += r"\x" + hex(b)[2:]
+            in_hex = True
+        else:
+            s += chr(b)
+            in_hex = False
+    return s
+
 def evaluate(assembler: str, data: list):
     bytecode = compile(assembler)
     trace = True
     if trace:
         print(
-            "Compiled to {0} bytes of bytecode:\n0x{1}".format(
-                len(bytecode), bytecode.hex()
+            "Compiled to {0} bytes of bytecode:\n{1}".format(
+                len(bytecode), stringify(bytecode)
             )
         )
     result = interpret(bytecode, [], data, False)  # trace)
@@ -62,22 +75,21 @@ def MyOptionalSummaryProvider(valobj, internal_dict):
     #    return val.GetValue()
     summary = ""
     summary += ' dup "Storage" @get_child_with_name call'  # valobj storage
-    summary += " dup { swap } if drop"  # storage
-    summary += ' dup "hasVal" @get_child_with_name call'  # storage
-    summary += " @get_value_as_unsigned call"  # storage int(hasVal)
-    summary += ' dup 2 = { drop "<could not read MyOptional>" } {'
-    summary += '   0 = { "None" } {'
-    summary += (
-        "     dup @get_type call 0 @get_template_argument_type call"  # storage type
-    )
-    summary += "     swap"  # type storage
+    summary += ' dup is_null ~ { swap } if drop'  # storage
+    summary += ' dup "hasVal" @get_child_with_name call'  # storage obj(hasVal)
+    summary += ' dup is_null { drop "<could not read MyOptional>" } {'
+    summary += '   @get_value_as_unsigned call'  # storage int(hasVal)
+    summary += '   0u = { "None" } {'
+    summary += '     dup @get_type call'
+    summary += '     0u @get_template_argument_type call'  # storage type
+    summary += '     swap'  # type storage
     summary += '     "value" @get_child_with_name call'  # type value
-    summary += "     swap @cast call"  # type(value)
-    summary += '     dup 0 = { "None" } {'
-    summary += "       dup @summary call { @summary call } { @get_value call } ifelse"
-    summary += "     } ifelse"
-    summary += "   } ifelse"
-    summary += " } ifelse"
+    summary += '     swap @cast call'  # type(value)
+    summary += '     dup is_null { "None" } {'
+    summary += '       dup @summary call { @summary call } { @get_value call } ifelse'
+    summary += '     } ifelse'
+    summary += '   } ifelse'
+    summary += ' } ifelse'
     return evaluate(summary, [valobj])