[PATCH] D53379: GSYM symbolication format

Mon Oct 29 13:54:12 PDT 2018

clayborg added inline comments.

================
Comment at: lib/DebugInfo/GSYM/README.md:42
+  uint32_t magic;
+  uint16_t version;
+  uint8_t  addr_off_size;
----------------
markmentovai wrote:
> Expectations around this field?
> 
> That is: should a v1-compliant reader reject v2 input (thereby making this field essentially just another magic number), or should v2 input be considered a strict superset of the v1 format, such that a v1 reader could process it correctly while ignoring whatever extensions v2 defines (probably via additional fields in this struct)?
> 
> For some purposes, a “header size” field may be better than a version field, because it indicates not just the version of the header structure (assuming you only grow), but also tells you where the data that follows (in this case, the address table) actually begins.
I was thinking that the magic and version are the only things that are fixed within this format. Anything that follows is specific to s specific version. I think it is nice to be able to change anything you want. Would be very east to later have HeaderV1, HeaderV2, etc. The idea is that the version number gets bumped and anything in the file format can change. I am open to other ideas on this, as this is just how it is currently coded.

================
Comment at: lib/DebugInfo/GSYM/README.md:45
+  uint8_t  uuid_size;
+  uint64_t base_address;
+  uint32_t num_addrs;
----------------
markmentovai wrote:
> Define what the base_address is relative to. The load address of a particular segment or section? The lowest possible load address? The ELF or Mach-O header?
> 
> The symbol table analogue is that each address in the table is relative to a particular section (except for special cases, such as those interpreted as “absolute”), in nlist::n_sect or ElfN_Sym::st_shndx. This should be nailed down for GSYM a bit better too. This has implications on the common base_address approach.
Currently "base_address" is actually the min address for all addresses in the "Address Offsets Table". We can choose to define this differently, but the smaller the address offsets are in the "Address Offsets Table" the better for the size of this file. Address offsets are used for both symbols and function info with debug info (line table + inline info).

================
Comment at: lib/DebugInfo/GSYM/README.md:49
+  uint32_t strtab_size;
+  uint8_t  uuid[20];
+};
----------------
markmentovai wrote:
> The 20-byte UUID matches Mach-O practice, but ELF IDs are just a bunch of bytes. They may be 20-byte UUIDs or UUID-like, but they may also be longer or shorter.
> 
> Do you want to provide guidance for what to do in case the corresponding binary image doesn’t have a build ID at all?
We can define this to just be "uint8_t uuid[];" and allow the header size to vary depending on the Header.uuid_size value if needed. Currently in LLDB, we handle any size UUID. I am find with modifying this to not have a size. This will allow headers to be shorter if needed. What do you think? 

================
Comment at: lib/DebugInfo/GSYM/README.md:53
+
+The magic value is set to `GSYM_MAGIC` and allows quick and easy detection of this file format when it is loaded. Addresses in the address table are stored as offsets from a 64 bit address found in `Header.base_address`. This allows the address table to contain 32, 16 or 8 bit offsets, instead of a table of full sized addresses. The file size is smaller and causes fewer pages to be touched during address lookups when the address table is smaller. The size of the address offsets in the address table is specified in the header in `Header.addr_off_size`. The header contains a UUID to ensure the GSYM file can be properly matched to the object ELf or mach-o file that created the stack trace. The header specifies the location of the string table for all strings contained in the GSYM file, or can point to an existing string table within a ELF or mach-o file.
+
----------------
markmentovai wrote:
> markmentovai wrote:
> > ELF, not ELf.
> > 
> > (Would be easier to review if this were hard-wrapped.)
> So valid values for addr_off_size are 8, 16, and 32, it sounds. Not 64?
> 
> Not representing as n and interpreting as 2^(n+3) or 2^(n+4)? That might give you a better layout for future placement of extension bits in this field.
Address offsets can be 64 bit. I will fix the docs. We could decide to do 24 bit, 40 bit, 48 bit offsets etc if that helps with file size at some point. Not keeping a power of two gives us that option?

================
Comment at: lib/DebugInfo/GSYM/README.md:75
+```
+The file table starts with a 32 bit count of the number of files that are used in all of the address data, followed by that number of `FileInfo` structures.
+
----------------
markmentovai wrote:
> Rather than putting this in the file table, could you move it into the header? That’d make it easier to locate data that follows (like the string table) without having to pull another page in from disk if the file table isn’t otherwise needed.
> 
> It also seems weird that the file table is the only structure that’s not sized in the header. You’ve got num_addrs and strtab_size there.
Yep, easy to move the file count into the header.

================
Comment at: lib/DebugInfo/GSYM/README.md:96
+more strings. The format is the same as the DWARF .debug_str format with an
+additional restriction of being required to start with empty string. The string
+table is specified in the GSYM header with the `Header.strtab_offset` and
----------------
markmentovai wrote:
> What’s the empty string for?
It makes for an easy invalid string table offset value to be zero. Also used in many DWARF string tables. If the empty string is always zero, it makes it easy to look at a string table index in your structure and know it is invalid instead of having to know that "0x12340400" is the empty string (and would differ for each string table. One byte extra only, so thought it was worth doing this like other string tables.

================
Comment at: lib/DebugInfo/GSYM/README.md:98
+table is specified in the GSYM header with the `Header.strtab_offset` and
+`Header.strtab_size` fields. The `Header.strtab_offset` is an absolute offset in
+the file for the string table. This allows the string table to share other
----------------
markmentovai wrote:
> If your base_offset is 64 bits wide in contemplation of large files, why is strtab_offset only 32 bits?
I started with a GSYM stand alone format and added the ability to be in a section later. Easy to bump the string table offset to 64 bit with a 64 bit size. By "base_offset" I believe you mean "base_address"? 

================
Comment at: lib/DebugInfo/GSYM/README.md:99
+`Header.strtab_size` fields. The `Header.strtab_offset` is an absolute offset in
+the file for the string table. This allows the string table to share other
+string table sections that might exist in the file when the GSYM data is a
----------------
markmentovai wrote:
> “the file” = relative to what? The image header? The first section? The section loaded at the lowest address? A special magic section? The file on disk that hosts the image (which may be fat?)
Since this data is in a stand alone file or a section in an object file. it should be relative to the start of the file for a stand alone GSYM file, or the Image header if in ELF (Elf Header) or the Mach-o (Mach-o header in a universal file or single arch file).

================
Comment at: lib/DebugInfo/GSYM/README.md:112
+```
+It starts with a 32 bit size for the address range of the functiopn and is followed by the 32 bit string table offset for the name of the function. The size of the address range is important to encode as it stops address lookups from matching if the address is between two functions in some padding. This is followed by an array of address data information:
+```
----------------
markmentovai wrote:
> markmentovai wrote:
> > functiopn
> Is the name of the function encoded in decorated form?
It is currently encoded in mangled form if that was available from the debug info, or in demangled fully qualified form (calculated in DWARF when to DW_AT_linkage_name is available) or just a base name when the parent decl context is the file. Could be defined as "shortest form that is available for a fully qualified name to the function"?

================
Comment at: lib/DebugInfo/GSYM/README.md:116
+    uint32_t type;
+    uint32_t length;
+    uint8_t data[length];
----------------
markmentovai wrote:
> If you store a number-of-address-data-elements field in AddressInfo, you can save four bytes per AddressInfo by not needing to carry the unnecessary “length” field for EndOfList AddressData entries.
I like the idea of having the number-of-address-data-elements in the AddressInfo and get rid o the EndOfList.

I would like to keep the length field if at all possible because you might want to skip data you don't care about. For example, if we have a AddressData for line table, inline info and for unwind info, you might want to skip to the unwind info. You wouldn't want to have to run the streaming parser for the line table and for the inline info if you just want the unwind info. So I like the length field here as it makes it easy to skip unwanted info.

================
Comment at: lib/DebugInfo/GSYM/README.md:129
+```
+The `AddressInfo.data[]` is encoded as a vector of AddressData structs that is terminated by a `AddressData` struct whose type is set to `InfoType.EndOfList`. This allows the GSYM file format the contain arbitrary data for any address range and allows us to expand the GSYM capabilities as we find more uses for it.
+
----------------
markmentovai wrote:
> But this also means that you may wind up expressing things redundantly. Address ranges that apply to a single conceptual LineTableInfo and a single conceptual InlineInfo may not coincide. The definition here means that you’re going to have to split everything up at each boundary. You wind up, for example, having to express the same InlineInfo twice, just because there was a break in LineTableInfo.
> 
> This becomes more important as you add other InfoTypes, such as an InfoType for CFI, as Leonard pointed out would be necessary. Same goes for an InfoType that would allow recovery of function parameters and local variables.
Some addresses may be redundant, but the line table can be more efficient when it doesn't have many extra opcodes like the actual DWARF line table (stmt_begin, stmt_end, prologue_begin, prologue_end). If we merge all of the info together it encodes less efficiently. Also, you might want to "thin" out a GSYM file to contain only unwind info. If they are all separate chunks, very easy to go. All addresses are stored as offsets from the start address of the AddressInfo, so they tend to be encoded as ULEB numbers. Any lexical scopes are relative to their parent scope (the function or a another scopes). So addresses tend to be encoded very efficiently.

================
Comment at: lib/DebugInfo/GSYM/README.md:133
+
+`InfoType::LineTableInfo` is a modified version of the DWARF line tables that efficiently stores line table information for each function. DWARF stores line table information for an entire source file and includes all functions. Having each function's line table encoded separately allows fewer pages to be touched when looking up the line entry for a specific address. The information is optional and can be omitted fo address data that is from a symbol or label where no line table information is available.
+
----------------
markmentovai wrote:
> markmentovai wrote:
> > fo → for
> Can you describe the format, or at least the modifications to DWARF?
I will add that complete description. Very pared down to just file + line + address (no stmt_begin, stmt_end, prologue_end, epilogue_begin, many other things have been removed and given back to the opcode that advances line and address in 1 byte.

================
Comment at: lib/DebugInfo/GSYM/README.md:135
+
+`InfoType::InlineInfo` is a format that encodes inline call stacks. This information is optional and doesn't need to be included for each address. If the function has no inlined functions this data should not be included.
----------------
markmentovai wrote:
> Same: can you describe the format?
Will do.

https://reviews.llvm.org/D53379