[PATCH] D53379: GSYM symbolication format

Mon Oct 29 09:48:23 PDT 2018

markmentovai added a comment.

I’m glad you’re able to work on this publicly again! I’ve made a pass through just the README.md only at first, just to help nail down the format.

================
Comment at: lib/DebugInfo/GSYM/README.md:9
+## Converting DWARF Files to GSYM
+`llvm-dsymutil` is available in the `llvm/tools/gsym` directory and has options to convert DWARF into GSYM files. `llvm-dsymutil` has a `-dwarf` option that specifies a DWARF file to convert into a GSYM file. The output file can be specified with the `-out-file` option.
+```
----------------
llvm-**g**symutil

throughout this file

================
Comment at: lib/DebugInfo/GSYM/README.md:42
+  uint32_t magic;
+  uint16_t version;
+  uint8_t  addr_off_size;
----------------
Expectations around this field?

That is: should a v1-compliant reader reject v2 input (thereby making this field essentially just another magic number), or should v2 input be considered a strict superset of the v1 format, such that a v1 reader could process it correctly while ignoring whatever extensions v2 defines (probably via additional fields in this struct)?

For some purposes, a “header size” field may be better than a version field, because it indicates not just the version of the header structure (assuming you only grow), but also tells you where the data that follows (in this case, the address table) actually begins.

================
Comment at: lib/DebugInfo/GSYM/README.md:45
+  uint8_t  uuid_size;
+  uint64_t base_address;
+  uint32_t num_addrs;
----------------
Define what the base_address is relative to. The load address of a particular segment or section? The lowest possible load address? The ELF or Mach-O header?

The symbol table analogue is that each address in the table is relative to a particular section (except for special cases, such as those interpreted as “absolute”), in nlist::n_sect or ElfN_Sym::st_shndx. This should be nailed down for GSYM a bit better too. This has implications on the common base_address approach.

================
Comment at: lib/DebugInfo/GSYM/README.md:49
+  uint32_t strtab_size;
+  uint8_t  uuid[20];
+};
----------------
The 20-byte UUID matches Mach-O practice, but ELF IDs are just a bunch of bytes. They may be 20-byte UUIDs or UUID-like, but they may also be longer or shorter.

Do you want to provide guidance for what to do in case the corresponding binary image doesn’t have a build ID at all?

================
Comment at: lib/DebugInfo/GSYM/README.md:53
+
+The magic value is set to `GSYM_MAGIC` and allows quick and easy detection of this file format when it is loaded. Addresses in the address table are stored as offsets from a 64 bit address found in `Header.base_address`. This allows the address table to contain 32, 16 or 8 bit offsets, instead of a table of full sized addresses. The file size is smaller and causes fewer pages to be touched during address lookups when the address table is smaller. The size of the address offsets in the address table is specified in the header in `Header.addr_off_size`. The header contains a UUID to ensure the GSYM file can be properly matched to the object ELf or mach-o file that created the stack trace. The header specifies the location of the string table for all strings contained in the GSYM file, or can point to an existing string table within a ELF or mach-o file.
+
----------------
ELF, not ELf.

(Would be easier to review if this were hard-wrapped.)

================
Comment at: lib/DebugInfo/GSYM/README.md:53
+
+The magic value is set to `GSYM_MAGIC` and allows quick and easy detection of this file format when it is loaded. Addresses in the address table are stored as offsets from a 64 bit address found in `Header.base_address`. This allows the address table to contain 32, 16 or 8 bit offsets, instead of a table of full sized addresses. The file size is smaller and causes fewer pages to be touched during address lookups when the address table is smaller. The size of the address offsets in the address table is specified in the header in `Header.addr_off_size`. The header contains a UUID to ensure the GSYM file can be properly matched to the object ELf or mach-o file that created the stack trace. The header specifies the location of the string table for all strings contained in the GSYM file, or can point to an existing string table within a ELF or mach-o file.
+
----------------
markmentovai wrote:
> ELF, not ELf.
> 
> (Would be easier to review if this were hard-wrapped.)
So valid values for addr_off_size are 8, 16, and 32, it sounds. Not 64?

Not representing as n and interpreting as 2^(n+3) or 2^(n+4)? That might give you a better layout for future placement of extension bits in this field.

================
Comment at: lib/DebugInfo/GSYM/README.md:75
+```
+The file table starts with a 32 bit count of the number of files that are used in all of the address data, followed by that number of `FileInfo` structures.
+
----------------
Rather than putting this in the file table, could you move it into the header? That’d make it easier to locate data that follows (like the string table) without having to pull another page in from disk if the file table isn’t otherwise needed.

It also seems weird that the file table is the only structure that’s not sized in the header. You’ve got num_addrs and strtab_size there.

================
Comment at: lib/DebugInfo/GSYM/README.md:93
+information must be stored as 32 bit string table offsets into this string
+table. Strings are stored as NULL terminated UTF8 strings. The format of the
+string table starts with an empty string at offset zero followed by zero or
----------------
As much as I’d love for the world to grow up and use UTF-8 everywhere, specifying it here is maybe a little stronger than can reliably be guaranteed.

================
Comment at: lib/DebugInfo/GSYM/README.md:96
+more strings. The format is the same as the DWARF .debug_str format with an
+additional restriction of being required to start with empty string. The string
+table is specified in the GSYM header with the `Header.strtab_offset` and
----------------
What’s the empty string for?

================
Comment at: lib/DebugInfo/GSYM/README.md:98
+table is specified in the GSYM header with the `Header.strtab_offset` and
+`Header.strtab_size` fields. The `Header.strtab_offset` is an absolute offset in
+the file for the string table. This allows the string table to share other
----------------
If your base_offset is 64 bits wide in contemplation of large files, why is strtab_offset only 32 bits?

================
Comment at: lib/DebugInfo/GSYM/README.md:99
+`Header.strtab_size` fields. The `Header.strtab_offset` is an absolute offset in
+the file for the string table. This allows the string table to share other
+string table sections that might exist in the file when the GSYM data is a
----------------
“the file” = relative to what? The image header? The first section? The section loaded at the lowest address? A special magic section? The file on disk that hosts the image (which may be fat?)

================
Comment at: lib/DebugInfo/GSYM/README.md:112
+```
+It starts with a 32 bit size for the address range of the functiopn and is followed by the 32 bit string table offset for the name of the function. The size of the address range is important to encode as it stops address lookups from matching if the address is between two functions in some padding. This is followed by an array of address data information:
+```
----------------
functiopn

================
Comment at: lib/DebugInfo/GSYM/README.md:112
+```
+It starts with a 32 bit size for the address range of the functiopn and is followed by the 32 bit string table offset for the name of the function. The size of the address range is important to encode as it stops address lookups from matching if the address is between two functions in some padding. This is followed by an array of address data information:
+```
----------------
markmentovai wrote:
> functiopn
Is the name of the function encoded in decorated form?

================
Comment at: lib/DebugInfo/GSYM/README.md:116
+    uint32_t type;
+    uint32_t length;
+    uint8_t data[length];
----------------
If you store a number-of-address-data-elements field in AddressInfo, you can save four bytes per AddressInfo by not needing to carry the unnecessary “length” field for EndOfList AddressData entries.

================
Comment at: lib/DebugInfo/GSYM/README.md:120
+```
+The address data starts with a 32 bit type, followed by a 32 bit length, followed by an array of bytes that encode each specify kind of data.
+The `AddressData.type` is an enumeration value:
----------------
specify → specific

================
Comment at: lib/DebugInfo/GSYM/README.md:129
+```
+The `AddressInfo.data[]` is encoded as a vector of AddressData structs that is terminated by a `AddressData` struct whose type is set to `InfoType.EndOfList`. This allows the GSYM file format the contain arbitrary data for any address range and allows us to expand the GSYM capabilities as we find more uses for it.
+
----------------
But this also means that you may wind up expressing things redundantly. Address ranges that apply to a single conceptual LineTableInfo and a single conceptual InlineInfo may not coincide. The definition here means that you’re going to have to split everything up at each boundary. You wind up, for example, having to express the same InlineInfo twice, just because there was a break in LineTableInfo.

This becomes more important as you add other InfoTypes, such as an InfoType for CFI, as Leonard pointed out would be necessary. Same goes for an InfoType that would allow recovery of function parameters and local variables.

================
Comment at: lib/DebugInfo/GSYM/README.md:133
+
+`InfoType::LineTableInfo` is a modified version of the DWARF line tables that efficiently stores line table information for each function. DWARF stores line table information for an entire source file and includes all functions. Having each function's line table encoded separately allows fewer pages to be touched when looking up the line entry for a specific address. The information is optional and can be omitted fo address data that is from a symbol or label where no line table information is available.
+
----------------
fo → for

================
Comment at: lib/DebugInfo/GSYM/README.md:133
+
+`InfoType::LineTableInfo` is a modified version of the DWARF line tables that efficiently stores line table information for each function. DWARF stores line table information for an entire source file and includes all functions. Having each function's line table encoded separately allows fewer pages to be touched when looking up the line entry for a specific address. The information is optional and can be omitted fo address data that is from a symbol or label where no line table information is available.
+
----------------
markmentovai wrote:
> fo → for
Can you describe the format, or at least the modifications to DWARF?

================
Comment at: lib/DebugInfo/GSYM/README.md:135
+
+`InfoType::InlineInfo` is a format that encodes inline call stacks. This information is optional and doesn't need to be included for each address. If the function has no inlined functions this data should not be included.
----------------
Same: can you describe the format?

https://reviews.llvm.org/D53379