[PATCH] D53379: GSYM symbolication format

Mon Oct 22 13:24:56 PDT 2018

lemo added a comment.

+Eric Christopher for a DWARF expert's perspective

Hi Greg, this is great stuff. I’m going to take a closer look at the implementation; in the meantime, here are a few high-level comments:

Important requirements
----------------------

These observations describe areas which would prevent or significantly limit the use of the GSYM format for crash processing. I’m not suggesting that these are blockers for checking in the initial implementation, although some, if we all agree to incorporate them, might be harder to retrofit later on.

1. Full debug info fidelity: this is one of our key requirements for the Breakpad processor replacement. For example, our developers have been asking for things like the ability to extract the values of arguments and locals. It would be fine if we can use GSYM as a first-level index which would enable us to pick a subset of the full debug information (having to fetch the full ELF/DWARF/PDB files would defeat the value of using GSYM)
  1. CFI (must have, w/o this we won’t be able to do even basic stack walks, right?)
  2. “Accelerator” indexes pointing to subsets of the real debug information
  3. Encoding the full debug info (does this even make sense since we’d end up with yet another debug info format)

2. First class support for CodeView/PDBs: as far as I can tell, the proposed GSYM format doesn’t prevent a PDB importer (and I think it’s worth mentioning this in the description). But if we’re talking about adding support for full debug information, this might need more consideration.

3. Built-in versioning: I think it’s critical to allow the format to evolve in a backward compatible way.
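
To make the versioning point concrete, here is a minimal sketch of a versioned header check. Everything in it is an assumption for illustration: the `Header` fields, the magic value, and `parseHeader` are hypothetical and are not taken from the patch. It also reads fields in host byte order, which a real reader would have to pin down.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>

// Hypothetical header layout; field names and values are illustrative only.
struct Header {
  uint32_t magic;       // identifies the file as GSYM
  uint16_t version;     // bumped on incompatible layout changes
  uint16_t header_size; // lets older readers skip fields appended later
};

constexpr uint32_t kMagic = 0x4753594d;      // assumed value, spells "GSYM"
constexpr uint16_t kMaxSupportedVersion = 1; // newest layout this reader knows

// Returns the header if the buffer holds a layout this reader understands.
std::optional<Header> parseHeader(const uint8_t *data, size_t size) {
  assert(data != nullptr);
  if (size < sizeof(Header))
    return std::nullopt;
  Header h;
  std::memcpy(&h, data, sizeof(h)); // host byte order, for the sketch only
  if (h.magic != kMagic)
    return std::nullopt;
  if (h.version == 0 || h.version > kMaxSupportedVersion)
    return std::nullopt; // written by a newer, unknown layout
  if (h.header_size < sizeof(Header) || h.header_size > size)
    return std::nullopt;
  return h;
}
```

Carrying an explicit `header_size` is one possible way to let fields be appended without a version bump; the version check is reserved for changes an old reader cannot safely ignore.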

Miscellaneous Observations, Ideas and Questions
-----------------------------------------------

1. First class support for sharding: for horizontal scalability reasons we’d want to pull in only the information required to process a particular minidump
2. What do you think of using a hierarchical address data structure? (e.g. B-Trees. This goes hand in hand with sharding)
3. It would be nice to mention that COFF input is also an important use case.
4. Why not use ELF format as the top container/file format?

================
Comment at: lib/DebugInfo/GSYM/README.md:6
+## Why use GSYM?
+GSYM files are up to 7x smaller than DWARF files and up to 3x smaller than Breakpad files. The file format is designed to touch as few pages of the file as possible while doing address lookups. GSYM files can be mmap'ed into a process as shared memory allowing multiple processes on a symbolication server to share loaded GSYM pages. The file format includes inline call stack information and can help turn a single address lookup into multiple stack frames that walk the inlined call stack back to the concrete function that invoked these functions.
+
----------------
just curious, how do you unwind stacks w/o CFI information?

================
Comment at: lib/DebugInfo/GSYM/README.md:59
+### Address Data Offsets Table
+The address data offsets table immediately follows the address table and consists of `Header.num_addrs` 32 bit file offsets: one for each address in the address table. The offsets in this table are the absolute file offset to the address data for each address in the address table. Keeping this data separate from the address table helps to reduce the number of pages that are touched when address lookups occur on a GSYM file.
+
----------------
Why absolute offsets as opposed to relative offsets into the data section? 
1. at the very least it makes it easier to manipulate the file format
2. it may also enable short offsets?
3. also consistent with strings offsets
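
As a rough illustration of the relative-offset idea, assuming the file records where the address data section starts: `OffsetsTable`, `data_section_start`, and both functions below are hypothetical names, not part of the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical in-memory view of the offsets table; names are not from the patch.
struct OffsetsTable {
  uint64_t data_section_start;       // absolute file offset of the address data
  std::vector<uint32_t> rel_offsets; // one per address, relative to the section start
};

// Resolve the absolute file offset of the data for one address table entry.
uint64_t addressDataFileOffset(const OffsetsTable &t, size_t addr_index) {
  assert(addr_index < t.rel_offsets.size());
  return t.data_section_start + t.rel_offsets[addr_index];
}

// Moving the data section touches one field instead of every table entry.
void relocateDataSection(OffsetsTable &t, uint64_t new_start) {
  t.data_section_start = new_start;
}
```

Because the stored values start at zero rather than at the section's file offset, a narrower integer type might be enough when the data section is small, which is the "short offsets" possibility in point 2.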

================
Comment at: lib/DebugInfo/GSYM/README.md:83
+
+### String Table
+The string table follows the file table in stand alone GSYM files and contains all strings for everything contained in the GSYM file. Any string data should be added to the string table and any references to strings inside GSYM information must be stored as 32 bit string table offsets into this string table.
----------------
Have you considered sorting the strings + prefix compression? It's an easy way to compress the strings and would avoid the need for special-casing things like the directory / filename split in the FileInfo
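
For reference, sorting plus prefix compression (often called front coding) can be sketched as follows. This is an illustration of the general technique only; `FrontCoded`, `frontEncode`, and `frontDecode` are hypothetical names, and nothing here reflects how the patch lays out its string table.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// One front-coded entry: how many leading bytes are shared with the
// previous string in sorted order, plus the bytes that differ.
struct FrontCoded {
  uint32_t shared;
  std::string suffix;
};

// Sort the strings, then store each one as (shared prefix length, suffix).
std::vector<FrontCoded> frontEncode(std::vector<std::string> strings) {
  std::sort(strings.begin(), strings.end());
  std::vector<FrontCoded> out;
  std::string prev;
  for (const std::string &s : strings) {
    size_t n = 0;
    const size_t limit = std::min(prev.size(), s.size());
    while (n < limit && prev[n] == s[n])
      ++n;
    out.push_back({static_cast<uint32_t>(n), s.substr(n)});
    prev = s;
  }
  return out;
}

// Rebuild the sorted strings by replaying the entries in order.
std::vector<std::string> frontDecode(const std::vector<FrontCoded> &entries) {
  std::vector<std::string> out;
  std::string prev;
  for (const FrontCoded &e : entries) {
    assert(e.shared <= prev.size());
    prev = prev.substr(0, e.shared) + e.suffix;
    out.push_back(prev);
  }
  return out;
}
```

Paths that share a directory collapse to a short suffix, so a separate directory / filename split buys little. The cost is that decoding one string means replaying entries from the start, so a format that needs random access by offset would probably have to restart the prefix chain at regular intervals.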

================
Comment at: lib/DebugInfo/GSYM/README.md:103
+```
+The address data starts with a 32 bit type, followed by a 32 bit length, followed by an array of bytes that encode each specify kind of data.
+The `AddressData.type` is an enumeration value:
----------------
nit: some types of data may have an implicit payload size, so the `length` seems wasteful (I'd put it as a prefix in the type-specific payload instead)
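
A small sketch of that alternative, under loud assumptions: only `EndOfList` appears in the patch, while `FixedExample`, `VariableExample`, their sizes, and `entrySize` are made up for illustration. Fields are read in host byte order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Only EndOfList comes from the patch; the other two kinds are hypothetical.
enum class Kind : uint32_t {
  EndOfList = 0,       // no payload at all
  FixedExample = 1,    // assumed to always carry an 8-byte payload
  VariableExample = 2, // payload begins with its own 32-bit length prefix
};

// Returns the bytes occupied by the entry at `p` (type, optional length
// prefix, payload), or 0 if the entry is malformed or truncated.
size_t entrySize(const uint8_t *p, size_t avail) {
  assert(p != nullptr);
  if (avail < 4)
    return 0;
  uint32_t raw;
  std::memcpy(&raw, p, 4);
  size_t used = 4;
  uint32_t payload = 0;
  switch (static_cast<Kind>(raw)) {
  case Kind::EndOfList:
    payload = 0; // size implied by the type
    break;
  case Kind::FixedExample:
    payload = 8; // size implied by the type
    break;
  case Kind::VariableExample:
    if (avail < 8)
      return 0;
    std::memcpy(&payload, p + 4, 4); // length lives inside the payload
    used += 4;
    break;
  default:
    return 0; // unknown type: cannot be skipped without a length
  }
  if (payload > avail - used)
    return 0;
  return used + payload;
}
```

The `default` case shows the trade-off: without a generic length field, a reader cannot skip entry types it does not recognize, which matters if the format is expected to grow new types.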

================
Comment at: lib/DebugInfo/GSYM/README.md:106
+```
+enum class InfoType {
+   EndOfList = 0u,
----------------
what about CFI, prologue/epilogue information?

Repository:
  rL LLVM

https://reviews.llvm.org/D53379