[PATCH] D53379: GSYM symbolication format

Fri Oct 26 14:59:31 PDT 2018

clayborg added a comment.

In https://reviews.llvm.org/D53379#1271305, @lemo wrote:

> +Eric Christopher for a DWARF expert's perspective
>
> Hi Greg, this is great stuff. I’m going to take a closer look at the implementation; in the meantime, here are a few high-level comments:
>
> Important requirements
>
> These observations describe areas which would prevent or significantly limit the use of the GSYM format for crash processing. I’m not suggesting that these are blockers for checking in the initial implementation, although some, if we all agree to incorporate them, might be harder to retrofit later on.
>
> - Full debug info fidelity: this is one of our key requirements for the Breakpad processor replacement. For example, our developers have been asking for things like the ability to extract the values of arguments and locals. It would be fine if we can use GSYM as a first-level index which would enable us to pick a subset of the full debug information (having to fetch the full ELF/DWARF/PDB files would defeat the value of using GSYM)
>   - CFI (must have, w/o this we won’t be able to do even basic stack walks, right?)
>   - “Accelerator” indexes pointing to subsets of the real debug information
>   - Encoding the full debug info (does this even make sense since we’d end up with yet another debug info format)

DWARF is good for full debug info, so I see no need to reinvent it here. We can put the GSYM data into a section and also keep any DWARF we need for full debug info. We could also embed DWARF into the address info data if needed. There are many options for meeting the debug info requirements.

I see CFI as the next thing to get added to GSYM. Let's work out the format we want. For me, I want the following from CFI:

- Async unwind info so we can unwind at any PC
- If we can get the above info, we at least know if the unwind info is only valid at call sites
- No need to re-invent here. We can just point to existing unwind info in the file (like say in EH frame), or we can inline existing formats (standard EH frame, .debug_frame, compact unwind from ARM, etc)

For accelerator tables, this file format is itself an address accelerator table. For other info, we can add new InfoType enumeration values that can point into existing DWARF (.debug_info, .debug_frame, and more).
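
To make "address accelerator table" concrete, here is a minimal sketch of the idea: a sorted list of function start addresses that is binary-searched to find the entry covering a PC, with each entry optionally carrying typed payloads that point elsewhere. The `InfoType` values, the `FunctionInfo` record, and the `AddressTable` class are all hypothetical names chosen for illustration; they are not the layout used in the patch.

```python
import bisect
from dataclasses import dataclass, field
from enum import IntEnum


class InfoType(IntEnum):
    # Hypothetical values for illustration, not taken from the patch.
    END_OF_LIST = 0
    LINE_TABLE = 1
    DWARF_DIE_OFFSET = 2  # would be an offset into .debug_info
    UNWIND_OFFSET = 3     # would be an offset into .eh_frame or .debug_frame


@dataclass
class FunctionInfo:
    start: int
    size: int
    name: str
    infos: dict = field(default_factory=dict)  # InfoType -> payload


class AddressTable:
    def __init__(self, functions):
        # Keep entries sorted by start address so lookups can binary search.
        self.functions = sorted(functions, key=lambda f: f.start)
        self.starts = [f.start for f in self.functions]

    def lookup(self, pc):
        # Find the last entry whose start address is <= pc.
        i = bisect.bisect_right(self.starts, pc) - 1
        if i < 0:
            return None
        f = self.functions[i]
        # The entry only matches if pc falls inside [start, start + size).
        return f if pc < f.start + f.size else None
```

A lookup for a PC that falls in a gap between functions returns `None`, which is what a symbolicator would need in order to report an unknown frame rather than attributing it to the preceding function.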

> - First class support for CodeView/PDBs: as far as I can tell, the proposed GSYM format doesn’t prevent a PDB importer (and I think it’s worth mentioning this in the description). But if we’re talking about adding support for full debug information this might need more consideration.

Yes, we should add a pdb2Dwarf converter like we do with DWARF. Very easy to add for sure.

> - Built-in versioning: I think it’s critical to allow the format to evolve in a backward compatible way.

The header has a version. We can include a version in each InfoType as well.
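
As a sketch of how a file-level version plus a per-InfoType version could let the format evolve, the reader below rejects file versions it does not know and skips individual records whose (type, version) pair it does not understand. The byte layout here (a 4-byte magic, 16-bit version fields, and a length-prefixed record) is an assumption made for the example and is not the layout in the patch.

```python
import struct

# Hypothetical layouts, little endian:
#   file header:   4-byte magic, u16 file version, u16 reserved
#   record header: u16 info type, u16 info version, u32 payload length
HEADER = struct.Struct("<4sHH")
RECORD = struct.Struct("<HHI")
MAGIC = b"GSYM"
SUPPORTED_FILE_VERSIONS = {1}


def parse_header(data):
    """Return the file version, or raise ValueError if it cannot be read."""
    if len(data) < HEADER.size:
        raise ValueError("truncated header")
    magic, version, _reserved = HEADER.unpack_from(data, 0)
    if magic != MAGIC:
        raise ValueError("bad magic")
    if version not in SUPPORTED_FILE_VERSIONS:
        raise ValueError(f"unsupported file version {version}")
    return version


def read_known_records(data, offset, known):
    """Collect records whose (type, version) is in `known`; skip the rest."""
    records = []
    while offset + RECORD.size <= len(data):
        info_type, info_version, length = RECORD.unpack_from(data, offset)
        offset += RECORD.size
        if info_type == 0:  # type 0 terminates the list in this sketch
            break
        if offset + length > len(data):
            raise ValueError("truncated record payload")
        payload = data[offset:offset + length]
        offset += length  # the length prefix lets us step over unknown records
        if (info_type, info_version) in known:
            records.append((info_type, info_version, payload))
    return records
```

The length prefix is what makes this backward compatible in the sketch: an older reader can step over a record type or version added later without understanding its contents.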

> Miscellaneous Observations, Ideas and Questions
> 
> - First class support for sharding: for horizontal scalability reasons we’d want to pull in only the information required to process a particular minidump

I removed the sharding (segmenting) that was in the initial version of the patch until we lock down all of the changes. I will re-add it as soon as the first check-in happens.

> - What do you think of using a hierarchical address data structure (e.g. B-trees)? This goes hand in hand with sharding.

Anything that makes the lookups faster is fine with me. It would be easy to embed a reference to another GSYM file within a GSYM file as a new InfoType!
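
One way to picture a GSYM file that refers to other GSYM files is a small top-level index mapping address ranges to shards, so that a crash processor only opens the shards covering the PCs in a given minidump. The sketch below assumes a hypothetical `ShardIndex` with `(start, end, shard_name)` entries and made-up file names; nothing here reflects an actual InfoType in the patch.

```python
import bisect


class ShardIndex:
    def __init__(self, shards):
        # shards: iterable of (start_addr, end_addr, shard_name), end exclusive.
        self.shards = sorted(shards)
        self.starts = [start for start, _end, _name in self.shards]

    def shard_for(self, pc):
        # Find the last shard whose start address is <= pc.
        i = bisect.bisect_right(self.starts, pc) - 1
        if i < 0:
            return None
        _start, end, name = self.shards[i]
        return name if pc < end else None


def shards_needed(index, pcs):
    """Return the set of shard names required to symbolicate `pcs`."""
    names = (index.shard_for(pc) for pc in pcs)
    return {name for name in names if name is not None}
```

With three shards and a stack whose PCs touch only two of them, `shards_needed` returns just those two names, and the third shard never has to be fetched.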

> - It would be nice to mention that COFF input is also an important use case.

Yeah, easy to use COFF with any of this.

> - Why not use ELF format as the top container/file format?

We can, but it is not required. GSYM can be a standalone file or a section within an ELF file. It is much quicker to map in a dedicated GSYM file and then just start parsing, IMHO, but we have the option for both. It works really well if we can point the GSYM string table at .debug_str so we can share strings. The string table offset is an absolute file offset so that it can share string tables with other sections. It can even share multiple string tables by pointing to the lower file offset for the first line table and giving a size large enough to span multiple areas of the file. We might need to bump the string table offset to 64 bits.
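
To illustrate why an absolute file offset allows string sharing, the sketch below resolves a string by adding a per-string offset to a file-relative string table offset, so the table can be any byte range of the containing file, including one that overlaps another section's strings. The function names and the field widths are assumptions for the example, not the patch's encoding.

```python
import struct


def read_cstr(file_bytes, strtab_offset, strtab_size, str_offset):
    """Read a NUL-terminated string at strtab_offset + str_offset."""
    if str_offset >= strtab_size:
        raise ValueError("string offset outside string table")
    begin = strtab_offset + str_offset
    # Never search past the declared table or past the end of the file.
    limit = min(strtab_offset + strtab_size, len(file_bytes))
    end = file_bytes.find(b"\0", begin, limit)
    if end < 0:
        raise ValueError("unterminated string")
    return file_bytes[begin:end].decode("utf-8")


def pack_strtab_offset(offset, wide=False):
    """Encode the table offset as u32, or as u64 when `wide` is set."""
    return struct.pack("<Q" if wide else "<I", offset)
```

A 32-bit field cannot hold a file offset of 4 GiB or more (`struct.pack("<I", 2**32)` raises `struct.error`), which is the case a 64-bit offset would cover.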

https://reviews.llvm.org/D53379