[PATCH] D53379: GSYM symbolication format

Tue Feb 26 08:30:38 PST 2019

clayborg added a comment.
Herald added a subscriber: jdoerfert.

In D53379#1376627 <https://reviews.llvm.org/D53379#1376627>, @echristo wrote:

> Hi Greg,
>
> I've had a lot of time to review this (thanks for that) and do apologize for taking so long. I have a couple of concerns about this so bear with me and let's see where we can get:
>
> a) This looks like it's a standalone tool that's being added to llvm, but that really doesn't involve anything coming out of llvm right now?

It has a few tools:

- dwarf2gsym
- bpad2gsym

It also has all of the details for parsing and creating the format.

dwarf2gsym uses the LLVM DWARF parser to parse DWARF and convert it to the GSYM format, so a large part of LLVM is being used.

bpad2gsym converts the textual breakpad format to GSYM. Many servers out there use very large text files in the breakpad format (breakpad is a Google project) for symbolication, which wastes a ton of CPU time because the file format is a big blob of text. Since breakpad and crashpad want to adopt this format, LLVM seemed like a good place to put it so that these Google teams can adopt it. They already have a DWARF to breakpad conversion tool out there somewhere. That tool might have its own DWARF parser, and it seems like a waste not to share the very nice LLVM DWARF parser, which keeps up with the standard better. We had breakpad users at Facebook who had to fix DWARF parsing bugs when DWARF moved to DWARF4 recently, and I was surprised to find a tool that had its own DWARF parser. So sharing LLVM technologies seemed to make more sense, seeing as Google folks own breakpad and crashpad.

> b) This seems to be largely a binary encoding of a breakpad file and not a new debug format?

It is a more efficiently encoded symbolication format for address-to-information lookups (source file and line, and inline call stack). It isn't a replacement for DWARF, but it can be a complete replacement for any users of -gline-tables-only. It is designed to allow crash reporting tools and servers that handle millions of symbolication requests to symbolicate many orders of magnitude faster than they can with DWARF. It is designed to be mmap'ed, shared by one or more processes, and used as is, with no setup and no sorting of the DWARF "accelerator" tables (which are random indexes). Unlike DWARF, we can mmap this in and use it directly, much like the Apple accelerator tables.

All information for each function is in a single blob of data, whereas in DWARF it is scattered across the .debug_info, .debug_line, .debug_abbrev, .debug_ranges, and other sections, which makes symbolication with DWARF very expensive (file cache and performance). With DWARF, you must either parse all of .debug_aranges and sort its random address list before checking it for the address, or linearly search all CUs for their DW_AT_ranges. Then, once you find a CU, you must parse ALL of the DWARF for that CU until you find the correct function info, then parse the line table for the entire CU and pull out just the bits you care about for the function.
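To make the "sorted table plus one blob per function" idea concrete, here is a rough sketch of that kind of lookup. The layout below (a count, a sorted array of start addresses, fixed-size info records, then a string table) is a made-up illustration and is NOT the actual GSYM file format; `build_table` and `lookup` are hypothetical names. The point is only that the lookup reads straight out of the buffer with a binary search and needs no parsing or sorting step first.

```python
import struct

# Hypothetical layout for illustration only (not the real GSYM format):
#   uint32 count
#   count  x uint64            sorted function start addresses
#   count  x (uint32, uint32)  function size, name offset in string table
#   string table of NUL-terminated names
HEADER = struct.Struct("<I")
ADDR = struct.Struct("<Q")
INFO = struct.Struct("<II")


def build_table(funcs):
    """Pack (start, size, name) tuples into one contiguous bytes blob."""
    funcs = sorted(funcs)
    strtab = bytearray(b"\0")
    infos = []
    for _, size, name in funcs:
        infos.append((size, len(strtab)))
        strtab += name.encode() + b"\0"
    blob = bytearray(HEADER.pack(len(funcs)))
    for start, _, _ in funcs:
        blob += ADDR.pack(start)
    for size, name_off in infos:
        blob += INFO.pack(size, name_off)
    return bytes(blob + strtab)


def lookup(blob, addr):
    """Return the name of the function containing addr, or None."""
    (count,) = HEADER.unpack_from(blob, 0)
    addrs_off = HEADER.size
    infos_off = addrs_off + count * ADDR.size
    strtab_off = infos_off + count * INFO.size
    # Binary search for the first start address greater than addr.
    lo, hi = 0, count
    while lo < hi:
        mid = (lo + hi) // 2
        (start,) = ADDR.unpack_from(blob, addrs_off + mid * ADDR.size)
        if start <= addr:
            lo = mid + 1
        else:
            hi = mid
    if lo == 0:
        return None  # addr is below the first function
    idx = lo - 1
    (start,) = ADDR.unpack_from(blob, addrs_off + idx * ADDR.size)
    size, name_off = INFO.unpack_from(blob, infos_off + idx * INFO.size)
    if addr >= start + size:
        return None  # addr falls in a gap between functions
    begin = strtab_off + name_off
    end = blob.find(b"\0", begin)
    return blob[begin:end].decode()
```

Because `lookup` only uses `struct.unpack_from`, `find`, and slicing, the same function should work unchanged on an `mmap.mmap` object opened over a file containing the blob, which is the usage the format is aiming for.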

> c) What are the future plans for this code?

A few things I can think of:

- Have the compiler add the GSYM data as a section when compiling and linking. The GSYM data can share the string table from .debug_str or the symbol table, so this information can be added alongside DWARF and other debug info to get much better symbolication performance.
- Replace -gline-tables-only with this for better symbolication performance.
- Add unwind information to the address info so that symbolication tools doing in-process stack backtraces, or external tools, can backtrace correctly when given async unwind info for a function. We can also specify whether the information is asynchronous: if it is, we can trust it to unwind first frames; if it isn't, we only use it to unwind frames that are not the first frame and do not follow a sigtramp frame.
- Add DWARF DIE offset info to the address info for each address to allow this to be used as a better address accelerator table. Right now DWARF .debug_aranges are just random addresses to CU offset (not DIE offset).
- Use this format more in profiling tools that might need to backtrace or gather data. We saved thousands of machines by switching to GSYM here at Facebook for symbolication and for real time CPU profiling data.
- Possibly get this accepted into the DWARF format as a replacement for .debug_aranges?

> Mostly I'm trying to figure out what we want to do with this in llvm. It seems like something that would be good for the breakpad project mostly?

From the above I hope you can see where we want to go with this. The format applies to anyone wanting to do very quick address to data lookups: mmap it in, use the tables, get a better line table encoding than DWARF (we have a single file table where DWARF has one per source file), and get inline call stack unwinding in cases where you want to symbolicate.
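As a rough illustration of the single file table point, the sketch below interns every source path once and has line entries refer to it by index. `FileTable` and the tuple shape of the line entries are hypothetical names made up for this example, not the GSYM data structures.

```python
class FileTable:
    """One shared table of source paths; entries refer to paths by index."""

    def __init__(self):
        self._index = {}  # path -> index into self.paths
        self.paths = []

    def intern(self, path):
        """Return the index for path, adding it only on first use."""
        idx = self._index.get(path)
        if idx is None:
            idx = len(self.paths)
            self._index[path] = idx
            self.paths.append(path)
        return idx


# Hypothetical line entries from two functions: (address, path, line).
files = FileTable()
raw_entries = [
    (0x1000, "/src/a.cpp", 10),
    (0x1008, "/src/util.h", 42),
    (0x2000, "/src/b.cpp", 7),
    (0x2004, "/src/util.h", 42),
]
line_entries = [(addr, files.intern(path), line) for addr, path, line in raw_entries]
```

Here `/src/util.h` is stored once even though two functions reference it, so each line entry only carries a small integer for its file.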

My idea behind putting it into LLVM is to allow any compiler that uses LLVM to add this accelerator table as a section in its .o files and linked executables, or to make standalone GSYM files for server symbolication. The dwarf2gsym conversion tool leverages the LLVM DWARF parser to convert DWARF to this format for people who aren't able to build it into their .o files or binaries at compile/link time, but I would love to see this format added to .o files and symbol files at build time.

Let me know what you think. I would be happy to meet at Google for lunch to discuss further, or to have any folks come up to Facebook.

https://reviews.llvm.org/D53379