[PATCH] D53379: GSYM symbolication format

Fri Oct 26 09:58:54 PDT 2018

clayborg added inline comments.

================
Comment at: lib/DebugInfo/GSYM/README.md:6
+## Why use GSYM?
+GSYM files are up to 7x smaller than DWARF files and up to 3x smaller than Breakpad files. The file format is designed to touch as few pages of the file as possible while doing address lookups. GSYM files can be mmap'ed into a process as shared memory allowing multiple processes on a symbolication server to share loaded GSYM pages. The file format includes inline call stack information and can help turn a single address lookup into multiple stack frames that walk the inlined call stack back to the concrete function that invoked these functions.
+
----------------
lemo wrote:
> just curious, how do you unwind stacks w/o CFI information?
We need to add unwind info at some point for sure. It is one of the payloads in the address info we can work on once this is in.

================
Comment at: lib/DebugInfo/GSYM/README.md:59
+### Address Data Offsets Table
+The address data offsets table immediately follows the address table and consists of `Header.num_addrs` 32 bit file offsets: one for each address in the address table. The offsets in this table are the absolute file offset to the address data for each address in the address table. Keeping this data separate from the address table helps to reduce the number of pages that are touched when address lookups occur on a GSYM file.
+
----------------
lemo wrote:
> Why absolute offsets as opposed to relative offsets into the data section? 
> 1. at very least it makes it easier to manipulate the file format
> 2. it may also enable short offsets?
> 3. also consistent with strings offsets
We need to be able to binary search this table for your address. If we use relative offsets, then we can't do that. The idea is to mmap this file and use the data as is with minimal setup.
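
As a minimal sketch of the lookup this layout is meant to support (not code from the patch): binary search the sorted address table, then use the resulting index to read the matching entry of the address data offsets table. `GsymTables` and `LookupAddressDataOffset` are hypothetical names, the addresses are assumed to be plain 64 bit values, and vectors stand in for pointers into the mmap'ed file.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical in-memory view of the two tables. In a real reader these
// would be pointers into the mmap'ed file rather than vectors.
struct GsymTables {
  std::vector<uint64_t> addrs;        // sorted function start addresses (assumed 64 bit)
  std::vector<uint32_t> data_offsets; // one absolute file offset per address
};

// Returns the absolute file offset of the address data for the entry with
// the greatest start address <= addr, or UINT32_MAX if addr lies below the
// first entry in the address table.
inline uint32_t LookupAddressDataOffset(const GsymTables &t, uint64_t addr) {
  assert(t.addrs.size() == t.data_offsets.size());
  // First entry strictly greater than addr; the entry before it is the match.
  auto it = std::upper_bound(t.addrs.begin(), t.addrs.end(), addr);
  if (it == t.addrs.begin())
    return UINT32_MAX;
  size_t index = static_cast<size_t>(it - t.addrs.begin()) - 1;
  return t.data_offsets[index];
}
```

Only the address table pages visited by the binary search and one page of the offsets table are touched before jumping to the address data itself.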

================
Comment at: lib/DebugInfo/GSYM/README.md:83
+
+### String Table
+The string table follows the file table in stand alone GSYM files and contains all strings for everything contained in the GSYM file. Any string data should be added to the string table and any references to strings inside GSYM information must be stored as 32 bit string table offsets into this string table.
----------------
lemo wrote:
> Have you considered sorting the strings + prefix compression? It's an easy way to compress the strings and would avoid the need for special hasing things like directory / filename split in the FileInfo
I haven't really done much optimization on paths other than splitting them into directory and filename so that file entries can share the strings. One thing we could do is allow the file table to refer to strings in the string table with an explicit length. That way we could store one long path, /a/b/c/d, and refer to "/a", "/a/b", "/a/b/c" and "/a/b/c/d" using the same string. I am open to ideas here. I kept it simple to start with.
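
To make the length idea concrete, here is a rough sketch of how such a reference might be resolved. It is only an illustration of the proposal, not part of the patch: `StrRef` and `Resolve` are hypothetical names, and the string table is modeled as a `std::string` holding NUL separated strings.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <string_view>

// Hypothetical reference into the string table: an offset plus an explicit
// length, so several references can share one stored path.
struct StrRef {
  uint32_t offset; // byte offset into the string table
  uint32_t length; // number of bytes to use, not necessarily up to the NUL
};

// Returns a view of the referenced bytes without copying them.
inline std::string_view Resolve(const std::string &strtab, StrRef ref) {
  assert(ref.offset <= strtab.size() &&
         ref.length <= strtab.size() - ref.offset);
  return std::string_view(strtab).substr(ref.offset, ref.length);
}
```

With "/a/b/c/d" stored once at offset 1, the references {1, 2}, {1, 4}, {1, 6} and {1, 8} would resolve to "/a", "/a/b", "/a/b/c" and "/a/b/c/d".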

================
Comment at: lib/DebugInfo/GSYM/README.md:103
+```
+The address data starts with a 32 bit type, followed by a 32 bit length, followed by an array of bytes that encode each specific kind of data.
+The `AddressData.type` is an enumeration value:
----------------
lemo wrote:
> nit: some types of data may have an implicit payload size so the `length` seems wasteful (I'd put as prefix in the type-specific payload instead)
I think having the length defined is essential to the format. It allows you to skip any data you don't care about without knowing what it contains.
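
A minimal sketch of what the length buys a reader, under assumptions the README excerpt does not confirm: entries are packed back to back as (32 bit type, 32 bit length, payload), values are host-endian, and there is no padding between entries. `ReadU32` and `FindInfo` are hypothetical helpers, not functions from the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Reads a host-endian 32 bit value (assumed encoding) without alignment issues.
inline uint32_t ReadU32(const uint8_t *p) {
  uint32_t v;
  std::memcpy(&v, p, sizeof(v));
  return v;
}

// Walks the (type, length, payload) entries starting at `offset` and returns
// the offset of the first payload whose type equals `wanted`. Returns
// SIZE_MAX if type 0 (EndOfList) or the end of the buffer is reached first.
inline size_t FindInfo(const std::vector<uint8_t> &buf, size_t offset,
                       uint32_t wanted) {
  assert(wanted != 0 && "type 0 is EndOfList");
  while (offset + 4 <= buf.size()) {
    uint32_t type = ReadU32(&buf[offset]);
    if (type == 0)
      return SIZE_MAX; // EndOfList
    if (offset + 8 > buf.size())
      return SIZE_MAX; // truncated header
    uint32_t length = ReadU32(&buf[offset + 4]);
    size_t payload = offset + 8;
    if (length > buf.size() - payload)
      return SIZE_MAX; // truncated payload
    if (type == wanted)
      return payload;
    // Unknown or unwanted type: skip its bytes without decoding them.
    offset = payload + length;
  }
  return SIZE_MAX;
}
```

A reader written against today's set of types can step over payloads added later, because skipping never depends on understanding the payload.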

================
Comment at: lib/DebugInfo/GSYM/README.md:106
+```
+enum class InfoType {
+   EndOfList = 0u,
----------------
lemo wrote:
> what about CFI, prologue/epilogue information?
This is designed to be extensible and I hope to see many more types of info added in the future. 

Repository:
  rL LLVM

https://reviews.llvm.org/D53379