[PATCH] D53379: GSYM symbolication format

Thu Jun 6 08:32:58 PDT 2019

clayborg marked 12 inline comments as done.
clayborg added a comment.

In D53379#1531777 <https://reviews.llvm.org/D53379#1531777>, @jakehehrlich wrote:

> This is just where I got tired today but I think I can recommend how to split this up so I could move faster and provide more useful high level review. Prior to splitting I'll keep chugging away for at least a bit each day.
>
> 1. Only add functionality for creating GSym in memory and associated unit tests (no reading from a file).
> 2. Add functionality for reading from a file. If easy enough to do we should ignore inline info in this patch to make it smaller. Add gsymutil in this change and add llvm-lit tests for gsymutil.
> 3. Add functionality for writing to a file.
> 4. Add functionality for reading from breakpad.
> 5. Add functionality for writing to breakpad.
> 6. Add inline info if it wasn't added in 2
> 7. Add functionality for integration into MC

I will work on splitting this patch as requested

> Also it isn't clear to me (again I haven't really gone though anything but the headers) what clear methods or many of the operator overloads are for. Removing unused versions of them would be helpful.

We can go through each thing in each individual patch as I submit them.

================
Comment at: include/llvm/DebugInfo/GSYM/Breakpad.h:20
+
+std::error_code convertBreakpadFileToGSYM(StringRef BreakpadPath,
+                                          StringRef GSYMPath);
----------------
jakehehrlich wrote:
> Its more typical to use Error instead of std::error_code. Error has some sharp edges to it because it forces you to check the error.
sounds good, will switch this over to llvm::Error

================
Comment at: include/llvm/DebugInfo/GSYM/DwarfTransformer.h:29
+public:
+  DwarfTransformer(raw_ostream &OS, uint32_t N = 0) : Log(OS), NumThreads(N) {
+    if (NumThreads == 0)
----------------
jakehehrlich wrote:
> I think I'd prefer this return errors via Error and generally use its interface to allow users to report their own errors. What's the motivation behind passing a log like this?
It was just how we did things for the Facebook code which was outside of llvm since it was a command line only tool. We can have the DwarfTransformer contain a list of errors and warnings and report those after the fact. Would that be better? So maybe have DwarfTransformer have a std::vector<llvm::Error> as a member variable and possibly std::vector<std::string> for warnings?:
```
std::mutex ErrorsWarningsMutex; // Allow mutli-threaded access to Errors and Warnings
std::vector<llvm::Error> Errors;
std::vector<std::string> Warnings;
```

================
Comment at: include/llvm/DebugInfo/GSYM/DwarfTransformer.h:34
+
+  std::error_code loadDwarf(const object::ObjectFile &Obj);
+  std::error_code loadDwarf(StringRef filename) {
----------------
jakehehrlich wrote:
> I think it would be more llvm-ish to use Error here but you should be able to swap between the two.
> 
> Also I think having methods that finish constructing isn't ideal. Ideally this would happen in the constructor but since we have to handle errors perhaps having a private "incomplete" constructor like this one, make these methods private, and then provide a static function that returns an Expected<DwarfTransformer> by first calling the incomplete constructor and then completing the object using these methods.
llvm::Error is fine. Also fine to move things around as needed and uses static creation methods.

================
Comment at: include/llvm/DebugInfo/GSYM/DwarfTransformer.h:42
+  std::error_code loadSymbolTable(const object::ObjectFile &Obj);
+  std::error_code loadSymbolTable(StringRef filename) {
+    if (auto Binary = getObjectFile(filename))
----------------
jakehehrlich wrote:
> Perhaps this is part of the reason for using these methods but it seems really inefficient to reload the file when loadDwarf needs to be called anyway. Is there a use case for that?
Some of this complexity I believe came from the different Facebook internal sources that had sharding built in. I didn't want to add sharding (break up one GSYM file into multiple parts) in the first check-in so some of this is left over from that. Also some of the work was done by a non llvm person after I checked in my sources. As you said before, many of these functions should be private and or could be merged into a single function.

================
Comment at: include/llvm/DebugInfo/GSYM/FileEntry.h:23
+struct FileEntry {
+  uint32_t Dir = 0;  // String table offset in the string table
+  uint32_t Base = 0; // String table offset in the string table
----------------
jakehehrlich wrote:
> How do we want to handle endianness? Are we just assuming that we'll only be working on the target system? Is it always little endian?
> 
> The standard thing to do would be to use https://llvm.org/doxygen/Endian_8h_source.html and use packed_endian_specific_integral. This has been extremely successful in llvm.
If we are going to encode into object files, we need to have the magic value in the header tell us the byte order IMHO. We really want it to be the same as the system that will use since we want to mmap the file into memory and use it as efficiently as possible. Not sure how well things would perform if we forced an byte order on the file using things like llvm::support::ulittle32_t.

================
Comment at: include/llvm/DebugInfo/GSYM/FileWriter.h:22
+class FileWriter {
+  std::ostream &OS;
+
----------------
jakehehrlich wrote:
> Weather raw_ostream or something like FileOutputBuffer is used varies across LLVM. I generally prefer using FileOutputBuffer for binary output. FileOutputBuffer has the unfortunate problem that it doesn't  have an abstraction that lets you choose between using MemoryBuffer or other such things that do have abstractions otherwise this would be a fairly simple choice. Also if you want to stream output for memory reasons you might prefer to use a raw_ostream.
> 
> The general pattern I've observed for overcoming the abstraction short comings of FileOutputBuffer is to make 'write' methods accept a uint8_t* instead. The consequence is that you often need a reinterpret_cast.
> 
> Is there a reason an ostream was used here instead of a raw_ostream?
No reason. Happy to switch. Just trying to avoid the ASMWriter as it requires so much of llvm (targets and more) to be loaded and made making a 32 bit or 64 bit file with a byte order much harder as you had to match it up with a target just to get those values to match.

================
Comment at: include/llvm/DebugInfo/GSYM/GsymCreator.h:48
+  uint32_t insertString(StringRef S) {
+    std::lock_guard<std::mutex> Guard(Mutex);
+    return StrTab.insert(S.str());
----------------
jakehehrlich wrote:
> Can you comment on the expected paralell use here? I saw above that you expect to use multiple threads to create the gsym data but it isn't clear to me that this sort of thing makes sense if the threads are going to be under constant contention. Will many threads by calling these methods rapidly or will each thread do a bunch of work between calls? I'd expect the former which makes me skeptical. Have you benchmarked this?
Doing each compile unit in DWARF separately does speed things up when we tested really large binaries here at Facebook. Happy to try it both ways to make sure. But parsing DWARF is generally grabbing a DIE, getting its address ranges, if it has valid ranges, then parse the line table entries that are only the line entries for the function itself, create the inline info, go onto next stage. The string population happens after we get the function ranges if they are valid, and during the line table parsing. We can easily cache a DWARF file index to GSYM file index if that isn't already being done since the first time we add a file from DWARF to GSYM we will need to unique the directory and basename strings, but we should only need to do that once per file index for compile unit line table file. So I believe this will work out as there is plenty of work to do between

================
Comment at: include/llvm/DebugInfo/GSYM/GsymCreator.h:74
+    std::lock_guard<std::mutex> Guard(Mutex);
+    std::sort(Funcs.begin(), Funcs.end());
+  }
----------------
mgrang wrote:
> Please use range based llvm::sort instead of std::sort.
> ```
> llvm::sort(Funcs);
> ```
> 
> See https://llvm.org/docs/CodingStandards.html#beware-of-non-deterministic-sorting-order-of-equal-elements for more details.
will do!

================
Comment at: include/llvm/DebugInfo/GSYM/GsymReader.h:64
+
+  std::error_code openFile(StringRef Filename);
+  void copyBuffer(StringRef Bytes);
----------------
jakehehrlich wrote:
> Some comment about the 'loadFoo' things above. I'd expect there to be a create function that returns an Expected<GsymReader>
will do

================
Comment at: include/llvm/DebugInfo/GSYM/GsymReader.h:68
+  const Header *getHeader() const { return GSYMHeader; }
+  void dump(llvm::raw_ostream &OS, bool Verbose) const;
+  // Dump any address info with matching name
----------------
jakehehrlich wrote:
> comment what Verbose does, it isn't clear to me.
will do

================
Comment at: include/llvm/DebugInfo/GSYM/Range.h:34
+  uint64_t startAddress() const { return Start; }
+  uint64_t endAddress() const { return End; }
+  void clear() {
----------------
I like them in case I switch the contents to be "uint64_t Start; uint64_t Size;". Just allows less code changes if we switch this around. I know it is already a struct, so if I do this, this should be a class where "Start" and "End" are private.

================
Comment at: include/llvm/DebugInfo/GSYM/StringTableCreator.h:20
+namespace gsym {
+class StringTableCreator {
+  StringMap<uint32_t> Strings;
----------------
jakehehrlich wrote:
> I almost commented on this issue during the meeting. The standard way these are built is using StringTableBuilder in MC. It uses a finalization technique that makes things a bit more difficult because you have to request the indexes *after* finalization. This enables the standard string table compression technique for shared suffixes.
> 
> Switching to that might however cause your interface problems. It would be nice to switch but this might require a large architectural switch. It seems a shame that a chosen interface would overrule a filesize optimization however.
Yeah, I believe I tried to avoid the MC layer as is required llvm targets to be available and you had to pick an architecture to match the address byte size and byte ordering. Happy to use other things from LLVM where possible and not to large of a pain. But not getting an offsets right away seems like a shame and would change a lot of things to require fixups before they were emitted (all function names in FunctionInfo, directories and basenames in FileEntry object, inline function infos) etc. And we might need a special class still if we ever emit into an exising object file because we would want to be able to reuse and strings in .debug_str or other ELF string tables.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D53379/new/

https://reviews.llvm.org/D53379