[PATCH] D53379: GSYM symbolication format

Wed Jun 5 17:33:17 PDT 2019

jakehehrlich added a comment.

This is just where I got tired today but I think I can recommend how to split this up so I could move faster and provide more useful high level review. Prior to splitting I'll keep chugging away for at least a bit each day.

1. Only add functionality for creating GSym in memory and associated unit tests (no reading from a file).
2. Add functionality for reading from a file. If easy enough to do we should ignore inline info in this patch to make it smaller. Add gsymutil in this change and add llvm-lit tests for gsymutil.
3. Add functionality for writing to a file.
4. Add functionality for reading from breakpad.
5. Add functionality for writing to breakpad.
6. Add inline info if it wasn't added in 2
7. Add functionality for integration into MC

Also it isn't clear to me (again I haven't really gone though anything but the headers) what clear methods or many of the operator overloads are for. Removing unused versions of them would be helpful.

================
Comment at: include/llvm/DebugInfo/GSYM/Breakpad.h:20
+
+std::error_code convertBreakpadFileToGSYM(StringRef BreakpadPath,
+                                          StringRef GSYMPath);
----------------
Its more typical to use Error instead of std::error_code. Error has some sharp edges to it because it forces you to check the error.

================
Comment at: include/llvm/DebugInfo/GSYM/DwarfTransformer.h:29
+public:
+  DwarfTransformer(raw_ostream &OS, uint32_t N = 0) : Log(OS), NumThreads(N) {
+    if (NumThreads == 0)
----------------
I think I'd prefer this return errors via Error and generally use its interface to allow users to report their own errors. What's the motivation behind passing a log like this?

================
Comment at: include/llvm/DebugInfo/GSYM/DwarfTransformer.h:34
+
+  std::error_code loadDwarf(const object::ObjectFile &Obj);
+  std::error_code loadDwarf(StringRef filename) {
----------------
I think it would be more llvm-ish to use Error here but you should be able to swap between the two.

Also I think having methods that finish constructing isn't ideal. Ideally this would happen in the constructor but since we have to handle errors perhaps having a private "incomplete" constructor like this one, make these methods private, and then provide a static function that returns an Expected<DwarfTransformer> by first calling the incomplete constructor and then completing the object using these methods.

================
Comment at: include/llvm/DebugInfo/GSYM/DwarfTransformer.h:42
+  std::error_code loadSymbolTable(const object::ObjectFile &Obj);
+  std::error_code loadSymbolTable(StringRef filename) {
+    if (auto Binary = getObjectFile(filename))
----------------
Perhaps this is part of the reason for using these methods but it seems really inefficient to reload the file when loadDwarf needs to be called anyway. Is there a use case for that?

================
Comment at: include/llvm/DebugInfo/GSYM/FileEntry.h:23
+struct FileEntry {
+  uint32_t Dir = 0;  // String table offset in the string table
+  uint32_t Base = 0; // String table offset in the string table
----------------
How do we want to handle endianness? Are we just assuming that we'll only be working on the target system? Is it always little endian?

The standard thing to do would be to use https://llvm.org/doxygen/Endian_8h_source.html and use packed_endian_specific_integral. This has been extremely successful in llvm.

================
Comment at: include/llvm/DebugInfo/GSYM/FileEntry.h:46
+  static unsigned getHashValue(const gsym::FileEntry &Val) {
+    return Val.Dir * 37U + Val.Base * 37U;
+  }
----------------
nit: You could use llvm::hash_combine

================
Comment at: include/llvm/DebugInfo/GSYM/FileWriter.h:22
+class FileWriter {
+  std::ostream &OS;
+
----------------
Weather raw_ostream or something like FileOutputBuffer is used varies across LLVM. I generally prefer using FileOutputBuffer for binary output. FileOutputBuffer has the unfortunate problem that it doesn't  have an abstraction that lets you choose between using MemoryBuffer or other such things that do have abstractions otherwise this would be a fairly simple choice. Also if you want to stream output for memory reasons you might prefer to use a raw_ostream.

The general pattern I've observed for overcoming the abstraction short comings of FileOutputBuffer is to make 'write' methods accept a uint8_t* instead. The consequence is that you often need a reinterpret_cast.

Is there a reason an ostream was used here instead of a raw_ostream?

================
Comment at: include/llvm/DebugInfo/GSYM/GsymCreator.h:48
+  uint32_t insertString(StringRef S) {
+    std::lock_guard<std::mutex> Guard(Mutex);
+    return StrTab.insert(S.str());
----------------
Can you comment on the expected paralell use here? I saw above that you expect to use multiple threads to create the gsym data but it isn't clear to me that this sort of thing makes sense if the threads are going to be under constant contention. Will many threads by calling these methods rapidly or will each thread do a bunch of work between calls? I'd expect the former which makes me skeptical. Have you benchmarked this?

================
Comment at: include/llvm/DebugInfo/GSYM/GsymReader.h:64
+
+  std::error_code openFile(StringRef Filename);
+  void copyBuffer(StringRef Bytes);
----------------
Some comment about the 'loadFoo' things above. I'd expect there to be a create function that returns an Expected<GsymReader>

================
Comment at: include/llvm/DebugInfo/GSYM/GsymReader.h:68
+  const Header *getHeader() const { return GSYMHeader; }
+  void dump(llvm::raw_ostream &OS, bool Verbose) const;
+  // Dump any address info with matching name
----------------
comment what Verbose does, it isn't clear to me.

================
Comment at: include/llvm/DebugInfo/GSYM/Range.h:30-34
+  void setStartAddress(uint64_t Addr) { Start = Addr; }
+  void setEndAddress(uint64_t Addr) { End = Addr; }
+  void setSize(uint64_t Size) { End = Start + Size; }
+  uint64_t startAddress() const { return Start; }
+  uint64_t endAddress() const { return End; }
----------------
nit: These sorts of methods seem superfluous to me but don't worry about removing them.

================
Comment at: include/llvm/DebugInfo/GSYM/StringTableCreator.h:20
+namespace gsym {
+class StringTableCreator {
+  StringMap<uint32_t> Strings;
----------------
I almost commented on this issue during the meeting. The standard way these are built is using StringTableBuilder in MC. It uses a finalization technique that makes things a bit more difficult because you have to request the indexes *after* finalization. This enables the standard string table compression technique for shared suffixes.

Switching to that might however cause your interface problems. It would be nice to switch but this might require a large architectural switch. It seems a shame that a chosen interface would overrule a filesize optimization however.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D53379/new/

https://reviews.llvm.org/D53379