[llvm-dev] [RFC] Thoughts on a bitcode symbol table

Fri May 27 19:31:38 PDT 2016

Hi Rafael

Thanks for bringing this up.  libObject linking libCore is something I’ve been hoping someone could find a way to fix.

The plan as you’ve described sounds good to me.

One thing I had considered when I looked at the code was whether it would make sense to have a base class in BitReader which can just read a SymbolicIRFile.  In libObject, IRObjectFile inherits from SymbolFile as we only really want the symbols from it.  It would be interesting to see if BitReader could mirror this.  Then we could use the IR-less Symbolic BitReader from libObject to just crack the symbol table.

Anyway, not something we necessarily need immediately, but would be interesting to see if one day we can do more in BitReader without creating IR.  I think this is what you were alluding to when you said you shouldn’t need an LLVMContext.

Cheers,
Pete

> On May 27, 2016, at 8:48 AM, Rafael Espíndola via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> 
> This is about https://llvm.org/bugs/show_bug.cgi?id=27551.
> 
> Currently there is no easy way to get symbol information out of
> bitcode files. One has to read the module and mangle the names. This
> has a few problems:
> 
> * During LTO we have to create the Module earlier.
> * There is no convenient spot to store flags/summary.
> * Simpler tools like llvm-nm have massive dependencies because Object
> depends on MC to find asm defined symbols.
> 
> To fix this I think we need a symbol table. The desired properties are:
> 
> * Include the *final* name of symbols (_foo, not foo).
> * Not be compressed so that we can keep StringRefs to the names.
> * Be easy to parse without an LLVMContext.
> * Include names created by inline assembly.
> * Include other information a linker or nm would want: linkage,
> visibility, comdat.
> 
> The first question is: where should we store it? Some options I thought about:
> 
> * Use the existing support for putting bitcode in a section of a
> native file and use the file's symbol table.
> * Use a custom wrapper over the .bc
> * Encode it with records/blocks in the .bc
> 
> The first option would be a bit annoying as we are sure to want to
> represent more than the native files have. It is also a bit odd for
> cross compiling. Do we create a MachO when the bitcode is for darwin
> and an ELF when it is for Linux? It would also mean that llvm-as would
> depend on a library to create these files.
> 
> The second option is tempting for parsing simplicity, but introduces
> duplication as the names for regular global values would be stored
> twice (once mangled, once not). The symbol table would also use a
> string table, which is a concept I think would improve the .bc format.
> 
> So my current preference is for the last one. Encode the symbol table
> in the .bc. This means that lib/Object will depend on BitReader, but
> not more than that.
> 
> The next issue is what to do with .ll files. One option is to change
> nothing and have llvm-as parse module level inline asm to create symbol
> entries. That would work, but sounds odd. I think we need directives
> in the .ll so that symbols created or used by inline asm can be
> declared.
> 
> Yet another issue is how to handle a string table in .bc. The problem
> is not with the format, it is with StreamingMemoryObject. We have to
> keep the string table alive while the rest of the file is read, and
> the StreamingMemoryObject can reallocate the buffer.
> 
> I can think of two solutions:
> 
> * Drop StreamingMemoryObject. The one known user is PNaCl, and it is
> moving to Subzero, so it is not clear that this is still needed.
> 
> * Change the representation so that each read is required to be
> contiguous and not be freed. It would basically store a vector of
> std::pair<offset, char*> and we would make sure the string table is
> read as a blob in a single read.
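
To make the second option concrete, here is a rough sketch of a buffer in which every read gets its own allocation that is never moved or freed, so a view into the string table stays valid while later reads arrive. ChunkedBuffer, append and get are names made up for illustration; this is not the StreamingMemoryObject interface, and it keeps the size and an owning pointer per chunk rather than a bare std::pair<offset, char*>.

```cpp
#include <cstddef>
#include <cstring>
#include <memory>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical sketch only; not the real StreamingMemoryObject API.
class ChunkedBuffer {
  struct Chunk {
    std::size_t Offset;           // position of this chunk in the stream
    std::size_t Size;             // number of bytes in this chunk
    std::unique_ptr<char[]> Data; // separate allocation, so it never moves
  };
  std::vector<Chunk> Chunks;
  std::size_t End = 0; // stream offset one past the last byte read so far

public:
  // Record one read. The bytes are copied into their own allocation, which
  // keeps its address for the lifetime of the buffer even when the vector
  // of chunks itself grows.
  std::string_view append(const char *Bytes, std::size_t Size) {
    auto Data = std::make_unique<char[]>(Size);
    std::memcpy(Data.get(), Bytes, Size);
    std::string_view View(Data.get(), Size);
    Chunks.push_back({End, Size, std::move(Data)});
    End += Size;
    return View;
  }

  // Return [Offset, Offset + Size) if the range lies inside a single chunk,
  // otherwise an empty view. A range that straddles two reads is rejected.
  std::string_view get(std::size_t Offset, std::size_t Size) const {
    for (const Chunk &C : Chunks)
      if (Offset >= C.Offset && Size <= C.Size &&
          Offset - C.Offset <= C.Size - Size)
        return std::string_view(C.Data.get() + (Offset - C.Offset), Size);
    return {};
  }
};
```

With this shape, reading the string table as a blob in a single read means exactly one append call, and the view it returns can be handed out for as long as the buffer lives.
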
> 
> With all that sorted, I think the representation can be fairly simple:
> 
> * a top level record stores the string table as a single blob. This
> can be used for any string in the .bc, not just the symbol table.
> * a sub block contains the symbol table with one record per symbol. It
> would include an offset in the string table, the name size, the
> linkage, etc. Being a record makes it easy to extend.
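
As a rough illustration of the layout described above, the sketch below models the string table as one uncompressed blob and each symbol as a small record holding an offset and a size into it. All type names, field names and enum values here are hypothetical and do not reflect an actual bitcode encoding; std::string_view stands in for StringRef so the example compiles without LLVM. Note that nothing in it needs an LLVMContext.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical layout; names and enumerators are illustrative only.
enum class Linkage : std::uint8_t { External, Weak, Internal };
enum class Visibility : std::uint8_t { Default, Hidden, Protected };

struct SymbolRecord {
  std::uint32_t NameOffset; // where the final (mangled) name starts in the blob
  std::uint32_t NameSize;   // names are not null terminated
  Linkage Link;
  Visibility Vis;
};

class SymbolTable {
  std::string_view Strtab; // the string table blob, owned by the file buffer
  std::vector<SymbolRecord> Records;

public:
  SymbolTable(std::string_view Strtab, std::vector<SymbolRecord> Records)
      : Strtab(Strtab), Records(std::move(Records)) {}

  std::size_t size() const { return Records.size(); }
  const SymbolRecord &operator[](std::size_t I) const { return Records[I]; }

  // Return a view straight into the blob (no copy), or std::nullopt if the
  // record points outside the string table.
  std::optional<std::string_view> name(const SymbolRecord &R) const {
    if (R.NameOffset > Strtab.size() ||
        R.NameSize > Strtab.size() - R.NameOffset)
      return std::nullopt;
    return Strtab.substr(R.NameOffset, R.NameSize);
  }
};
```

Because the blob is not compressed, name() can return a view that points directly into the file buffer, and adding a field to SymbolRecord later does not disturb the string table.
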
> 
> Cheers,
> Rafael