[llvm-dev] [RFC] Thoughts on a bitcode symbol table

Fri May 27 08:48:36 PDT 2016

This is about https://llvm.org/bugs/show_bug.cgi?id=27551.

Currently there is no easy way to get symbol information out of
bitcode files. One has to read the module and mangle the names. This
has a few problems:

* During LTO we have to create the Module earlier.
* There is no convenient spot to store flags/summary.
* Simpler tools like llvm-nm have massive dependencies because Object
depends on MC to find asm defined symbols.

To fix this I think we need a symbol table. The desired properties are

* Include the *final* name of symbols (_foo, not foo).
* Not be compressed so that we can keep StringRefs to the names.
* Be easy to parse without an LLVMContext.
* Include names created by inline assembly.
* Include other information a linker or nm would want: linkage,
visibility, comdat.
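
To make the "final name" property concrete, here is a minimal sketch of the
difference between an IR name and the name a native symbol table would show.
It models only the leading-underscore rule used on Mach-O and is not the
actual Mangler logic; ObjectFormat and finalName are hypothetical names made
up for this illustration.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical, reduced stand-in for the object formats that matter here.
enum class ObjectFormat { ELF, MachO };

// Returns the name as it would appear in the native symbol table.
// Assumption: only the leading-underscore rule is modeled; real mangling
// has more cases than this.
inline std::string finalName(std::string_view IRName, ObjectFormat F) {
  assert(!IRName.empty() && "symbols must have a name");
  std::string Out;
  if (F == ObjectFormat::MachO)
    Out.push_back('_'); // Mach-O prefixes global names with '_'
  Out.append(IRName.data(), IRName.size());
  return Out;
}
```

Storing the result of something like finalName in the symbol table is what
would let a reader skip creating a Module and mangling the names itself.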

The first question is: where should we store it? Some options I thought about:

* Use the existing support for putting bitcode in a section of a
native file and use the file's symbol table.
* Use a custom wrapper over the .bc
* Encode it with records/blocks in the .bc

The first option would be a bit annoying as we are sure to want to
represent more than the native files have. It is also a bit odd for
cross compiling. Do we create a MachO when the bitcode is for darwin
and an ELF when it is for Linux? It would also mean that llvm-as would
depend on a library to create these files.

The second option is tempting for parsing simplicity, but introduces
duplication as the names for regular global values would be stored
twice (once mangled, once not). The symbol table would also use a
string table, which is a concept I think would improve the .bc format.

So my current preference is for the last one. Encode the symbol table
in the .bc. This means that lib/Object will depend on BitReader, but
not more than that.

The next issue is what to do with .ll files. One option is to change
nothing and have llvm-as parse module level inline asm to create symbol
entries. That would work, but sounds odd. I think we need directives
in the .ll so that symbols created or used by inline asm can be
declared.

Yet another issue is how to handle a string table in .bc. The problem
is not with the format, it is with StreamingMemoryObject. We have to
keep the string table alive while the rest of the file is read, and
the StreamingMemoryObject can reallocate the buffer.

I can think of two solutions:

* Drop StreamingMemoryObject. The one known user is PNaCl and it is
moving to Subzero, so it is not clear if this is still needed.

* Change the representation so that each read is required to be
contiguous and not be freed. It would basically store a vector of
std::pair<offset, char*> and we would make sure the string table is
read as a blob in a single read.
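
A rough sketch of that second solution, assuming each read is kept as its
own allocation so that pointers handed out earlier stay valid while more of
the file arrives. ChunkedBuffer, append and get are hypothetical names, not
the existing StreamingMemoryObject interface, and the sketch stores a size
next to each (offset, pointer) pair to allow bounds checks.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical stand-in for a streaming buffer that never moves bytes it
// has already handed out.
class ChunkedBuffer {
  // File offset of the chunk, its size, and its bytes. Each chunk is a
  // separate heap allocation, so growing the vector moves the owning
  // pointers but not the bytes themselves.
  struct Chunk {
    size_t Offset;
    size_t Size;
    std::unique_ptr<char[]> Data;
  };
  std::vector<Chunk> Chunks;
  size_t End = 0; // file offset one past the last byte read so far

public:
  // Appends one contiguous read and returns a stable pointer to its bytes.
  const char *append(std::string_view Bytes) {
    auto Data = std::make_unique<char[]>(Bytes.size());
    std::copy(Bytes.begin(), Bytes.end(), Data.get());
    const char *P = Data.get();
    Chunks.push_back({End, Bytes.size(), std::move(Data)});
    End += Bytes.size();
    return P;
  }

  // Returns a view of [Offset, Offset + Size) if the range lies within a
  // single chunk, and an empty view otherwise.
  std::string_view get(size_t Offset, size_t Size) const {
    for (const Chunk &C : Chunks) {
      if (Offset < C.Offset || Offset - C.Offset > C.Size)
        continue;
      size_t Rel = Offset - C.Offset;
      if (Size <= C.Size - Rel)
        return {C.Data.get() + Rel, Size};
    }
    return {};
  }
};
```

With this shape a string table read in a single append() call can be
referenced for the lifetime of the buffer, which is the guarantee the symbol
table needs. A range that straddles two reads is deliberately rejected.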

With all that sorted, I think the representation can be fairly simple:

* a top level record stores the string table as a single blob. This
can be used for any string in the .bc, not just the symbol table.
* a sub block contains the symbol table with one record per symbol. It
would include an offset in the string table, the name size, the
linkage, etc. Being a record makes it easy to extend.
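
As an illustration only, the in-memory view of that layout could look
something like the sketch below. The field names, the Linkage values and the
addSymbol helper are assumptions made up for this example, and
std::string_view stands in for StringRef; the actual record encoding is not
decided here.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical linkage values; the real set would mirror what a linker
// or nm needs.
enum class Linkage : uint8_t { External, Weak, Internal };

// One entry per symbol. The name is not stored inline; it is an
// (offset, size) reference into the shared string table blob.
struct SymbolEntry {
  uint32_t NameOffset;
  uint32_t NameSize;
  Linkage Link;
};

struct SymbolTable {
  std::string_view Strings;         // uncompressed blob, owned elsewhere
  std::vector<SymbolEntry> Symbols; // decoded from the symbol table block

  // Returns a view into the blob: no copy, no LLVMContext, no Module.
  std::string_view name(const SymbolEntry &S) const {
    assert(S.NameOffset <= Strings.size() &&
           S.NameSize <= Strings.size() - S.NameOffset &&
           "symbol name must lie inside the string table");
    return Strings.substr(S.NameOffset, S.NameSize);
  }
};

// Writer side: appends Name to the blob and returns the entry that
// refers to it.
inline SymbolEntry addSymbol(std::string &Blob, std::string_view Name,
                             Linkage L) {
  SymbolEntry E{static_cast<uint32_t>(Blob.size()),
                static_cast<uint32_t>(Name.size()), L};
  Blob.append(Name.data(), Name.size());
  return E;
}
```

Because names are plain offsets into an uncompressed blob, a reader can hand
out views into the file buffer directly, and new per-symbol fields can be
added to the record without touching the string table.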

Cheers,
Rafael