[llvm-dev] RFC: Adding a string table to the bitcode format

Peter Collingbourne via llvm-dev llvm-dev at lists.llvm.org
Tue Apr 4 12:12:36 PDT 2017

On Mon, Apr 3, 2017 at 8:13 PM, Mehdi Amini <mehdi.amini at apple.com> wrote:

> On Apr 3, 2017, at 7:08 PM, Peter Collingbourne <peter at pcc.me.uk> wrote:
> Hi,
> As part of PR27551 I want to add a string table to the bitcode format to
> allow global value and comdat names to be shared with the proposed symbol
> table (and, as side effects, allow comdat names to be shared with value
> names, make bitcode files more compressible and make bitcode easier to
> parse). The format of the string table would be a top-level block
> containing a blob containing null-terminated strings [0] similar to the
> string table format used in most object files.
> I’m in favor of this, but note that currently strings can be encoded with
> less than 8 bits / char in some cases (there might be some size increase
> because of this).

Sure, but I think we need to make the right tradeoff between making data
more efficient to read and using fewer bits. In this case I think the right
tradeoff is clearly in favour of being efficient to read, because accessing
it is in the critical path of a consumer (i.e. a linker), and the part that
needs to be efficient to read is a relatively small part of the data in the
bitcode file. The same logic applies to the symbol table (note that we use
support::ulittle32_t instead of a bit encoding).
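
To illustrate why a byte-offset lookup is cheap for a consumer, here is a
minimal sketch, assuming the blob is already in memory as a sequence of
null-terminated strings. The helper name nameAt is hypothetical and is not
part of the proposed patch or of any existing LLVM API; it only shows that
resolving a name needs no bit-level decoding.

```cpp
#include <cstddef>
#include <cstring>
#include <string_view>

// Hypothetical helper (not an LLVM API): return the null-terminated string
// that starts at byte Offset inside Blob. Returns an empty view if Offset is
// out of range or if no terminator is found before the end of the blob.
std::string_view nameAt(std::string_view Blob, std::size_t Offset) {
  if (Offset >= Blob.size())
    return {};
  const char *Start = Blob.data() + Offset;
  // Search only within the blob so a malformed table cannot cause an overread.
  const void *End = std::memchr(Start, '\0', Blob.size() - Offset);
  if (!End)
    return {};
  std::size_t Len =
      static_cast<std::size_t>(static_cast<const char *>(End) - Start);
  return Blob.substr(Offset, Len);
}
```

With the blob "foo\0bar\0baz\0", offsets 0, 4 and 8 would resolve to foo, bar
and baz respectively.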

> That said, we already paid this with the metadata table in the recent past,
> for example.

> records would change so that their first operand would specify their names
> with a byte offset into the string table. (To allow for backwards
> compatibility, I would increment the bitcode version.)
> I assume you mean the EPOCH?

It isn't clear to me why we have both.

> Here is what it would look like as bcanalyzer output:
>   <VERSION op0=2>
>   <COMDAT op0=0 ...> ; name = foo
>   <FUNCTION op0=0 ...> ; name = foo
>   <GLOBALVAR op0=4 ...> ; name = bar
>   <ALIAS op0=8 ...> ; name = baz
>  ; function bodies, etc.
>   <STRTAB_BLOB blob="foo\0bar\0baz\0">
> Why is the string table after the module instead of before?

For implementation simplicity. The idea is that the BitcodeWriter would
have a member of type StringTableBuilder which would accumulate strings
while writing the bitcode module(s) (and symtab in the future). At the end,
the client would call something like BitcodeWriter::writeStrtab() which
would write out the string table.
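
The accumulate-then-write idea can be sketched as follows. This is a
simplified stand-in, assuming a plain append-with-deduplication strategy; the
class name SimpleStrtabBuilder and its methods are hypothetical, and it is not
the real StringTableBuilder, which may lay out strings differently.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <unordered_map>

// Hypothetical, simplified builder (not llvm::StringTableBuilder): it
// accumulates names while records are written and hands back the byte offset
// that a record would store as its first operand.
class SimpleStrtabBuilder {
public:
  // Add Str to the table (or reuse an identical earlier entry) and return
  // its byte offset within the blob.
  std::size_t add(std::string_view Str) {
    // A null-terminated table cannot represent names with embedded nulls.
    assert(Str.find('\0') == std::string_view::npos && "embedded null in name");
    std::string Key(Str);
    auto It = Offsets.find(Key);
    if (It != Offsets.end())
      return It->second; // shared, e.g. a comdat and a function both named foo
    std::size_t Offset = Blob.size();
    Blob.append(Key);
    Blob.push_back('\0');
    Offsets.emplace(std::move(Key), Offset);
    return Offset;
  }

  // The bytes that would be emitted as the string table blob at the end.
  const std::string &blob() const { return Blob; }

private:
  std::string Blob;
  std::unordered_map<std::string, std::size_t> Offsets;
};
```

Adding foo, foo, bar and baz in that order yields offsets 0, 0, 4 and 8, which
matches the op0 values in the bcanalyzer example above. Because offsets are
only final once all names have been seen by whatever builder is used, emitting
the table after the module(s) keeps the writer simple.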

> Each STRTAB_BLOCK would apply to all preceding MODULE_BLOCKs. This means
> that bitcode files can continue to be concatenated with "llvm-cat -b".
> (Normally bitcode files would contain a single string table, which in
> multi-module bitcode files would be shared between modules.)
> This *almost* allows us to remove the global (top-level) VST entirely, if
> not for the function offset in the FNENTRY record. However, this offset is
> not actually required because we can scan the module's FUNCTION_BLOCK_IDs
> as we were doing before http://reviews.llvm.org/D12536 (this may have a
> performance impact, so I'll measure it first).
> Assuming that performance looks good, does this seem reasonable to folks?
> I would rather have a symbol table that entirely replaces the VST, kee.
> If there is a perf impact with the FNENTRY offset, why can’t it be
> replicated in the symbol table?

Sure, we could in principle store function offsets in the symbol table as
well, if that helps with performance. But I want to measure the impact and
find out whether that is actually the case first.
