[llvm-dev] RFC: Adding a string table to the bitcode format

Tue Apr 4 12:12:36 PDT 2017

On Mon, Apr 3, 2017 at 8:13 PM, Mehdi Amini <mehdi.amini at apple.com> wrote:

>
> On Apr 3, 2017, at 7:08 PM, Peter Collingbourne <peter at pcc.me.uk> wrote:
>
> Hi,
>
> As part of PR27551 I want to add a string table to the bitcode format to
> allow global value and comdat names to be shared with the proposed symbol
> table (and, as side effects, allow comdat names to be shared with value
> names, make bitcode files more compressible and make bitcode easier to
> parse). The format of the string table would be a top-level block
> containing a blob containing null-terminated strings [0] similar to the
> string table format used in most object files.
>
>
>
> I’m in favor of this, but note that currently string can be encoded with
> less than 8 bits / char in some cases (there might some size increase
> because of this).
>

Sure, but I think we need to make the right tradeoff between making data
more efficient to read and using fewer bits. In this case I think the right
tradeoff is clearly in favour of being efficient to read, because accessing
it is in the critical path of a consumer (i.e. a linker), and the part that
needs to be efficient to read is a relatively small part of the data in the
bitcode file. The same logic applies to the symbol table (note that we use
support::ulittle32_t instead of a bit encoding).

That said we already paid this with the metadata table in the recent past
> for example.
>

> The format of MODULE_CODE_{FUNCTION,GLOBALVAR,ALIAS,IFUNC,COMDAT}
> records would change so that their first operand would specify their names
> with a byte offset into the string table. (To allow for backwards
> compatibility, I would increment the bitcode version.)
>
>
> I assume you mean the EPOCH?
>

No, the MODULE_CODE_VERSION.
http://llvm-cs.pcc.me.uk/lib/Bitcode/Writer/BitcodeWriter.cpp#3822
It isn't clear to me why we have both.

> Here is what it would look like as bcanalyzer output:
>
> <MODULE_BLOCK>
>   <VERSION op0=2>
>   <COMDAT op0=0 ...> ; name = foo
>   <FUNCTION op0=0 ...> ; name = foo
>   <GLOBALVAR op0=4 ...> ; name = bar
>   <ALIAS op0=8 ...> ; name = baz
>  ; function bodies, etc.
> </MODULE_BLOCK>
> <STRTAB_BLOCK>
>   <STRTAB_BLOB blob="foo\0bar\0baz\0">
> </STRTAB_BLOCK>
>
>
> Why is the string table after the module instead of before?
>

For implementation simplicity. The idea is that the BitcodeWriter would
have a member of type StringTableBuilder which would accumulate strings
while writing the bitcode module(s) (and symtab in the future). At the end,
the client would call something like BitcodeWriter::writeStrtab() which
would write out the string table.

>
> Each STRTAB_BLOCK would apply to all preceding MODULE_BLOCKs. This means
> that bitcode files can continue to be concatenated with "llvm-cat -b".
> (Normally bitcode files would contain a single string table, which in
> multi-module bitcode files would be shared between modules.)
>
> This *almost* allows us to remove the global (top-level) VST entirely, if
> not for the function offset in the FNENTRY record. However, this offset is
> not actually required because we can scan the module's FUNCTION_BLOCK_IDs
> as we were doing before http://reviews.llvm.org/D12536 (this may have a
> performance impact, so I'll measure it first).
>
> Assuming that performance looks good, does this seem reasonable to folks?
>
>
>
> I rather seek to have a symbol table that entirely replace the VST, kee.
> If there is a perf impact with the FNENTRY offset, why can’t it be
> replicated in the symbol table?
>

Sure, we could in principle store function offsets in the symbol table as
well, if that helps with performance. But I want to measure the impact and
find out whether that is actually the case first.

Thanks,
-- 
-- 
Peter
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170404/7810ab51/attachment.html>