[llvm-dev] RFC: Adding a string table to the bitcode format

Peter Collingbourne via llvm-dev llvm-dev at lists.llvm.org
Tue Apr 4 13:11:10 PDT 2017


On Tue, Apr 4, 2017 at 12:36 PM, Duncan P. N. Exon Smith <
dexonsmith at apple.com> wrote:

>
> On 2017-Apr-04, at 12:12, Peter Collingbourne <peter at pcc.me.uk> wrote:
>
> On Mon, Apr 3, 2017 at 8:13 PM, Mehdi Amini <mehdi.amini at apple.com> wrote:
>
>>
>> On Apr 3, 2017, at 7:08 PM, Peter Collingbourne <peter at pcc.me.uk> wrote:
>>
>> Hi,
>>
>> As part of PR27551 I want to add a string table to the bitcode format to
>> allow global value and comdat names to be shared with the proposed symbol
>> table (and, as side effects, allow comdat names to be shared with value
>> names, make bitcode files more compressible and make bitcode easier to
>> parse). The format of the string table would be a top-level block
>> containing a blob containing null-terminated strings [0] similar to the
>> string table format used in most object files.
>>
>>
>>
>> I’m in favor of this, but note that currently string can be encoded with
>> less than 8 bits / char in some cases (there might some size increase
>> because of this).
>>
>
> Sure, but I think we need to make the right tradeoff between making data
> more efficient to read and using fewer bits. In this case I think the right
> tradeoff is clearly in favour of being efficient to read, because accessing
> it is in the critical path of a consumer (i.e. a linker), and the part that
> needs to be efficient to read is a relatively small part of the data in the
> bitcode file. The same logic applies to the symbol table (note that we use
> support::ulittle32_t instead of a bit encoding).
>
>
> The same logic might suggest storing string sizes (as a prefix) to avoid
> scanning for string lengths.
>

It might do, keeping in mind that reading pretty much every existing object
file format already requires scanning for string lengths. Certainly
something to try and evaluate, at least.

Peter


>
> That said we already paid this with the metadata table in the recent past
>> for example.
>>
>
>> The format of MODULE_CODE_{FUNCTION,GLOBALVAR,ALIAS,IFUNC,COMDAT}
>> records would change so that their first operand would specify their names
>> with a byte offset into the string table. (To allow for backwards
>> compatibility, I would increment the bitcode version.)
>>
>>
>> I assume you mean the EPOCH?
>>
>
> No, the MODULE_CODE_VERSION.
> http://llvm-cs.pcc.me.uk/lib/Bitcode/Writer/BitcodeWriter.cpp#3822
> It isn't clear to me why we have both.
>
>
>> Here is what it would look like as bcanalyzer output:
>>
>> <MODULE_BLOCK>
>>   <VERSION op0=2>
>>   <COMDAT op0=0 ...> ; name = foo
>>   <FUNCTION op0=0 ...> ; name = foo
>>   <GLOBALVAR op0=4 ...> ; name = bar
>>   <ALIAS op0=8 ...> ; name = baz
>>  ; function bodies, etc.
>> </MODULE_BLOCK>
>> <STRTAB_BLOCK>
>>   <STRTAB_BLOB blob="foo\0bar\0baz\0">
>> </STRTAB_BLOCK>
>>
>>
>> Why is the string table after the module instead of before?
>>
>
> For implementation simplicity. The idea is that the BitcodeWriter would
> have a member of type StringTableBuilder which would accumulate strings
> while writing the bitcode module(s) (and symtab in the future). At the end,
> the client would call something like BitcodeWriter::writeStrtab() which
> would write out the string table.
>
>
>>
>> Each STRTAB_BLOCK would apply to all preceding MODULE_BLOCKs. This means
>> that bitcode files can continue to be concatenated with "llvm-cat -b".
>> (Normally bitcode files would contain a single string table, which in
>> multi-module bitcode files would be shared between modules.)
>>
>> This *almost* allows us to remove the global (top-level) VST entirely, if
>> not for the function offset in the FNENTRY record. However, this offset is
>> not actually required because we can scan the module's FUNCTION_BLOCK_IDs
>> as we were doing before http://reviews.llvm.org/D12536 (this may have a
>> performance impact, so I'll measure it first).
>>
>> Assuming that performance looks good, does this seem reasonable to folks?
>>
>>
>>
>> I rather seek to have a symbol table that entirely replace the VST, kee.
>> If there is a perf impact with the FNENTRY offset, why can’t it be
>> replicated in the symbol table?
>>
>
> Sure, we could in principle store function offsets in the symbol table as
> well, if that helps with performance. But I want to measure the impact and
> find out whether that is actually the case first.
>
> Thanks,
> --
> --
> Peter
>
>
>


-- 
-- 
Peter
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170404/1a566d64/attachment-0001.html>


More information about the llvm-dev mailing list