[llvm-dev] RFC: Adding a string table to the bitcode format

Tue Apr 4 14:10:18 PDT 2017

On Tue, Apr 4, 2017 at 2:04 PM, Mehdi Amini <mehdi.amini at apple.com> wrote:

>
> On Apr 4, 2017, at 1:40 PM, Peter Collingbourne <peter at pcc.me.uk> wrote:
>
>
>
> On Tue, Apr 4, 2017 at 1:25 PM, Mehdi Amini <mehdi.amini at apple.com> wrote:
>
>>
>> On Apr 4, 2017, at 12:12 PM, Peter Collingbourne <peter at pcc.me.uk> wrote:
>>
>> On Mon, Apr 3, 2017 at 8:13 PM, Mehdi Amini <mehdi.amini at apple.com>
>> wrote:
>>
>>>
>>> On Apr 3, 2017, at 7:08 PM, Peter Collingbourne <peter at pcc.me.uk> wrote:
>>>
>>> Hi,
>>>
>>> As part of PR27551 I want to add a string table to the bitcode format to
>>> allow global value and comdat names to be shared with the proposed symbol
>>> table (and, as side effects, allow comdat names to be shared with value
>>> names, make bitcode files more compressible and make bitcode easier to
>>> parse). The format of the string table would be a top-level block
>>> containing a blob containing null-terminated strings [0] similar to the
>>> string table format used in most object files.
>>>
>>>
>>>
>>> I’m in favor of this, but note that currently strings can be encoded with
>>> fewer than 8 bits / char in some cases (there might be some size increase
>>> because of this).
>>>
>>
>> Sure, but I think we need to make the right tradeoff between making data
>> more efficient to read and using fewer bits. In this case I think the right
>> tradeoff is clearly in favour of being efficient to read, because accessing
>> it is in the critical path of a consumer (i.e. a linker), and the part that
>> needs to be efficient to read is a relatively small part of the data in the
>> bitcode file. The same logic applies to the symbol table (note that we use
>> support::ulittle32_t instead of a bit encoding).
>>
>>> That said, we already paid this cost with the metadata table in the recent
>>> past, for example.
>>>
>>
>>> The format of MODULE_CODE_{FUNCTION,GLOBALVAR,ALIAS,IFUNC,COMDAT}
>>> records would change so that their first operand would specify their names
>>> with a byte offset into the string table. (To allow for backwards
>>> compatibility, I would increment the bitcode version.)
>>>
>>>
>>> I assume you mean the EPOCH?
>>>
>>
>> No, the MODULE_CODE_VERSION.
>> http://llvm-cs.pcc.me.uk/lib/Bitcode/Writer/BitcodeWriter.cpp#3822
>> It isn't clear to me why we have both.
>>
>>
>>> Here is what it would look like as bcanalyzer output:
>>>
>>> <MODULE_BLOCK>
>>>   <VERSION op0=2>
>>>   <COMDAT op0=0 ...> ; name = foo
>>>   <FUNCTION op0=0 ...> ; name = foo
>>>   <GLOBALVAR op0=4 ...> ; name = bar
>>>   <ALIAS op0=8 ...> ; name = baz
>>>  ; function bodies, etc.
>>> </MODULE_BLOCK>
>>> <STRTAB_BLOCK>
>>>   <STRTAB_BLOB blob="foo\0bar\0baz\0">
>>> </STRTAB_BLOCK>
>>>
>>>
>>> Why is the string table after the module instead of before?
>>>
>>
>> For implementation simplicity. The idea is that the BitcodeWriter would
>> have a member of type StringTableBuilder which would accumulate strings
>> while writing the bitcode module(s) (and symtab in the future). At the end,
>> the client would call something like BitcodeWriter::writeStrtab() which
>> would write out the string table.
>>
>>
>> There is already a traversal of the module for value numbering, building
>> the StringTable at the same time seems quite natural to me.
>>
>
> Other modules in the same bitcode file may need to add names to the string
> table, and the symbol table builder may also need to add mangled names.
> Trying to impose an ordering on all of those components doesn't seem worth
> it in my opinion.
>
>
> I’d stick with a single table per module, to be able to preserve the
> ability to perform binary split of modules.
>

We can still extract individual modules by concatenating the module and the
string table.

Thanks,
-- 
Peter
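
For readers following the thread, the layout proposed above (one blob of
null-terminated strings, with records referring to names by byte offset) can
be sketched as below. This is only an illustration under stated assumptions:
the class SketchStringTable and its methods are hypothetical names invented
here, not LLVM's StringTableBuilder and not code from the proposed patch, and
the sketch assumes names contain no embedded null bytes.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical illustration of an offset-addressed string table.
// Not LLVM's StringTableBuilder; names and API are assumptions.
class SketchStringTable {
public:
  // Adds Name if it is not already present and returns the byte offset
  // of its first character within the blob.
  std::uint32_t add(const std::string &Name) {
    auto It = Offsets.find(Name);
    if (It != Offsets.end())
      return It->second; // Reuse the existing entry so names are shared.
    std::uint32_t Offset = static_cast<std::uint32_t>(Blob.size());
    Blob.append(Name);
    Blob.push_back('\0'); // Each string is null-terminated.
    Offsets.emplace(Name, Offset);
    return Offset;
  }

  // Reads the null-terminated string that starts at Offset.
  std::string get(std::uint32_t Offset) const {
    assert(Offset < Blob.size() && "offset outside string table");
    return std::string(Blob.c_str() + Offset);
  }

  // The raw bytes that would be emitted as the blob.
  const std::string &blob() const { return Blob; }

private:
  std::string Blob;
  std::unordered_map<std::string, std::uint32_t> Offsets;
};
```

Adding "foo", "bar" and "baz" in that order to this sketch yields offsets 0, 4
and 8, the same op0 values as in the bcanalyzer output quoted above. Adding
"foo" a second time returns 0 again, which is how a comdat and a function
could share one name.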