[llvm-dev] RFC: Adding a string table to the bitcode format

Teresa Johnson via llvm-dev llvm-dev at lists.llvm.org
Tue Apr 4 13:23:37 PDT 2017


On Tue, Apr 4, 2017 at 12:21 PM, Peter Collingbourne <peter at pcc.me.uk>
wrote:

>
>
> On Tue, Apr 4, 2017 at 7:37 AM, Teresa Johnson <tejohnson at google.com>
> wrote:
>
>>
>>
>> On Mon, Apr 3, 2017 at 8:13 PM, Mehdi Amini <mehdi.amini at apple.com>
>> wrote:
>>
>>>
>>> On Apr 3, 2017, at 7:08 PM, Peter Collingbourne <peter at pcc.me.uk> wrote:
>>>
>>> Hi,
>>>
>>> As part of PR27551 I want to add a string table to the bitcode format to
>>> allow global value and comdat names to be shared with the proposed symbol
>>> table (and, as side effects, allow comdat names to be shared with value
>>> names, make bitcode files more compressible and make bitcode easier to
>>> parse). The format of the string table would be a top-level block
>>> containing a blob containing null-terminated strings [0] similar to the
>>> string table format used in most object files.
>>>
>>>
>>>
>>> I’m in favor of this, but note that strings can currently be encoded with
>>> fewer than 8 bits per char in some cases, so there might be some size
>>> increase because of this.
>>> That said, we already paid this cost with the metadata table in the recent
>>> past, for example.
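
To put rough numbers on the size point above: a name stored in a record can be emitted as a char6 array (6 bits per character, for names using only [a-zA-Z0-9._]), a 7-bit array, or an 8-bit array, while a blob always costs 8 bits per byte plus a terminator. The sketch below compares only the character payload. It ignores abbreviation IDs, length fields and alignment, and the function names are made up for illustration, not LLVM APIs.

```python
import string

# Byte values representable in the bitstream's char6 encoding.
CHAR6 = set((string.ascii_letters + string.digits + "._").encode("ascii"))


def bits_per_char(name: bytes) -> int:
    """Narrowest per-character width among char6, 7-bit and 8-bit arrays."""
    if all(b in CHAR6 for b in name):
        return 6
    if all(b < 128 for b in name):
        return 7
    return 8


def record_payload_bits(name: bytes) -> int:
    """Character payload when the name is stored inline in a record."""
    return bits_per_char(name) * len(name)


def strtab_payload_bits(name: bytes) -> int:
    """Character payload in a string table blob: 8 bits per byte plus '\\0'."""
    return 8 * (len(name) + 1)
```

Under these assumptions the gap is largest for short char6-only names; any sharing or better compression of the blob would have to make up for it.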
>>>
>>> The format of MODULE_CODE_{FUNCTION,GLOBALVAR,ALIAS,IFUNC,COMDAT}
>>> records would change so that their first operand would specify their names
>>> with a byte offset into the string table. (To allow for backwards
>>> compatibility, I would increment the bitcode version.)
>>>
>>>
>>> I assume you mean the EPOCH?
>>>
>>> Here is what it would look like as bcanalyzer output:
>>>
>>> <MODULE_BLOCK>
>>>   <VERSION op0=2>
>>>   <COMDAT op0=0 ...> ; name = foo
>>>   <FUNCTION op0=0 ...> ; name = foo
>>>   <GLOBALVAR op0=4 ...> ; name = bar
>>>   <ALIAS op0=8 ...> ; name = baz
>>>  ; function bodies, etc.
>>> </MODULE_BLOCK>
>>> <STRTAB_BLOCK>
>>>   <STRTAB_BLOB blob="foo\0bar\0baz\0">
>>> </STRTAB_BLOCK>
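
The offsets in the dump above can be reproduced with a small sketch. It assumes the blob is a plain run of null-terminated strings and that identical names are stored once; `NameTable` and `name_at` are hypothetical helpers for this example, not part of LLVM.

```python
class NameTable:
    """Builds a blob of null-terminated names and hands out byte offsets."""

    def __init__(self):
        self.blob = bytearray()
        self._offsets = {}

    def add(self, name: str) -> int:
        # Reuse the existing offset so equal names are stored only once.
        if name not in self._offsets:
            self._offsets[name] = len(self.blob)
            self.blob += name.encode("utf-8") + b"\0"
        return self._offsets[name]


def name_at(strtab: bytes, offset: int) -> str:
    """Reads the name starting at a byte offset, up to the next '\\0'."""
    end = strtab.index(b"\0", offset)
    return strtab[offset:end].decode("utf-8")


table = NameTable()
comdat_foo = table.add("foo")      # COMDAT op0
function_foo = table.add("foo")    # FUNCTION op0, shares the comdat's entry
globalvar_bar = table.add("bar")   # GLOBALVAR op0
alias_baz = table.add("baz")       # ALIAS op0
```

Here the comdat and the function both get offset 0, which is the sharing between comdat names and value names described earlier.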
>>>
>>>
>>> Why is the string table after the module instead of before?
>>>
>>>
>>> Each STRTAB_BLOCK would apply to all preceding MODULE_BLOCKs. This means
>>> that bitcode files can continue to be concatenated with "llvm-cat -b".
>>>
>> Do you mean "apply to all preceding MODULE_BLOCKs that aren't followed
>> by an intervening STRTAB_BLOCK"? I.e. when bitcode files are concatenated
>> you presumably don't want to apply a STRTAB_BLOCK to a MODULE_BLOCK from a
>> different input bitcode file that has its own STRTAB_BLOCK.
>>
>
> Yes, sorry, that is exactly what I meant.
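
The clarified rule can be sketched as a single pass over the top-level blocks. The (kind, payload) tuples are a stand-in for real bitstream blocks, and pairing a trailing module with None (no string table follows it) is an assumption of this example, not something the proposal specifies.

```python
def assign_string_tables(blocks):
    """Pairs each module with the first STRTAB block that follows it.

    blocks: list of (kind, payload) tuples in file order, where kind is
    "MODULE", "STRTAB", or anything else (ignored here).
    Returns a list of (module_payload, strtab_payload) pairs.
    """
    pending = []  # modules seen since the last string table
    pairs = []
    for kind, payload in blocks:
        if kind == "MODULE":
            pending.append(payload)
        elif kind == "STRTAB":
            # This table applies only to the modules not yet claimed.
            pairs.extend((module, payload) for module in pending)
            pending = []
    # Assumption: modules with no following string table get None.
    pairs.extend((module, None) for module in pending)
    return pairs
```

For two concatenated inputs, the modules of the first file are claimed by the first string table, so the second file's table never applies to them.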
>
>>> (Normally bitcode files would contain a single string table, which in
>>> multi-module bitcode files would be shared between modules.)
>>>
>>> This *almost* allows us to remove the global (top-level) VST entirely,
>>> if not for the function offset in the FNENTRY record. However, this offset
>>> is not actually required because we can scan the module's
>>> FUNCTION_BLOCK_IDs as we were doing before http://reviews.llvm.org/D12536
>>> (this may have a performance impact, so I'll measure it first).
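
The two ways of locating a function body can be contrasted in a toy model. It assumes that function blocks appear in the same order as the FUNCTION records of functions that have bodies; the data layout and helper names are invented for this sketch.

```python
def find_body_by_scan(blocks, defined_functions, wanted):
    """Walks the module's nested blocks and counts function blocks.

    blocks: list of (kind, payload) tuples in stream order.
    defined_functions: names of functions with bodies, in record order.
    """
    index = defined_functions.index(wanted)
    seen = 0
    for kind, payload in blocks:
        if kind != "FUNCTION_BLOCK":
            continue  # skip constants, metadata, etc.
        if seen == index:
            return payload
        seen += 1
    raise KeyError(wanted)


def find_body_by_offset(bodies_by_offset, fnentry_offsets, wanted):
    """Jumps straight to the body using a stored per-function offset."""
    return bodies_by_offset[fnentry_offsets[wanted]]
```

The scan has to step over every earlier block, while the offset lookup does not, which is where any performance difference would come from.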
>>>
>>> Assuming that performance looks good, does this seem reasonable to folks?
>>>
>>>
>>>
>>> I would rather have a symbol table that entirely replaces the VST, kee.
>>> If there is a perf impact with the FNENTRY offset, why can’t it be
>>> replicated in the symbol table?
>>>
>>
>> Won't the new symbol table be added before the top-level VST can be
>> removed, i.e. you need the linkage types etc right? In that case, can the
>> offset just be added to the new symbol table? That would be more analogous
>> to object file symbol tables which also have an offset anyway.
>>
>
> The VST only stores names (and function offsets). The other attributes are
> stored on the MODULE_CODE_{FUNCTION,GLOBALVAR,ALIAS,IFUNC} records. So
> once we move the names elsewhere, the VST isn't really storing much data at
> all.
>

Ok right, that's true... We could probably benchmark the removal of the
offsets on a clang ThinLTO bootstrap. As mentioned off-list to pcc, the
theoretical benefit when I added those offsets was largely because we were
planning to do iterative importing in the ThinLTO backends, which of course
we don't do anymore.

Teresa


> As I mentioned to Mehdi, we could indeed store the function offset in the
> symbol table. That would be done in a separate step to this change, which
> is just about string tables.
>
> Thanks,
> --
> Peter
>



-- 
Teresa Johnson |  Software Engineer |  tejohnson at google.com |  408-460-2413