[llvm-dev] DWARF: Reconstituting mangled names (& skipping DW_AT_linkage_name)

Tue Aug 24 16:39:29 PDT 2021

On Mon, Aug 23, 2021 at 3:15 PM Greg Clayton <clayborg at gmail.com> wrote:

> The idea of encoding names more efficiently is a great idea. I would have
> no concerns if the following were true:
> - we could 100% always reconstruct linkage names if we need to
>

Yep, that'd certainly be the plan. (well, that any place where we omit a
linkage name in the DWARF could be reconstructed - we can always keep
linkage names in places where the DWARF isn't expressive enough to produce
all the info required for the linkage name).

> - accelerator tables that are trusted by debuggers (.debug_names, or
> .apple_XXX) that used to contain linkage names still do after this change
>

Sure - gets a bit trickier in the LLVM IR but do-able. (some way to specify
that the pretty name (for my other proposal about simplified template
names) and/or linkage name (for this proposal) are only present, or only
qualified with template parameters, for accelerated access and not for the
DIE attributes)

> The main reason for this is for the LLDB expression parser. When the
> expression parser needs to call a function, the interface we have with the
> JIT code in LLVM means we always look up functions by linkage (mangled)
> name. So if the accelerator tables don't have the mangled names inside of
> them, we will need to know how/when we would need to ignore the accelerator
> tables and manually index the DWARF each time you debug. Right now LLDB and
> GDB don't trust .debug_pubnames or .debug_pubtypes because they don't index
> everything. .debug_names has more strict rules on what needs to be
> included, so any solution should make sure we don't change the contents of
> this section for a binary compiled with and without this new feature.
>
> I like the idea of being able to refer to a string from the main string
> table of the object file (.strtab for ELF, or LC_SYMTAB in Mach-O) if it
> already exists there. It would be interesting to compare the symbols that
> are in both the .debug_str and .symtab from one of these large C++ binaries
> just to see how much space we could save if we had a new form,
> DW_FORM_symtab_str, that could refer to this section.
>

Yeah, that should be pretty close to the numbers I've seen - I mean, not
every linkage name is in the symtab - because we have linkage names for
fully inlined functions, which wouldn't be in the symtab.
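
A minimal sketch of the comparison suggested above, assuming the strings
from .debug_str and the symbol names from .symtab have already been
extracted into plain lists by some other tool. The function name
sharedStringBytes and both inputs are hypothetical; the result is only an
upper bound, since it ignores the size of whatever reference would replace
each string.

```cpp
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Returns the number of .debug_str bytes (including the NUL terminator)
// occupied by strings that also appear as symbol names. Each distinct
// string is counted once, on the assumption that .debug_str is deduplicated.
std::size_t sharedStringBytes(const std::vector<std::string> &debugStr,
                              const std::vector<std::string> &symtabNames) {
  std::unordered_set<std::string> inSymtab(symtabNames.begin(),
                                           symtabNames.end());
  std::unordered_set<std::string> counted;
  std::size_t bytes = 0;
  for (const std::string &s : debugStr) {
    // Only count strings present in the symbol table, and only once each.
    if (inSymtab.count(s) != 0 && counted.insert(s).second)
      bytes += s.size() + 1;
  }
  return bytes;
}
```

Linkage names of fully inlined functions never match anything in the
symbol list, so they contribute nothing to the total, which is the gap
described above.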

But I also have ideas of removing the linkage names from the symtab too -
well, depending on how you think about it, maybe changing the mangling from
itanium to a hashed name. Then there's an interesting question of what a
given consumer wants when they talk about the linkage name - if they want
the name of the ELF symbol, that'll be correct, but if they want something
that can be demangled, they would need a different name.

> Another idea would be to have a new attribute that relies on the parent DIE
> chain where each child would encode its partial mangled name. Something
> like DW_AT_linkage_prefix and/or DW_AT_linkage_suffix. Then you could
> traverse the parent DIEs to reconstruct the full linkage name.
>
> So if we have
>
> namespace foo {
>   class bar {
>     void print(const char *) const;
>   };
> }
>
> The DWARF could be something like:
>
> DW_TAG_namespace
> DW_AT_name("foo")
> DW_AT_linkage_prefix("_Z3foo")
>
>   DW_TAG_class_type
>   DW_AT_name("bar")
>   DW_AT_linkage_prefix("3bar")
>
>     DW_TAG_subprogram
>     DW_AT_name("print")
>     DW_AT_linkage_prefix("5print")
>     DW_AT_linkage_suffix(" const")
>
>       DW_TAG_formal_parameter
>       DW_AT_name("format")
>       DW_AT_linkage_prefix("PKc")
>
> This might allow a lot more name sharing between templated functions since
> their function base names like "erase", "begin", "end" and many more could
> be shared in the string tables.
>
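
A rough sketch of the parent-chain traversal described above.
DW_AT_linkage_prefix and DW_AT_linkage_suffix are the proposed attributes,
not existing DWARF, and the Die struct and reconstructLinkageName are
hypothetical stand-ins for a real DIE API. The sketch only concatenates
fragments: prefixes from the outermost DIE inward, then suffixes from the
innermost DIE outward.

```cpp
#include <string>
#include <vector>

// Hypothetical in-memory DIE holding only the fields this sketch needs.
struct Die {
  const Die *parent;           // enclosing DIE, or nullptr at the top level
  std::string linkage_prefix;  // proposed DW_AT_linkage_prefix (may be empty)
  std::string linkage_suffix;  // proposed DW_AT_linkage_suffix (may be empty)
};

// Rebuilds a linkage name by walking from the given DIE up to the root.
std::string reconstructLinkageName(const Die &leaf) {
  // Collect the chain leaf-first.
  std::vector<const Die *> chain;
  for (const Die *d = &leaf; d != nullptr; d = d->parent)
    chain.push_back(d);

  std::string name;
  // Prefixes are emitted root-to-leaf.
  for (auto it = chain.rbegin(); it != chain.rend(); ++it)
    name += (*it)->linkage_prefix;
  // Suffixes are emitted leaf-to-root.
  for (const Die *d : chain)
    name += d->linkage_suffix;
  return name;
}
```

With illustrative fragments "_ZN3foo" on the namespace, "3bar" on the
class, and "5print" plus suffix "EPKc" on the subprogram, this yields
_ZN3foo3bar5printEPKc, the Itanium name for a non-const
foo::bar::print(const char *). The const member from the example mangles
as _ZNK3foo3bar5printEPKc, where the K qualifier sits ahead of the
namespace component, so a pure concatenation of per-DIE fragments would
need some extra rule to handle it.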

Yeah, that doesn't capture the majority of the cost I'm dealing with -
where there's lots of complexity due to various very complicated template
parameters.

- Dave

>
>
>
> On Jul 2, 2021, at 1:59 PM, David Blaikie via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
> On Thu, Jul 1, 2021 at 8:22 PM Reid Kleckner <rnk at google.com> wrote:
>
>> It could work, but the long linkage names will still be present in
>> .strtab, so I wonder if it would make more sense to pursue a solution that
>> addresses both issues. I happen to know you were considering a separate
>> proposal for that, and I wonder if it could be used to solve this problem
>> as well. Either way, the debug info consumer must be taught to look up or
>> reconstitute the long mangled name.
>>
>
> True.
>
> (for everyone else's context: I've been tossing around the idea for a
> while to have an option to use hashed names instead of mangled names for
> object symbols (actually I'm starting to consider maybe generalizing this
> to an entire floating ABI - if you can guarantee all the C++ is being
> compiled with the same clang version - it can arbitrarily pick ABI, symbol
> names, etc, that only have to agree with itself - not with some other
> version used to compile some precompiled library, etc) - though we'd still
> want to preserve the mangled names maybe heaped together in a compressed
> section, so that the linker could provide human-actionable diagnostics to
> the user in the event of linker errors)
>
> Though I worry that even some way to reference strings in that compressed
> blob would take up space we could be saving & the time/space tradeoff might
> not be worthwhile. Referencing (rather than reconstituting) would have the
> advantage that there would be no risk of incorrect reconstitution, which
> would be nice - but could be limiting. (for instance - we might at some
> point want to support links with the symbol names omitted in some modes
> where linker errors are especially unlikely (continuous integration, etc) -
> then repeat the link with the symbol names added to get good diagnostics -
> though I suppose in many cases like that we wouldn't want debug info
> either... but maybe sometimes, etc)
>
>
>> I was thinking something like, "if symbol name is longer than X
>> threshold, replace it with _H${contenthash}, place the long name in a side
>> table section". Tools that are aware of the new convention can do the
>> lookup in the side table. Tools that are unaware will just produce funny
>> names. The DWARF linkage name would use the _H symbol, and consumers that
>> care beyond just having a unique linkage identifier can do the lookup.
>>
>
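
A minimal sketch of the threshold-and-side-table scheme described above.
Everything here is an assumption for illustration: the names shortenSymbol,
lookupOriginal and SideTable are hypothetical, the side table is an
in-memory map rather than an object file section, and FNV-1a stands in for
whatever content hash a real toolchain would pick (MSVC, per the link
below, uses MD5).

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>
#include <unordered_map>

// 64-bit FNV-1a, used only as a stand-in content hash for this sketch.
inline std::uint64_t fnv1a64(const std::string &s) {
  std::uint64_t h = 14695981039346656037ULL;  // FNV offset basis
  for (unsigned char c : s) {
    h ^= c;
    h *= 1099511628211ULL;  // FNV prime
  }
  return h;
}

// Maps a hashed symbol back to the original long mangled name.
using SideTable = std::unordered_map<std::string, std::string>;

// Names at or below the threshold pass through unchanged. Longer names
// become "_H" plus 16 hex digits, and the original goes in the side table.
std::string shortenSymbol(const std::string &mangled, std::size_t threshold,
                          SideTable &table) {
  if (mangled.size() <= threshold)
    return mangled;
  char hex[17];
  std::snprintf(hex, sizeof hex, "%016llx",
                static_cast<unsigned long long>(fnv1a64(mangled)));
  std::string hashed = std::string("_H") + hex;
  table.emplace(hashed, mangled);
  return hashed;
}

// A tool aware of the convention recovers the long name; anything not in
// the table is returned as-is, which is what an unaware tool would show.
std::string lookupOriginal(const std::string &symbol, const SideTable &table) {
  auto it = table.find(symbol);
  return it == table.end() ? symbol : it->second;
}
```

A 64-bit hash is too narrow to rule out collisions across a large link;
the sketch ignores that and keeps the first name recorded for a given
hash.
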
> Yeah, with DWARF we'd probably make something a bit more explicit - a new
> DW_FORM, or new attribute name - though I guess there's some benefit to
> producing the unique name that everyone can use even if it's not very
> legible.
>
> Yeah, if I reframe this in my head: What if we fixed the ELF symbol name
> length problems (by using such a hash scheme) - would the remaining DWARF
> size cost be worth the complexity of reconstitution & risk of incorrect
> reconstitution? Maybe not.
>
> Though perhaps there's folks who might be interested in the reconstitution
> savings when they can't change their ABI? In that case it'd be pretty
> misleading to include an incorrect value for the mangled name in the
> DW_AT_linkage_name field. We could introduce a different attribute for it
> in that case.
>
> (I guess if we used references to this shared "real linkage name section"
> - there wouldn't be an issue with stripped binaries: If you stripped out
> the linkage name section you probably stripped out the debug info sections
> too so there wouldn't be anything left to debug/reference the stripped
> linkage names)
>
> Alternatively: If we did this reconstituted linkage name thing, the hashed
> symbols ELF feature could potentially skip the linkage names when there's
> debug info present and rely on reconstituting the names...
>
> In summary: I've mixed thoughts on this.
>
> - Dave
>
>
>>
>> There is prior art for this. MSVC caps linkage names at 4096, I believe,
>> and hashes the name down with MD5:
>>
>> https://github.com/llvm/llvm-project/blob/main/clang/lib/AST/MicrosoftMangle.cpp#L53
>>
>> On Thu, Jun 24, 2021 at 5:32 PM David Blaikie via llvm-dev <
>> llvm-dev at lists.llvm.org> wrote:
>>
>>> In addition to simplifying template names (
>>> https://groups.google.com/g/llvm-dev/c/ekLMllbLIZg ) another case I've
>>> found in my use case is a lot of mangled names (in part because we build
>>> with -fdebug-info-for-profiling which turns on function linkage names even
>>> at -g1/-gmlt).
>>>
>>> So I was wondering if we could recreate linkage names from DWARF, rather
>>> than encoding them directly - and I have a prototype that seems to show
>>> this is possible (at least some simple cases - including some template
>>> cases).
>>>
>>> In the pathological case I'm looking at (lots of expression templates in
>>> TensorFlow) skipping linkage names in the cases I think we can reconstitute
>>> (but I haven't implemented the full logic and verified everything can be
>>> reconstituted) reduced .debug_str.dwo by 52% (and that composes/stacks with
>>> the 43% reduction from the simplified template names - for a 95% reduction
>>> in total) and in a large but less pathological binary it was 56% (in
>>> addition to 25% from the template names, still 80% reduction overall).
>>>
>>> Wondering if anyone's interested in this? Has
>>> thoughts/feelings/concerns/etc?
>>> _______________________________________________
>>> LLVM Developers mailing list
>>> llvm-dev at lists.llvm.org
>>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>