[llvm-dev] DWARF: Should type units be referenced by signature or declaration?

Tue Feb 14 19:31:56 PST 2017

Hi David, this is pretty dense, so I've asked for some clarifications
and made a few other comments.

Broadly speaking it seems like there are a couple of problems:
- picking what should be put into a type unit relies on poor heuristics;
- a type unit can end up dragging along an excessive volume of stuff
  beyond the type that you actually *want* to put into it.
If there is another problem in addition to these, please reiterate it.

Relying on "has a mangled name" as a proxy for "can go into a type unit"
(which is what it sounded like, not sure I understood that part correctly)
seems like not a great fit.  In the original type-unit discussions, I
remember Cary talking about using the signature as the group key; is that
infeasible for some reason?  It would keep unnamed enums from being 
excluded from type units, as well as being considerably shorter than most 
mangled names (saving even more space!).
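
To make the signature-as-key idea concrete, here is a minimal sketch of
de-duplicating by a fixed 8-byte key computed from the type's contents
rather than from its name.  Everything in it is an assumption for
illustration: TypeUnitRegistry, getOrCreate, and the "flattened" string
describing the type are hypothetical, and 64-bit FNV-1a stands in for the
MD5-based signature that DWARF actually specifies, only so the sketch
needs nothing beyond the standard library.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>

// Stand-in hash (64-bit FNV-1a). Real DWARF type signatures are MD5-based;
// this is NOT the DWARF algorithm, just a dependency-free placeholder.
inline std::uint64_t hashBytes(const std::string &S) {
  std::uint64_t H = 14695981039346656037ull; // FNV offset basis
  for (unsigned char C : S) {
    H ^= C;
    H *= 1099511628211ull;                   // FNV prime
  }
  return H;
}

// Hypothetical registry: at most one type unit per distinct signature.
class TypeUnitRegistry {
  std::unordered_map<std::uint64_t, unsigned> UnitBySignature;

public:
  // FlattenedType is an assumed canonical description of the type's
  // contents. No name is required, so an unnamed enum gets a key too.
  // Returns the index of the unit for that description, creating it if new.
  unsigned getOrCreate(const std::string &FlattenedType) {
    std::uint64_t Key = hashBytes(FlattenedType);
    unsigned Next = static_cast<unsigned>(UnitBySignature.size());
    return UnitBySignature.emplace(Key, Next).first->second;
  }

  std::size_t size() const { return UnitBySignature.size(); }
};
```

Two unnamed enums with identical contents map to the same unit under
this scheme, and the key stays 8 bytes regardless of how long the
mangled name would have been.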

(Nothing from inside an anonymous namespace should go into a type unit.
Type units are for enabling de-duplication, and anonymous namespaces by
definition do not contain anything sharable across CUs.  It wasn't clear
whether you thought the current state of things there was good or bad.)

And the part about throwing away work late, after discovering something 
relies on an address...  That one would depend on how often you had to 
throw away work in practice.  I could imagine doing some kind of 
isOKForTypeUnit() predicate, traversing the type tree before we actually
emit anything, but whether that's profitable is a performance question
that requires experimental evidence.
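
A minimal sketch of what such an up-front check might look like,
assuming a simplified type graph.  TypeNode, NeedsAddress, and Refs are
hypothetical stand-ins, not LLVM's DIType API; only the name
isOKForTypeUnit() comes from the text above.

```cpp
#include <unordered_set>
#include <vector>

// Hypothetical, simplified stand-in for a debug-info type node.
struct TypeNode {
  bool NeedsAddress = false;          // e.g. pointer non-type template parameter
  std::vector<const TypeNode *> Refs; // types that would be emitted along with it
};

// Walks everything reachable from T. Visited guards against cycles, such as
// a class with a member that points back to the class itself.
static bool isOKForTypeUnitImpl(const TypeNode *T,
                                std::unordered_set<const TypeNode *> &Visited) {
  if (!T || !Visited.insert(T).second)
    return true; // null, or already reached on this walk
  if (T->NeedsAddress)
    return false;
  for (const TypeNode *R : T->Refs)
    if (!isOKForTypeUnitImpl(R, Visited))
      return false;
  return true;
}

// Returns true if nothing reachable from Root needs an address, i.e. the
// type could be built in a type unit without later discarding the work.
bool isOKForTypeUnit(const TypeNode *Root) {
  std::unordered_set<const TypeNode *> Visited;
  return isOKForTypeUnitImpl(Root, Visited);
}
```

The walk costs one pass over the reachable types for every candidate,
including the ones that would have succeeded anyway, which is why it
only pays off if discarded work is common in practice.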

As for reducing the volume of stuff in the type unit, I think I need
the clarifications I've asked for below in order to get a handle on
what you are thinking there.

I am interested in having type units be useful and effective, so this
is something I am happy to devote some time to.  See inline comments
and looking forward to figuring all this stuff out.
--paulr

> From: David Blaikie [mailto:dblaikie at gmail.com] 
> Sent: Friday, February 03, 2017 7:16 PM
> To: llvm-dev; Robinson, Paul; Eric Christopher
> Cc: Adrian Prantl
> Subject: DWARF: Should type units be referenced by signature or declaration?
>
> Bunch of initially unrelated context:
>
> * type units can be referenced in a variety of ways:
>  * DW_FORM_ref_sig8 on any attribute needing to reference the type
>  * DW_AT_signature on a declaration of the type
>  * extra wrinkle: the declaration can be nested into the appropriate namespace and given a name, or not

Sorry, didn't follow the wrinkle.

>  * LLVM always does the "most expressive"/expensive thing: a full declaration (though without a name, but with the DW_AT_signature) in the correct namespace.

If you could unpack this a little more, that would help.

>  * GCC is more selective/nuanced in its choice of representation, depending on context.
> * Types may be emitted unreferenced (LLVM's retained types list, which will be more strongly leveraged for C++ modules + debug info in the near future) into type units, or directly into the CU
> * Types that reference addresses (pointer non-type template parameters, for example) may not be in type units when using Fission (they have no way to reference the address pool)
>  * The LLVM implementation of this isn't terribly efficient - a flag is lowered on the address pool, if at any point an address is required the flag is raised and all subsequent type creation is skipped, once control returns to the code responsible for creating the type unit, the flag is examined and if it is up - all the work is thrown out, and the type is then created in the CU.
> * Type units have some overhead (2x on GCC, 1.5x on Clang (as measured by the difference between the reduction in debug_info size compared to the increase in debug_types size) when I measured a while ago)
> * LLVM uses the mangled name of the type as the deduplication key for type units
>  * because of this, LLVM doesn't produce type units for non-public types (eg: classes in anonymous namespaces - or unnamed enums... (this latter one produces some wrinkles))
>
> Motivation:
> * Types that are only emitted once across the program (eg: attached to a template explicit instantiation definition or emitted due to a strong vtable) shouldn't be put into type units so they don't pay the overhead.
>
> Issues:
> * This leads to type unit types referencing non-type unit types - what DWARF should be used for that? a type declaration in the type unit? I think: yes

A type unit has "a single complete type definition."  If the type's definition can be considered complete while making use of other types that are merely declarations, seems fine to me.

> * This issue sort of already comes up & is punted if the ODR is violated. If an external type references an internal type, the internal type is emitted into the type unit (& into any other TU/CU that uses it - much duplication)

Yep

> * If type units may reference other types by declaration (already true - a type may only be available as a declaration) - why not referencing all types by declaration?

You can probably figure out how to replace some referenced definitions by declarations, when constructing the type unit. Can't be done for everything (there's no way to turn a base_type into a declaration) and for cases where you can substitute, the question is whether the consumer can do anything useful with the breadcrumbs you have left behind.

>  * Is there substantial benefit to the debugger to not have to do name resolution, but rather to match types by signature directly?

Once I understand the question, I can ask our debugger people. :-)

> * Since type units can be emitted without a reference to them from the CU, a consumer can't rely on reachability of the type unit reference graph so this should be only a performance concern, not a correctness one.

Rely on reachability of the type unit reference graph?  Feels like there's a use-case you have in mind that I am not reconstructing.

> * If declarations are used selectively or pervasively, this would help address pool issue too: even if a type uses an address, it would go in the CU but types referencing that type could still remain in a TU.
>
> So, barring anything else, I'm sort of inclined to just make all references to types in type units plain declarations (oh, also, DW_AT_declaration + DW_AT_name is smaller than DW_AT_declaration + DW_AT_signature (4 bytes instead of 8)). Simpler implementation, possible performance loss for the debugger (lacking the shortcut to find a type by signature instead of name lookup) and should tidy up a bunch of oddities as well as paving the way for improvements around types that don't need type units.
>
> Any thoughts/suggestions/(dis)recommendations?
>
> - Dave
>
> Bonus question: it's possible that the type-with-addresses issue could be checked up front (the DICompositeType could be examined for all its template parameters to see if any involve addresses of globals) but that seems a little brittle (other uses of addresses could crop up - like some IR producer could create a member function declaration in the member list for a member function template instantiation (Clang doesn't do this - member function template instantiations refer to the class as their scope, but do not appear in the member list - this keeps types uniform across translation units), for example) but could simplify the implementation in terms of not needing to do a bunch of (now much less if all the intermediate types don't need to be thrown out too) work that may be thrown out. Worth it? Other ideas?
> (also: GCC doesn't implement this rule, so its Fission+type units should have trouble resolving addresses & may end up referring to the wrong address pool, etc)