[cfe-dev] RFC: Up front type information generation in clang and llvm

Tue Mar 29 20:15:41 PDT 2016

On Tue, Mar 29, 2016 at 8:11 PM Peter Collingbourne <peter at pcc.me.uk> wrote:

> On Tue, Mar 29, 2016 at 7:43 PM, Eric Christopher <echristo at gmail.com>
> wrote:
>
>>
>>
>> On Tue, Mar 29, 2016 at 7:31 PM Peter Collingbourne <peter at pcc.me.uk>
>> wrote:
>>
>>> Thanks for sharing this. Mostly seems like a reasonable plan to me. A
>>> few comments below.
>>>
>>>
>> Thanks Peter!
>>
>>
>>> On Tue, Mar 29, 2016 at 6:00 PM, Eric Christopher via cfe-dev <
>>> cfe-dev at lists.llvm.org> wrote:
>>>
>>>> Hi All,
>>>>
>>>> This is something that's been talked about for some time and it's
>>>> probably time to propose it.
>>>>
>>>> The "We" in this document is everyone on the cc line plus me.
>>>>
>>>> Please go ahead and take a look.
>>>>
>>>> Thanks!
>>>>
>>>> -eric
>>>>
>>>>
>>>> Objective (and TL;DR)
>>>> =================
>>>>
>>>> Migrate debug type information generation from the backends to the
>>>> front end.
>>>>
>>>> This will enable:
>>>> 1. Separation of concerns and maintainability: LLVM shouldn’t have to
>>>> know about C preprocessor macros, Obj-C properties, or extensive details
>>>> about debug information binary formats.
>>>> 2. Performance: Skipping a serialization should speed up normal
>>>> compilations.
>>>> 3. Memory usage: The DI metadata structures are smaller than they were,
>>>> but are still fairly large and pointer heavy.
>>>>
>>>> Motivation
>>>> ========
>>>>
>>>> Currently, types in LLVM debug info are described by the DIType class
>>>> hierarchy. This hierarchy evolved organically from a more flexible
>>>> sea-of-nodes representation into what it is today - a large, only somewhat
>>>> format neutral representation of debug types. Making this more format
>>>> neutral will only increase the memory use - and for no reason as type
>>>> information is static (or nearly so). Debug formats already have a
>>>> memory-efficient serialization, namely their own binary format, so we
>>>> should support a front end emitting type information, with sufficient
>>>> representation to allow the backend to emit debug information based on
>>>> the more normal IR features: functions, scopes, variables, etc.
>>>>
>>>> Scope/Impact
>>>> ===========
>>>>
>>>> This is going to involve large scale changes across both LLVM and
>>>> clang. This will also affect any out-of-tree front ends; however, we expect
>>>> the impact to be on the order of a large API change rather than needing
>>>> massive infrastructure changes.
>>>>
>>>> Related work
>>>> ==========
>>>>
>>>> This is related to the efforts to support CodeView in LLVM and clang as
>>>> well as efforts to reduce overall memory consumption when compiling with
>>>> debug information enabled; in particular efforts to prune LTO memory usage.
>>>>
>>>>
>>>> Concerns
>>>> ========
>>>>
>>>>
>>>> We need a good story for transitioning all the debug info testcases in
>>>> the backend without giving up coverage and/or readability. David believes
>>>> he has a plan here.
>>>>
>>>> Proposal
>>>> =======
>>>>
>>>> Short version
>>>> -----------------
>>>>
>>>> 1. Split the DIBuilder API into Types (+Macros, Imports, …) and Line
>>>> Table.
>>>> 2. Split the clang CGDebugInfo API into Types and Line Table to match.
>>>> 3. Add an LLVM DWARF emission library similar to the existing CodeView
>>>> one.
>>>> 4. Migrate the Types API into a clang internal API taking clang AST
>>>> structures and use the LLVM binary emission libraries to produce type
>>>> information.
>>>> 5. Remove the old binary emission from LLVM.
>>>>
>>>>
>>>> Questions/Thoughts/Elaboration
>>>> -------------------------------------------
>>>>
>>>> Splitting the DIBuilder API
>>>> ~~~~~~~~~~~~~~~~~~~~
>>>> Will DISubprogram be part of both?
>>>>    * We should split it in two: Full declarations with type and a
>>>> slimmed down version with an abstract origin.
>>>>
>>>> How will we reference types in the DWARF blob?
>>>>    * ODR types can be referenced by name
>>>>    * Non-ODR types by full DWARF hash
>>>>    * Each type can be a pair(tuple) of identifier (DITypeRef today) and
>>>> blob.
>>>>    * For versions before DWARF4 we can emit each type as a unit (but not
>>>> as a DWARF Type Unit) and use references and module relocations for the
>>>> offsets. (See below)
>>>>
>>>> How will we handle references in DWARF2 or global relocations for
>>>> non-type template parameters?
>>>>    * We can use a “relocation” metadata as part of the format.
>>>>    * This is representable as a tuple holding the DIType and the offset
>>>> within the DIBlob at which to write the final relocation/offset for the
>>>> reference at emission time.
>>>>
>>>> Why break up the types at all?
>>>>    * To enable non-debug format aware linking and type uniquing for LTO
>>>> that won’t be huge in size. We break up the types so we don’t need to parse
>>>> debug information to link two modules together efficiently.
>>>>
>>>
>>> How do you plan to handle abbreviations? You wouldn't necessarily be
>>> able to embed them directly in the blob, as when doing LTO each compilation
>>> unit would have its own set of abbreviations. I suppose you could do
>>> something like treat them as a special sort of reference to an abbreviation
>>> table entry, or maybe pre-allocate in the frontend (but would complicate
>>> cross-frontend LTO) but curious what you have in mind.
>>>
>>
>> Thanks for reminding me, I knew I was forgetting something I'd talked
>> about when writing all of this down. :)
>>
>> Basically, to handle abbreviations you can treat them similarly to types
>> by creating a blob with an index/hash/etc. and then referencing that as
>> part of the type tuple, e.g.:
>>
>> $1 = { DIAbbrev: 0x1234, DIBlob: <blah> }
>> $2 = { DIType: <ID>, DIAbbrev: $1, DIBlob: <blah> }
>>
>> and keep them uniqued during emission and remember to merge these as well
>> during module merge time.
>>
>
> Makes sense, but wouldn't you need multiple abbreviations for each DIType,
> in order to represent DITypes formed of multiple DIEs (e.g. enums, records)?
>
> Maybe something like this would work:
>
> $1 = { DIAbbrev: 0x1234, DIBlob: DW_TAG_enumeration_type<blah> }
> $2 = { DIAbbrev: 0x5678, DIBlob: DW_TAG_enumerator<blah> }
> $3 = { DIType: <ID>, DIAbbrev: [(0, $1), (8, $2), (16, $2)], DIBlob: <8
> bytes of DW_TAG_enumeration_type attrs><8 bytes of DW_TAG_enumerator
> attrs><8 bytes of DW_TAG_enumerator attrs><0> }
>
> ?
>

*nod* That (or something similar) will work.

-eric
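
The offset-indexed abbreviation list discussed above can be sketched in C++
roughly as follows. This is only an illustration under stated assumptions:
AbbrevNode, TypeNode, AbbrevTable, emitType and makeEnumExample are
hypothetical names, not existing LLVM APIs; abbreviation codes are assumed to
fit in one ULEB128 byte; and the attribute bytes are filler rather than real
DWARF encodings.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a "{ DIAbbrev, DIBlob }" node: an abbreviation
// identified by a hash. A real node would carry encoded bytes, not a name.
struct AbbrevNode {
  uint64_t Hash;
  std::string Tag;
};

// Hypothetical stand-in for "{ DIType, DIAbbrev: [(offset, abbrev)...], DIBlob }".
// Abbrevs must be sorted by offset; each offset marks where a DIE starts.
struct TypeNode {
  std::string Id;
  std::vector<std::pair<size_t, const AbbrevNode *>> Abbrevs;
  std::vector<uint8_t> Blob; // attribute bytes only, no abbreviation codes
};

// Assigns one code per distinct abbreviation hash, so abbreviations coming
// from different modules are uniqued at emission time.
class AbbrevTable {
  std::map<uint64_t, unsigned> CodeByHash;

public:
  unsigned codeFor(const AbbrevNode &A) {
    auto It = CodeByHash.find(A.Hash);
    if (It != CodeByHash.end())
      return It->second;
    // DWARF abbreviation codes start at 1; 0 terminates a sibling chain.
    unsigned Code = static_cast<unsigned>(CodeByHash.size()) + 1;
    CodeByHash.emplace(A.Hash, Code);
    return Code;
  }
  size_t size() const { return CodeByHash.size(); }
};

// Copies the blob to the output, splicing the final abbreviation code in
// front of each DIE at the offsets recorded in the type node.
std::vector<uint8_t> emitType(const TypeNode &T, AbbrevTable &Table) {
  std::vector<uint8_t> Out;
  size_t Next = 0; // index of the next (offset, abbrev) pair to apply
  for (size_t Pos = 0; Pos < T.Blob.size(); ++Pos) {
    if (Next < T.Abbrevs.size() && T.Abbrevs[Next].first == Pos) {
      unsigned Code = Table.codeFor(*T.Abbrevs[Next].second);
      assert(Code < 0x80 && "sketch only handles one-byte ULEB128 codes");
      Out.push_back(static_cast<uint8_t>(Code));
      ++Next;
    }
    Out.push_back(T.Blob[Pos]);
  }
  return Out;
}

// Builds the enum example from the thread: one enumeration DIE and two
// enumerator DIEs of 8 filler attribute bytes each, then a 0 terminator.
TypeNode makeEnumExample(const AbbrevNode &EnumAbbrev,
                         const AbbrevNode &EnumeratorAbbrev) {
  TypeNode T;
  T.Id = "example-enum";
  T.Abbrevs = {{0, &EnumAbbrev}, {8, &EnumeratorAbbrev}, {16, &EnumeratorAbbrev}};
  T.Blob.assign(24, 0xAA); // filler standing in for attribute data
  T.Blob.push_back(0);     // end-of-children marker
  return T;
}
```

With the two-enumerator example, the 25-byte blob (24 attribute bytes plus the
terminating 0) comes out as 28 bytes, since one abbreviation code is spliced in
front of each of the three DIEs. An abbreviation with the same hash coming from
another module maps to the same code, which is the uniquing needed at module
merge time.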

>
>
>>
>>>
>>>> Any other concerns there?
>>>>    * Debug information without type units might be slightly larger in
>>>> this scheme due to parents being duplicated (declarations and abstract
>>>> origin, not full parents). It may be possible to extend dsymutil/etc to
>>>> merge all siblings into a common parent. Open question for better ways to
>>>> solve this.
>>>>
>>>
>>> When we were thinking about teaching the backend to produce blobs from
>>> IR metadata we were thinking about cases where the debug info emitter would
>>> discover special member functions during IR traversal. I guess since we're
>>> moving all of that to the frontend we can just ask the frontend directly
>>> which special members are needed on the class. That solves the problem for
>>> a single translation unit. But what do you plan to do in the multiple
>>> translation unit case where two TUs declare different special members on a
>>> class? Would it be fine to just emit the two definitions and let the
>>> debugger sort it out? I guess this is the type of thing that debuggers
>>> normally deal with in the non-LTO case, so I suppose so?
>>>
>>
>> Pretty much. This is one area where I have... disagreements with the
>> DWARF committee and I don't think there's anything else we can do here. TBH
>> right now I think we'd have issues with type units and special member
>> functions since we're using ODR-ness to unique.
>>
>> -eric
>>
>>
>>>
>>>
>>>> How should we handle DWARF5/Apple Accelerator Tables?
>>>>    * Thoughts:
>>>>    * We can parse the DWARF in the back end and generate them.
>>>>    * We can emit in the front end for the base case of non-LTO (with
>>>> help from the backend for relocation aspects).
>>>>    * We can use dsymutil on LTO debug information to generate them.
>>>>
>>>> Why isn’t this a more detailed spec?
>>>>    * Mostly because we’ve thought about the issues, but we can’t plan
>>>> for everything during implementation.
>>>>
>>>>
>>>> Future work
>>>> ----------------
>>>>
>>>> Not contained as part of this, but an obvious future direction is that
>>>> the Module linker could grow support for debug aware linking. Then we can
>>>> have all of the type information for a single translation unit in a single
>>>> blob and use the debug aware linking to handle merging types.
>>>>
>>>> _______________________________________________
>>>> cfe-dev mailing list
>>>> cfe-dev at lists.llvm.org
>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>>>>
>>>>
>>>
>>>
>>> --
>>> --
>>> Peter
>>>
>>
>
>
> --
> --
> Peter
>