[cfe-dev] RFC: Up front type information generation in clang and llvm

Tue Mar 29 23:35:34 PDT 2016

How will this affect other languages that generate debug info? Not that
you should care about those, I'm just curious. My Pascal compiler does not
generate a clang-style AST and does not use clang at all. I currently have
code that uses DIBuilder directly...

--
Mats

On 30 March 2016 at 04:15, Eric Christopher via cfe-dev <
cfe-dev at lists.llvm.org> wrote:

>
>
> On Tue, Mar 29, 2016 at 8:11 PM Peter Collingbourne <peter at pcc.me.uk>
> wrote:
>
>> On Tue, Mar 29, 2016 at 7:43 PM, Eric Christopher <echristo at gmail.com>
>> wrote:
>>
>>>
>>>
>>> On Tue, Mar 29, 2016 at 7:31 PM Peter Collingbourne <peter at pcc.me.uk>
>>> wrote:
>>>
>>>> Thanks for sharing this. Mostly seems like a reasonable plan to me. A
>>>> few comments below.
>>>>
>>>>
>>> Thanks Peter!
>>>
>>>
>>>> On Tue, Mar 29, 2016 at 6:00 PM, Eric Christopher via cfe-dev <
>>>> cfe-dev at lists.llvm.org> wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> This is something that's been talked about for some time and it's
>>>>> probably time to propose it.
>>>>>
>>>>> The "We" in this document is everyone on the cc line plus me.
>>>>>
>>>>> Please go ahead and take a look.
>>>>>
>>>>> Thanks!
>>>>>
>>>>> -eric
>>>>>
>>>>>
>>>>> Objective (and TL;DR)
>>>>> =================
>>>>>
>>>>> Migrate debug type information generation from the backends to the
>>>>> front end.
>>>>>
>>>>> This will enable:
>>>>> 1. Separation of concerns and maintainability: LLVM shouldn’t have to
>>>>> know about C preprocessor macros, Obj-C properties, or extensive details
>>>>> about debug information binary formats.
>>>>> 2. Performance: Skipping a serialization should speed up normal
>>>>> compilations.
>>>>> 3. Memory usage: The DI metadata structures are smaller than they
>>>>> were, but are still fairly large and pointer heavy.
>>>>>
>>>>> Motivation
>>>>> ========
>>>>>
>>>>> Currently, types in LLVM debug info are described by the DIType class
>>>>> hierarchy. This hierarchy evolved organically from a more flexible
>>>>> sea-of-nodes representation into what it is today - a large, only somewhat
>>>>> format neutral representation of debug types. Making this more format
>>>>> neutral will only increase the memory use - and for no reason as type
>>>>> information is static (or nearly so). Debug formats already have a memory
>>>>> efficient serialization (their own binary format), so we should support a
>>>>> front end emitting type information with sufficient representation to allow
>>>>> the backend to emit debug information based on the more normal IR features:
>>>>> functions, scopes, variables, etc.
>>>>>
>>>>> Scope/Impact
>>>>> ===========
>>>>>
>>>>> This is going to involve large scale changes across both LLVM and
>>>>> clang. This will also affect any out-of-tree front ends, however, we expect
>>>>> the impact to be on the order of a large API change rather than needing
>>>>> massive infrastructure changes.
>>>>>
>>>>> Related work
>>>>> ==========
>>>>>
>>>>> This is related to the efforts to support CodeView in LLVM and clang
>>>>> as well as efforts to reduce overall memory consumption when compiling with
>>>>> debug information enabled; in particular, efforts to prune LTO memory usage.
>>>>>
>>>>>
>>>>> Concerns
>>>>> ========
>>>>>
>>>>>
>>>>> We need a good story for transitioning all the debug info testcases in
>>>>> the backend without giving up coverage and/or readability. David believes
>>>>> he has a plan here.
>>>>>
>>>>> Proposal
>>>>> =======
>>>>>
>>>>> Short version
>>>>> -----------------
>>>>>
>>>>> 1. Split the DIBuilder API into Types (+Macros, Imports, …) and Line
>>>>> Table.
>>>>> 2. Split the clang CGDebugInfo API into Types and Line Table to match.
>>>>> 3. Add an LLVM DWARF emission library similar to the existing CodeView
>>>>> one.
>>>>> 4. Migrate the Types API into a clang internal API taking clang AST
>>>>> structures and use the LLVM binary emission libraries to produce type
>>>>> information.
>>>>> 5. Remove the old binary emission out of LLVM.
>>>>>
>>>>>
>>>>> Questions/Thoughts/Elaboration
>>>>> -------------------------------------------
>>>>>
>>>>> Splitting the DIBuilder API
>>>>> ~~~~~~~~~~~~~~~~~~~~
>>>>> Will DISubprogram be part of both?
>>>>>    * We should split it in two: Full declarations with type and a
>>>>> slimmed down version with an abstract origin.
>>>>>
>>>>> How will we reference types in the DWARF blob?
>>>>>    * ODR types can be referenced by name
>>>>>    * Non-ODR types by full DWARF hash
>>>>>    * Each type can be a pair (tuple) of identifier (DITypeRef today)
>>>>> and blob.
>>>>>    * For versions before DWARF4 we can emit each type as a unit (but
>>>>> not as a DWARF Type Unit) and use references and module relocations for
>>>>> the offsets. (See below)
>>>>>
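As a rough illustration of the referencing scheme quoted above, here is a minimal Python sketch of types stored as (identifier, blob) pairs and uniqued across modules by identifier alone. The names `TypeRecord`, `make_type` and `merge_modules` are hypothetical and not part of any LLVM API, the ODR name and blob bytes are made up, and MD5 over the raw blob is only a stand-in for the "full DWARF hash" mentioned in the RFC.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TypeRecord:
    """One (identifier, blob) pair, mirroring the type tuple in the RFC."""
    identifier: str
    blob: bytes


def make_type(blob: bytes, odr_name: Optional[str] = None) -> TypeRecord:
    # ODR types are keyed by name; all other types by a hash of their contents.
    # Assumption: MD5 of the raw bytes stands in for the real DWARF hash.
    if odr_name is not None:
        return TypeRecord(odr_name, blob)
    return TypeRecord(hashlib.md5(blob).hexdigest(), blob)


def merge_modules(*modules):
    # Link type tables by identifier only; the blobs are never parsed.
    merged = {}
    for module in modules:
        for record in module:
            merged.setdefault(record.identifier, record)
    return merged


# Two translation units that share one ODR type and each have one local type.
tu1 = [make_type(b"\x01foo", odr_name="_ZTS3Foo"), make_type(b"\x02anon")]
tu2 = [make_type(b"\x01foo", odr_name="_ZTS3Foo"), make_type(b"\x03other")]
merged = merge_modules(tu1, tu2)
```

The point of the sketch is that the merge step only compares identifiers, which is what lets a module linker stay unaware of the debug format.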
>>>>> How will we handle references in DWARF2 or global relocations for
>>>>> non-type template parameters?
>>>>>    * We can use a “relocation” metadata as part of the format.
>>>>>    * This is representable as a tuple holding the DIType and the
>>>>> offset within the DIBlob at which to write the final relocation/offset
>>>>> for the reference at emission time.
>>>>>
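The "relocation" tuple described above could work roughly as in the following sketch: each blob carries a list of (referenced type, offset within the blob) entries, and the emitter patches those slots once final positions are known. Everything here is an assumption made for illustration: the `Relocation` and `emit` names, the fixed 4-byte little-endian slot, and the example bytes are not taken from the RFC.

```python
import struct
from dataclasses import dataclass


@dataclass
class Relocation:
    target_type: str  # identifier of the referenced type (the "DIType")
    blob_offset: int  # where in the referring blob to write the final offset


def emit(blobs, relocations):
    """Concatenate blobs in order, then patch every recorded slot.

    blobs:       dict of identifier -> bytes (insertion order is layout order)
    relocations: dict of identifier -> list of Relocation
    """
    section = bytearray()
    final_offset = {}
    for ident, blob in blobs.items():
        final_offset[ident] = len(section)
        section += blob
    for ident, relocs in relocations.items():
        base = final_offset[ident]
        for reloc in relocs:
            # Assumed slot format: 4-byte little-endian section offset.
            struct.pack_into("<I", section, base + reloc.blob_offset,
                             final_offset[reloc.target_type])
    return bytes(section), final_offset


# A pointer-like type whose blob has a 4-byte hole at offset 1 referring to "int".
blobs = {"ptr": b"\x0f" + b"\x00" * 4, "int": b"\x24\x04\x05"}
relocs = {"ptr": [Relocation("int", 1)]}
section, offsets = emit(blobs, relocs)
```

Because the offsets are only resolved in `emit`, the blobs themselves can be merged or reordered beforehand without being rewritten.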
>>>>> Why break up the types at all?
>>>>>    * To enable non-debug format aware linking and type uniquing for
>>>>> LTO that won’t be huge in size. We break up the types so we don’t need to
>>>>> parse debug information to link two modules together efficiently.
>>>>>
>>>>
>>>> How do you plan to handle abbreviations? You wouldn't necessarily be
>>>> able to embed them directly in the blob, as when doing LTO each compilation
>>>> unit would have its own set of abbreviations. I suppose you could
>>>> treat them as a special sort of reference to an abbreviation
>>>> table entry, or maybe pre-allocate them in the frontend (though that would
>>>> complicate cross-frontend LTO), but I'm curious what you have in mind.
>>>>
>>>
>>> Thanks for reminding me, I knew I was forgetting something I'd talked
>>> about when writing all of this down. :)
>>>
>>> Basically, to handle abbreviations you can treat them similarly to types
>>> by creating a blob with an index/hash/etc and then reference that as part
>>> of the type tuple, e.g.:
>>>
>>> $1 = { DIAbbrev: 0x1234, DIBlob: <blah> }
>>> $2 = { DIType: <ID>, DIAbbrev: $1, DIBlob: <blah> }
>>>
>>> and keep them uniqued during emission and remember to merge these as
>>> well during module merge time.
>>>
>>
>> Makes sense, but wouldn't you need multiple abbreviations for each
>> DIType, in order to represent DITypes formed of multiple DIEs (e.g. enums,
>> records)?
>>
>> Maybe something like this would work:
>>
>> $1 = { DIAbbrev: 0x1234, DIBlob: DW_TAG_enumeration_type<blah> }
>> $2 = { DIAbbrev: 0x5678, DIBlob: DW_TAG_enumerator<blah> }
>> $3 = { DIType: <ID>, DIAbbrev: [(0, $1), (8, $2), (16, $2)], DIBlob: <8
>> bytes of DW_TAG_enumeration_type attrs><8 bytes of DW_TAG_enumerator
>> attrs><8 bytes of DW_TAG_enumerator attrs><0> }
>>
>> ?
>>
>
> *nod* That (or something similar) will work.
>
> -eric
>
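To make the abbreviation scheme agreed on above a bit more concrete, here is a small Python sketch in which each type entry lists (offset, abbreviation) pairs, one per DIE in its blob, and a merge step assigns a single final abbreviation code per distinct abbreviation across modules. The class and function names are hypothetical, the type identifiers and blob contents are placeholders, and numbering codes from 1 reflects that DWARF reserves abbreviation code 0 for null entries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Abbrev:
    key: int     # the index/hash from the thread, e.g. 0x1234
    blob: bytes  # placeholder for the encoded abbreviation declaration


@dataclass
class TypeEntry:
    type_id: str
    abbrevs: list  # (offset into blob, Abbrev) pairs, one per DIE
    blob: bytes    # attribute bytes for all DIEs, back to back


def merge_abbrevs(modules):
    """Assign one final abbreviation code per distinct Abbrev across modules."""
    codes = {}
    for module in modules:
        for entry in module:
            for _offset, abbrev in entry.abbrevs:
                codes.setdefault(abbrev, len(codes) + 1)  # codes start at 1
    return codes


enum_abbrev = Abbrev(0x1234, b"DW_TAG_enumeration_type")
enumerator_abbrev = Abbrev(0x5678, b"DW_TAG_enumerator")

# An enum with two enumerators: three 8-byte DIEs plus a 1-byte terminator.
color = TypeEntry("Color",
                  [(0, enum_abbrev), (8, enumerator_abbrev), (16, enumerator_abbrev)],
                  bytes(25))
# A second module that builds equal abbreviations independently.
shape = TypeEntry("Shape",
                  [(0, Abbrev(0x1234, b"DW_TAG_enumeration_type")),
                   (8, Abbrev(0x5678, b"DW_TAG_enumerator"))],
                  bytes(17))

codes = merge_abbrevs([[color], [shape]])
```

Keeping the abbreviations outside the blob is what allows two modules with different local abbreviation tables to be merged without touching the attribute bytes.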
>
>
>>
>>
>>>
>>>>
>>>>> Any other concerns there?
>>>>>    * Debug information without type units might be slightly larger in
>>>>> this scheme due to parents being duplicated (declarations and abstract
>>>>> origin, not full parents). It may be possible to extend dsymutil/etc to
>>>>> merge all siblings into a common parent. Open question for better ways to
>>>>> solve this.
>>>>>
>>>>
>>>> When we were thinking about teaching the backend to produce blobs from
>>>> IR metadata we were thinking about cases where the debug info emitter would
>>>> discover special member functions during IR traversal. I guess since we're
>>>> moving all of that to the frontend we can just ask the frontend directly
>>>> which special members are needed on the class. That solves the problem for
>>>> a single translation unit. But what do you plan to do in the multiple
>>>> translation unit case where two TUs declare different special members on a
>>>> class? Would it be fine to just emit the two definitions and let the
>>>> debugger sort it out? I guess this is the type of thing that debuggers
>>>> normally deal with in the non-LTO case, so I suppose so?
>>>>
>>>
>>> Pretty much. This is one area where I have... disagreements with the
>>> DWARF committee and I don't think there's anything else we can do here. TBH
>>> right now I think we'd have issues with type units and special member
>>> functions since we're using ODR-ness to unique.
>>>
>>> -eric
>>>
>>>
>>>>
>>>>
>>>>> How should we handle DWARF5/Apple Accelerator Tables?
>>>>>    * Thoughts:
>>>>>    * We can parse the DWARF in the back end and generate them.
>>>>>    * We can emit in the front end for the base case of non-LTO (with
>>>>> help from the backend for relocation aspects).
>>>>>    * We can use dsymutil on LTO debug information to generate them.
>>>>>
>>>>> Why isn’t this a more detailed spec?
>>>>>    * Mostly because we’ve thought about the issues, but we can’t plan
>>>>> for everything during implementation.
>>>>>
>>>>>
>>>>> Future work
>>>>> ----------------
>>>>>
>>>>> Not contained as part of this, but an obvious future direction is that
>>>>> the Module linker could grow support for debug aware linking. Then we can
>>>>> have all of the type information for a single translation unit in a single
>>>>> blob and use the debug aware linking to handle merging types.
>>>>>
>>>>> _______________________________________________
>>>>> cfe-dev mailing list
>>>>> cfe-dev at lists.llvm.org
>>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Peter
>>>>
>>>
>>
>>
>> --
>> Peter
>>
>