[LLVMdev] RFC: Metadata attachments to function definitions
Duncan P. N. Exon Smith
dexonsmith at apple.com
Mon Apr 20 10:44:57 PDT 2015
> On 2015-Apr-17, at 19:59, David Blaikie <dblaikie at gmail.com> wrote:
>
> On Fri, Apr 17, 2015 at 7:15 PM, Duncan P. N. Exon Smith
> <dexonsmith at apple.com> wrote:
>>
>>> On 2015 Apr 15, at 10:06, David Blaikie <dblaikie at gmail.com> wrote:
>>>
>>>
>>>
>>> On Tue, Apr 14, 2015 at 10:59 PM, Duncan P. N. Exon Smith <dexonsmith at apple.com> wrote:
>>>
>>>> On 2015 Apr 14, at 21:46, David Blaikie <dblaikie at gmail.com> wrote:
>>>>
>>>> On Tue, Apr 14, 2015 at 9:33 PM, Duncan P. N. Exon Smith
>>>> <dexonsmith at apple.com> wrote:
>>>>>
>>>>> `Function` definitions should support `MDNode` attachments, with a
>>>>> similar syntax to instructions:
>>>>>
>>>>> define void @foo() nounwind !attach !0 {
>>>>> unreachable
>>>>> }
>>>>> !0 = !{}
>>>>>
>>>>> Attachments wouldn't be allowed on declarations, just definitions.
>>>>>
>>>>> There are two open problems this can help with:
>>>>>
>>>>> 1. For PGO, we need somewhere to attach the function entry count.
>>>>> Attaching to the function definition is a simple solution.
>>>>
>>>> No comment - can't say I know anything about that.
>>>>
>>>>>
>>>>> define void @foo() !prof !0 {
>>>>> unreachable
>>>>> }
>>>>> !0 = !{i32 987}
>>>>>
>>>>> 2. In debug info, we repeatedly build up a map from `Function` to the
>>>>> canonical `MDSubrogram` for it.
>>>>
>>>> Sounds great - I'd imagine this working somewhat like the way I've
>>>> made implicit special members & other non-standard members of class
>>>> types work in the debug info metadata, which is to say that the
>>>> children reference the parent, but the parent doesn't reference the
>>>> children (in this case, that would mean things like comdat folding in
>>>> LTO would 'just work' - whichever function we picked would keep its
>>>> debug info and attachment to its CU - the other CU would just appear
>>>> not to have that function anymore - we might need a special case for
>>>> comdat folding where one copy has debug info and another copy doesn't,
>>>> make sure we move the metadata over if we're picking one without debug
>>>> info).
>>>>
>>>> Though that will mean that whenever we want to walk all the
>>>> subprograms of a CU, we'd have to build it by walking all the
>>>> Functions in a module - it might be worth checking if/when/where we
>>>> "need" to do that - I suspect it's rare and maybe can be made entirely
>>>> unnecessary.
>>>
>>> I *think* we only do this once, and it's (ironically) to create the
>>> "Subprogram -> CompileUnit" map.
>>>
>>> Right, I thought that might be the case.
>>>
>>> That highlights a problem that my proposal doesn't solve on its own.
>>> While this proposal creates a "Function -> Subprogram" map, there still
>>> isn't a "Subprogram -> CompileUnit" map -- that would still (for now)
>>> be stored implicitly via `MDCompileUnit::getSubprograms()`.
>>>
>>> I guess this is (also?) what I was thinking about.
>>>
>>> Why isn't this map encoded in the `scope:` chain? Because many of the
>>> `scope:` chains currently terminate at `MDFile`s or `null` instead of
>>> `MDCompileUnit`s. Why? Because LTO type uniquing needs scope chains
>>> to be identical.
>>>
>>> Ah, right.
>>>
>>> (side note: sometimes need to end in MDFile or we might need an equivalent of DILexicalBlockFile for the CU - to switch files within the same CU (things defined in headers, etc))
>>
>> Ah, okay. I thought we could just replace them with pointers to the
>> compile unit. Something like `DIFileScope` with `scope:` and
>> `file:` fields would probably work? (Which means I shouldn't have
>> merged the two types of "file" nodes when I introduced the new
>> hierarchy. Boo.)
>
> I'd have to double-check - maybe this is a non-issue & I've
> misremembered/misunderstood things here (DILexicalBlockFile is needed
> because we don't put file info on the DebugLocs to keep them small -
> maybe the MDSubprograms (& MDNamespace, etc) should/could/(already
> does?) just have a file attribute directly?)
Yes, they do already have a file attribute. Is that sufficient?
>
>>
>>>
>>> (I have a vague plan for fixing this, too: (1) move
>>> ownership of Metadata to the Module (so metadata isn't leaked by
>>> `lto_module_dispose()`), (2) add a "StringRef -> MDType" map to the
>>> Module (only for types with an ODR-style UUID), (3) delete the concept
>>> of `MDString`-based type refs and update lib/Linker to rely on the
>>> "StringRef -> MDType" map in the destination Module, (4) make all
>>> `scope:` chains terminate at an `MDCompileUnit` and drop "scope"-ness
>>> of `MDFile`, and (5) finally drop the `subprograms:` field from
>>> `MDCompileUnit`. But I'm not confident about step 4 yet.)
>>>
>>> Sounds plausible.
>>>
>>> (side note: any plans to do the scope-based textual IR that was thrown around during the prototype stages? It'd be another readability (& writability) improvement to not have to manually walk all the scope chains.
>>
>> Vague plans, but yes. Doing it the way I want is blocked on
>> removing type refs (maybe among other things?).
>>
>>> Anyway, maybe after most/all the functionality improvements are made we could do a maintainability pass & see whether we could get to a practically writable/readable format... I'm still not sure how likely that is, given the fact that debug info necessarily /describes/ the source the user wrote, so it's always going to be more verbose, but maybe it's achievable)
>>
>> Yeah, I think this is a great idea.
>>
>>>
>>>
>>>>
>>>>> Keeping this mapping accurate takes
>>>>> subtle logic in `lib/Linker` (see PR21910/PR22792) and it's
>>>>> expensive to compute and maintain. Attaching it directly to the
>>>>> `Function` designs away the problem.
>>>>>
>>>>> define void @foo() !dbg !0 {
>>>>> unreachable
>>>>> }
>>>>> !0 = !MDSubprogram(name: "foo", function: void ()* @foo)
>>>>>
>>>>> Thoughts?
>>>>>
>>>>> Moving onto implementation, I'd provide the same generic API that
>>>>> `Instruction` has, and wouldn't bother with the "fast path" API for
>>>>> `!dbg`. Moreover, the generic path wouldn't be slow. Since there are
>>>>> fewer functions than instructions, we can afford to store the
>>>>> attachments directly on the `Function` instead of off in the
>>>>> `LLVMContext`.
>>>>>
>>>>> It's not clear to me just how precious memory is in `Function`; IIRC
>>>>> it's sitting at 168B right now for x86-64. IMO, a `SmallVector<..., 1>`
>>>>> -- cost of 64B -- seems fine. I'll start with this if I don't hear any
>>>>> objections; we can optimize later if necessary.
>>>>>
>>>>> Otherwise, I could hack together a custom vector-like object with the
>>>>> "small string" optimization.
>>>>
>>>> Not important, but I'm missing something: what're you picturing that
>>>> would be different from/better than SmallVector?
>>>>
>>>
>>> Data storage would be:
>>>
>>> struct Data {
>>> struct value_type {
>>> unsigned Tag;
>>> TrackingMDNodeRef MD;
>>> };
>>>
>>> unsigned Capacity;
>>> union {
>>> unsigned LargeSize;
>>> unsigned SmallTag;
>>> } Unsigned;
>>>
>>> AlignedCharArrayUnion<
>>> value_type *, // LargeArray
>>> TrackingMDNodeRef // SmallMD
>>>> Pointer;
>>> };
>>> static_assert(sizeof(Data) == sizeof(void *) + sizeof(unsigned) * 2,
>>> "Wrong size");
>>>
>>> Two advantages over `SmallVector<value_type, 1>`:
>>>
>>> - 32-bits each for size and capacity (instead of 64-bits).
>>> - Domain knowledge of `value_type` allows aggressive unions.
>>>
>>> Can't provide as much API as `std::vector<>`, but the API for metadata
>>> attachments is small and opaque:
>>>
>>> bool hasMetadata(unsigned Tag) const;
>>> MDNode *getMetadata(unsigned Tag) const;
>>> void setMetadata(unsigned Tag, const MDNode *MD);
>>> void getAllMetadata(
>>> SmallVectorImpl<std::pair<unsigned, const MDNode *>> &MDs) const;
>>>
>>>>> Cost would be 16B per `Function`, with the
>>>>> same malloc/heap costs as `SmallVector<..., 1>`. But I haven't seen a
>>>>> profile that suggests this would be worth the complexity...
>>>>>
>>>>> Any opinions?
More information about the llvm-dev
mailing list