[llvm] r187213 - Add preliminary support for hashing DIEs and breaking them into

Sun Jul 28 13:16:36 PDT 2013

>> I've had thoughts of doing
>> that, just adding a named md node for things that should be in a type
>> unit.
>
> What happens to named nodes when merging modules? That would probably
> then require the frontend to figure out the "things we omit from a
> type unit" & figure out where to put those instead (speaking of which
> - we were discussing things that would be omitted from a type unit to
> improve type equivalence (member function template specializations
> (member types? implicit special member functions (move/copy
> ctor/assign dtor)) - where are those things going to go, then? We want
> to encode them somewhere), but would this solve or otherwise address
> Manman's type uniquing on module merge? If uniqable types were named
> metadata we would just store references to that named metadata from
> the rest of the debug info metadata.
>

The idea I had here was to put the types that we'd like to put into
type units into a named md node (which should have elements appended).
It won't solve the type uniquing on module merge problem, that still
needs the solution I outlined for her there. Anyhow, it's mostly idle
thoughts of direction.

> (not sure how we'd deal with the decl/definition issue)

That'd be harder.

>>
>> This line is the modification. I'm open to ways to make this general :)
>
> As we discussed offline - having these three versions in the same
> place would be a good start. Then, in the absence of lambdas (or other
> convenient ways to describe the "step" is a bit tricky/annoying - as
> you said, the function's short enough we could just stamp out the body
> of these 4 functions with a macro where the parameter is an expression
> or statement of "what to do with this byte".
>

*nod* I just thought adding MD5 to LEB128.h seemed a bit intrusive,
but I'm not wedded to keeping it apart.

> Also, there's another copy of such a function in MCAsmInfo.cpp, and
> the copies in LEB128.h should not be static, just inline.
>

Heh. Another copy! Good point there as well.

>>
>>>> +  // For each surrounding type or namespace...
>>>
>>> What surrounding things can we expect that aren't namespaces and types?
>>
>> The compile unit? But as you ask later, this is actually a quote.
>
> For myself it wasn't obvious what the condition was avoiding - perhaps
> it'd be helpful. If it's only the compile_unit, that should always be
> the top level, so shouldn't we just be able to replace this condition
> and the next one with "if it's not a compile unit, recurse" (or loop,
> if we do it with a loop instead) - with an assert that it should be a
> namespace, class, or struct? That seems a bit more
> explicit/self-documenting.

Oh, also functions. Missed that one in my description to you as well
which makes it more complicated. Sorry about that.

>>> I'd sort of rather a loop than a recursion, though I realize the
>>> recursion won't be terribly deep.
>>>
>>
>> Heh. I'd originally written it that way
>
> Any particular reason you chose to rewrite it recursively?

Made it more compact - though not terribly. I can undo it if you'd like.

>> I do have chapter/section above the addDIEODRSignature function. I can
>> probably break it down more.
>
> Yeah, you could compare to how we write spec quotations in Clang for
> C++ stuff. To take a random example, search for "expr.new]p15" in
> SemaExprCXX.cpp.
>

Nod. I've been meaning to do this here, just had some emergencies this week.

>>> Did we want to push-back any on the issue of hashing null terminators?
>>
>> *shrug* The only reason I'm doing it at the moment is to test the
>> algorithm against the gcc output, for type units and whole compile
>> unit hashing there's no way we're going to match gcc anyhow (different
>> inputs) so it's basically up to us whether or not we want to hash them
>> at all.
>
> That's true of type unit/whole CU hashing - but ODR hashing is the
> inverse in the sense that when these things don't match, we'll produce
> a warning to the user, so I assume we need to make them match between
> compilers, right? (if TU and CU hashes don't match, the user just gets
> fatter debug info - and, since the debug info is actually different,
> there's nothing we can do about that)
>

It helps to make them match ultimately here yes. Compatibility is nice.

> If we can have different ODR hashes, then can we add a FIXME to remove
> the null terminators once you're satisfied the algorithm is working
> correctly?
> (& if we can't have different ODR hashes, can we have a FIXME to fix
> GCC/DWARF to ensure we don't hash them?)

We can, but I'll caution you against too much reading into things on
the ODR hashing. It was a small gcc feature that used the hashing that
it didn't cost much additional to add, I don't want to get bogged down
on whether or not hashing the null terminator there is the right thing
to do. It's basically just a strict subset of the whole DIE and whole
CU hashing and an incremental change. The hashing of the 0 byte is
actually in the standard and I can get it changed, I'm just not really
sure it's worth it.

>> Not really, and we're definitely going to be grabbing lots of
>> attributes with string representations as we hash whole DIEs - and
>> unfortunately it's a sorted list. :\
>
> I guess you meant it's not sorted?

The ordering of DIEs to hash has a particular sort. The order that
they're added to the DIE is not sorted. We could, if we wanted, sort
things as they're added to the DIE.

>
>>>> +  // ... take the least significant 8 bytes and store those as the attribute.
>>>> +  uint64_t Signature;
>>>> +  memcpy(&Signature, &Result[8], 8);
>>>
>>> Any discussion on this with the DWARF committee? I assume computing a
>>> big hash & then just taking some of the bytes isn't really the best
>>> way to use a hashing algorithm.
>>
>> I got them to admit that they want the least significant 8 bytes
>> rather than "the lower 8 bytes". But the field is fixed now at this
>> point, it's pretty proof against collisions etc. That said, it bothers
>> me a bit as well.
>
> I wonder if there's a more efficient way to compute the hash if only
> the lower 8 bytes are required in the end anyway - but it's not likely
> to be worth the engineering effort to attempt to create such a thing.
>

No idea :)

>>>> +  // For types that we'd like to move to type units or ODR check go ahead
>>>> +  // and either move the types out or add the ODR attribute now.
>>>> +  // FIXME: Do type splitting.
>>>> +  for (unsigned i = 0, e = TypeUnits.size(); i != e; ++i) {
>>>> +    MD5 Hash;
>>>> +    DIE *Die = TypeUnits[i];
>>>> +    // If we're in C++ and we want to generate the hash then go ahead and do
>>>> +    // that now.
>>>
>>> That phrasing leads to the question "And if we're not in C++ and want
>>> to generate a hash?" - my understanding is that we never want to
>>> generate a hash when we're not in C++ (because C has no ODR-like
>>> thing), but the comment reads a bit oddly if that's the intent. Not
>>> sure of the right wording yet.
>>
>> If you come up with one... :)
>
> // If we're dealing with a language with unique (ODR) types, hash any of those
>
> (this isn't great in that it doesn't describe the "if we want to"
> (which I suppose refers to the flag - which is only temporary) and C++
> doesn't actually talk about "ODR types", but I think it still gets the
> idea across)
>

*shrug* Sure.

>>>> +; void foo(void) {
>>>> +;   struct baz {};
>>>> +;   baz b;
>>>> +; }
>>>> +; namespace echidna {
>>>> +; namespace capybara {
>>>> +; namespace mongoose {
>>>
>>> Got a little bored trying to find namespace names? There are lists of
>>> metasyntactic variables to call on you know ;) (also, sometimes we use
>>> a top-level namespace "test1", "test2", etc to separate test cases &
>>> avoid their names polluting each other - though adding namespaces
>>> isn't an irrelevant change in this test case)
>>>
>>
>> Yeah, I wanted something I could recognize on its way out and figured
>> I'd be a bit creative. Looking at test1..N gets old after a while :)
>
> It can make it easier to follow examples (reason we have metasyntactic
> variables is so readers don't get confused by wondering whether
> there's extra semantics they're meant to be associating with the
> identifiers), but it's hardly a dealbreaker in this instance.
>

Right. I know why they exist, but yeah, it's not really important here.

>>>> +; CHECK: DW_TAG_structure_type
>>>> +; CHECK-NEXT: debug_str{{.*}}"bar"
>>>> +; CHECK: DW_AT_GNU_odr_signature [DW_FORM_data8] (0x200520c0d5b90eff)
>>>> +; CHECK: DW_TAG_namespace
>>>> +; CHECK-NEXT: debug_str{{.*}}"echidna"
>>>> +; CHECK: DW_TAG_namespace
>>>> +; CHECK-NEXT: debug_str{{.*}}"capybara"
>>>> +; CHECK: DW_TAG_namespace
>>>> +; CHECK-NEXT: debug_str{{.*}}"mongoose"
>>>> +; CHECK: DW_TAG_class_type
>>>> +; CHECK-NEXT: debug_str{{.*}}"fluffy"
>>>> +; CHECK: DW_AT_GNU_odr_signature [DW_FORM_data8]   (0x9a0124d5a0c21c52)
>>>> +; CHECK: DW_TAG_structure_type
>>>> +; CHECK-NEXT: debug_str{{.*}}"baz"
>>>> +; CHECK-NOT: DW_AT_GNU_odr_signature
>>>> +; FIXME: we may want to generate debug info for walrus, but still no hash.
>>>
>>> Why?
>>
>> It's in an anonymous namespace.
>
> Again, could be helpful to include that in the comment somehow. That
> almost seems bug-worthy - we have a global variable of that type yet
> no debug info for it? That seems strange.

We may wish to emit debug info for w (and the type walrus), we
definitely don't want to emit a hash. I'm not sure which thing you
considered bug worthy. The former is, the latter isn't :)

-eric