[llvm] r187213 - Add preliminary support for hashing DIEs and breaking them into

Sun Jul 28 12:03:17 PDT 2013

>>> +/// Return true if the type is appropriately scoped to be contained inside
>>> +/// its own type unit.
>>> +static bool isTypeUnitScoped(DIType Ty) {
>>
>> I wonder whether there would be a way to do this by querying the
>> underlying clang::Type to test the same property (would probably be
>> more efficient & not require the slightly problematic
>> DIScope::getContext)
>
> We don't have access to the clang type here?

Oh, right, of course.

> Or did you want to raise
> this sort of thing up to the front end?

Not necessarily - given how far away it is.

> I've had thoughts of doing
> that, just adding a named md node for things that should be in a type
> unit.

What happens to named nodes when merging modules? That would probably
then require the frontend to figure out the "things we omit from a
type unit" & figure out where to put those instead (speaking of which
- we were discussing things that would be omitted from a type unit to
improve type equivalence (member function template specializations
(member types? implicit special member functions (move/copy
ctor/assign dtor)) - where are those things going to go, then? We want
to encode them somewhere), but would this solve or otherwise address
Manman's type uniquing on module merge? If uniqable types were named
metadata we would just store references to that named metadata from
the rest of the debug info metadata.

(not sure how we'd deal with the decl/definition issue)

> It's still evolving so I figured I'd do this for now and move it
> if it made sense.

Yep.

>>> +// FIXME: These are copied and only slightly modified out of LEB128.h.
>>
>> What's the slight modification? What sort of work will be involved in
>> generalizing the two uses so they can use common code?
>>
>>> +
>>> +/// \brief Adds the unsigned in \p N to the hash in \p Hash. This also encodes
>>> +/// the unsigned as a ULEB128.
>>> +static void addULEB128ToHash(MD5 &Hash, uint64_t Value) {
>>> +  DEBUG(dbgs() << "Adding ULEB128 " << Value << " to hash.\n");
>>> +  do {
>>> +    uint8_t Byte = Value & 0x7f;
>>> +    Value >>= 7;
>>> +    if (Value != 0)
>>> +      Byte |= 0x80; // Mark this byte to show that more bytes will follow.
>>> +    Hash.update(Byte);
>
> This line is the modification. I'm open to ways to make this general :)

As we discussed offline - having these three versions in the same
place would be a good start. Then, in the absence of lambdas (or other
convenient ways to describe the "step" is a bit tricky/annoying - as
you said, the function's short enough we could just stamp out the body
of these 4 functions with a macro where the parameter is an expression
or statement of "what to do with this byte".

Also, there's another copy of such a function in MCAsmInfo.cpp, and
the copies in LEB128.h should not be static, just inline.

>
>>> +  // For each surrounding type or namespace...
>>
>> What surrounding things can we expect that aren't namespaces and types?
>
> The compile unit? But as you ask later, this is actually a quote.

For myself it wasn't obvious what the condition was avoiding - perhaps
it'd be helpful. If it's only the compile_unit, that should always be
the top level, so shouldn't we just be able to replace this condition
and the next one with "if it's not a compile unit, recurse" (or loop,
if we do it with a loop instead) - with an assert that it should be a
namespace, class, or struct? That seems a bit more
explicit/self-documenting.

>>> +  if (Tag != dwarf::DW_TAG_namespace && Tag != dwarf::DW_TAG_class_type &&
>>> +      Tag != dwarf::DW_TAG_structure_type)
>>> +    return;
>>> +
>>> +  // ... beginning with the outermost such construct...
>>> +  if (Parent->getParent() != NULL)
>>> +    addParentContextToHash(Hash, Parent->getParent());
>>
>> I'd sort of rather a loop than a recursion, though I realize the
>> recursion won't be terribly deep.
>>
>
> Heh. I'd originally written it that way

Any particular reason you chose to rewrite it recursively?

>
>>> +
>>> +  // Append the letter "C" to the sequence.
>>
>> No real explanation as to why "C" (I guess 'Context') but if these are
>> quotes from the DWARF standard, a reference to the chapter/verse at
>> the start of the sequence of comments would be nice.
>
> I do have chapter/section above the addDIEODRSignature function. I can
> probably break it down more.

Yeah, you could compare to how we write spec quotations in Clang for
C++ stuff. To take a random example, search for "expr.new]p15" in
SemaExprCXX.cpp.

>>> +  addLetterToHash(Hash, "C");
>>> +
>>> +  // Followed by the DWARF tag of the construct.
>>> +  addULEB128ToHash(Hash, Parent->getTag());
>>> +
>>> +  // Then the name, taken from the DW_AT_name attribute.
>>> +  StringRef Name = getDIEStringAttr(Parent, dwarf::DW_AT_name);
>>> +  if (!Name.empty())
>>> +    addStringToHash(Hash, Name);
>>
>> Did we want to push-back any on the issue of hashing null terminators?
>
> *shrug* The only reason I'm doing it at the moment is to test the
> algorithm against the gcc output, for type units and whole compile
> unit hashing there's no way we're going to match gcc anyhow (different
> inputs) so it's basically up to us whether or not we want to hash them
> at all.

That's true of type unit/whole CU hashing - but ODR hashing is the
inverse in the sense that when these things don't match, we'll produce
a warning to the user, so I assume we need to make them match between
compilers, right? (if TU and CU hashes don't match, the user just gets
fatter debug info - and, since the debug info is actually different,
there's nothing we can do about that)

If we can have different ODR hashes, then can we add a FIXME to remove
the null terminators once you're satisfied the algorithm is working
correctly?
(& if we can't have different ODR hashes, can we have a FIXME to fix
GCC/DWARF to ensure we don't hash them?)

>>> +    addParentContextToHash(Hash, Parent);
>>> +
>>> +  // Add the current DIE information.
>>> +
>>> +  // Add the DWARF tag of the DIE.
>>> +  addULEB128ToHash(Hash, Die->getTag());
>>> +
>>> +  // Add the name of the type to the hash.
>>> +  addStringToHash(Hash, getDIEStringAttr(Die, dwarf::DW_AT_name));
>>
>> Rather than searching for names repeatedly (do we need to search for
>> other attributes by name?) would it be worth caching that attribute
>> specifically in the DIE? (just have a DIEString pointer in the DIE
>> specifically for the name) Probably not, I guess.
>
> Not really, and we're definitely going to be grabbing lots of
> attributes with string representations as we hash whole DIEs - and
> unfortunately it's a sorted list. :\

I guess you meant it's not sorted?

>>> +  // ... take the least significant 8 bytes and store those as the attribute.
>>> +  uint64_t Signature;
>>> +  memcpy(&Signature, &Result[8], 8);
>>
>> Any discussion on this with the DWARF committee? I assume computing a
>> big hash & then just taking some of the bytes isn't really the best
>> way to use a hashing algorithm.
>
> I got them to admit that they want the least significant 8 bytes
> rather than "the lower 8 bytes". But the field is fixed now at this
> point, it's pretty proof against collisions etc. That said, it bothers
> me a bit as well.

I wonder if there's a more efficient way to compute the hash if only
the lower 8 bytes are required in the end anyway - but it's not likely
to be worth the engineering effort to attempt to create such a thing.

>>> +  // For types that we'd like to move to type units or ODR check go ahead
>>> +  // and either move the types out or add the ODR attribute now.
>>> +  // FIXME: Do type splitting.
>>> +  for (unsigned i = 0, e = TypeUnits.size(); i != e; ++i) {
>>> +    MD5 Hash;
>>> +    DIE *Die = TypeUnits[i];
>>> +    // If we're in C++ and we want to generate the hash then go ahead and do
>>> +    // that now.
>>
>> That phrasing leads to the question "And if we're not in C++ and want
>> to generate a hash?" - my understanding is that we never want to
>> generate a hash when we're not in C++ (because C has no ODR-like
>> thing), but the comment reads a bit oddly if that's the intent. Not
>> sure of the right wording yet.
>
> If you come up with one... :)

// If we're dealing with a language with unique (ODR) types, hash any of those

(this isn't great in that it doesn't describe the "if we want to"
(which I suppose refers to the flag - which is only temporary) and C++
doesn't actually talk about "ODR types", but I think it still gets the
idea across)

>>> +; void foo(void) {
>>> +;   struct baz {};
>>> +;   baz b;
>>> +; }
>>> +; namespace echidna {
>>> +; namespace capybara {
>>> +; namespace mongoose {
>>
>> Got a little bored trying to find namespace names? There are lists of
>> metasyntactic variables to call on you know ;) (also, sometimes we use
>> a top-level namespace "test1", "test2", etc to separate test cases &
>> avoid their names polluting each other - though adding namespaces
>> isn't an irrelevant change in this test case)
>>
>
> Yeah, I wanted something I could recognize on its way out and figured
> I'd be a bit creative. Looking at test1..N gets old after a while :)

It can make it easier to follow examples (reason we have metasyntactic
variables is so readers don't get confused by wondering whether
there's extra semantics they're meant to be associating with the
identifiers), but it's hardly a dealbreaker in this instance.

>>> +; CHECK: DW_TAG_structure_type
>>> +; CHECK-NEXT: debug_str{{.*}}"bar"
>>> +; CHECK: DW_AT_GNU_odr_signature [DW_FORM_data8] (0x200520c0d5b90eff)
>>> +; CHECK: DW_TAG_namespace
>>> +; CHECK-NEXT: debug_str{{.*}}"echidna"
>>> +; CHECK: DW_TAG_namespace
>>> +; CHECK-NEXT: debug_str{{.*}}"capybara"
>>> +; CHECK: DW_TAG_namespace
>>> +; CHECK-NEXT: debug_str{{.*}}"mongoose"
>>> +; CHECK: DW_TAG_class_type
>>> +; CHECK-NEXT: debug_str{{.*}}"fluffy"
>>> +; CHECK: DW_AT_GNU_odr_signature [DW_FORM_data8]   (0x9a0124d5a0c21c52)
>>> +; CHECK: DW_TAG_structure_type
>>> +; CHECK-NEXT: debug_str{{.*}}"baz"
>>> +; CHECK-NOT: DW_AT_GNU_odr_signature
>>> +; FIXME: we may want to generate debug info for walrus, but still no hash.
>>
>> Why?
>
> It's in an anonymous namespace.

Again, could be helpful to include that in the comment somehow. That
almost seems bug-worthy - we have a global variable of that type yet
no debug info for it? That seems strange.