[llvm] r187213 - Add preliminary support for hashing DIEs and breaking them into

Fri Jul 26 13:49:34 PDT 2013

>> +  /// DIEString - A container for string values.
>> +  ///
>> +  class DIEString : public DIEValue {
>
> The addition of DIEString probably could've been a separate commit,
> but it's fairly non-invasive anyway.

Yep. That was my thought.

>> +/// Return true if the type is appropriately scoped to be contained inside
>> +/// its own type unit.
>> +static bool isTypeUnitScoped(DIType Ty) {
>
> I wonder whether there would be a way to do this by querying the
> underlying clang::Type to test the same property (would probably be
> more efficient & not require the slightly problematic
> DIScope::getContext)

We don't have access to the clang type here? Or did you want to raise
this sort of thing up to the front end? I've had thoughts of doing
that, just adding a named md node for things that should be in a type
unit. It's still evolving so I figured I'd do this for now and move it
if it made sense.

>> +  // Iterate through all the attributes until we find the one we're
>> +  // looking for, if we can't find it return an empty string.
>> +  for (size_t i = 0; i < Values.size(); ++i) {
>> +    if (Abbrevs.getData()[i].getAttribute() == Attr) {
>> +      DIEValue *V = Values[i];
>> +      assert(isa<DIEString>(V) && "String requested. Not a string.");
>> +      DIEString *S = cast<DIEString>(V);
>> +      return S->getString();
>> +    }
>> +  }
>> +  return StringRef("");
>
> return ""; should be fine, or return StringRef(); (StringRef has no
> "null" state so far as I know - just zero-length)
>

Yes, agreed. Honestly (as you mention later), it should just be an
Optional, but I hadn't gotten there yet as the API is still in
progress.

>> +}
>> +
>> +/// \brief Adds the string in \p Str to the hash in \p Hash. This also hashes
>> +/// a trailing NULL with the string.
>> +static void addStringToHash(MD5 &Hash, StringRef Str) {
>> +  DEBUG(dbgs() << "Adding string " << Str << " to hash.\n");
>> +  HashValue SVal((const uint8_t *)Str.data(), Str.size());
>> +  const uint8_t NB = '\0';
>> +  HashValue NBVal((const uint8_t *)&NB, 1);
>> +  Hash.update(SVal);
>> +  Hash.update(NBVal);
>> +}
>> +
>> +/// \brief Adds the character string in \p Str to the hash in \p Hash. This does
>> +/// not hash a trailing NULL on the character.
>> +static void addLetterToHash(MD5 &Hash, StringRef Str) {
>
> Seems like a strange name for it - since a StringRef contains things
> other than letters and contains more or less than one letter...
>

I've decided to nuke it :)

> Also wondering whether having StringRef APIs for hash would make sense
> since these two utility functions (at least the "hash the bytes of
> this StringRef") seem like pretty common/basic functionality, having
> to go through c-style casts to use them seems a bit lame.
>
> I would expect:
> Hash.Update(Str);
> Hash.Update('\0');
>
> for the "addStringToHash" (& would call it
> "addNullTerminatedStringToHash" or just write those two lines wherever
> it matters - it's probably easier to eyeball the fact that it's null
> terminating with those two lines than to pick a sufficiently
> descriptive function name)
>

Makes sense...

> & omit this addLetterToHash function entirely in favor of an update
> function that takes a StringRef (or, if it's really not the purview of
> the MD5 class to ever consider characters, maybe a free function
> update(MD5&, StringRef) in a more general place)
>
> This comes back to hash API design - didn't Chandler have some
> proto-C++-standardizable hash API (with function or operator
> overloading extensibility to allow for hashing of more/new/different
> things). Any reason MD5 wouldn't fit into that? (I vaguely remember
> mentioning this before, but don't recall the answer)
>

I think he didn't want it overlapping, but I could be wrong. Was going
to add things to the hash as it made sense. In this case I've waffled
back and forth a few times on adding things to the hash class or doing
them out of line. A StringRef one does make sense, but the occasional
lack of trailing \0 makes it more complicated.

>> +// FIXME: These are copied and only slightly modified out of LEB128.h.
>
> What's the slight modification? What sort of work will be involved in
> generalizing the two uses so they can use common code?
>
>> +
>> +/// \brief Adds the unsigned in \p N to the hash in \p Hash. This also encodes
>> +/// the unsigned as a ULEB128.
>> +static void addULEB128ToHash(MD5 &Hash, uint64_t Value) {
>> +  DEBUG(dbgs() << "Adding ULEB128 " << Value << " to hash.\n");
>> +  do {
>> +    uint8_t Byte = Value & 0x7f;
>> +    Value >>= 7;
>> +    if (Value != 0)
>> +      Byte |= 0x80; // Mark this byte to show that more bytes will follow.
>> +    Hash.update(Byte);

This line is the modification. I'm open to ways to make this general :)

>> +  // For each surrounding type or namespace...
>
> What surrounding things can we expect that aren't namespaces and types?

The compile unit? But as you ask later, this is actually a quote.

>> +  if (Tag != dwarf::DW_TAG_namespace && Tag != dwarf::DW_TAG_class_type &&
>> +      Tag != dwarf::DW_TAG_structure_type)
>> +    return;
>> +
>> +  // ... beginning with the outermost such construct...
>> +  if (Parent->getParent() != NULL)
>> +    addParentContextToHash(Hash, Parent->getParent());
>
> I'd sort of rather a loop than a recursion, though I realize the
> recursion won't be terribly deep.
>

Heh. I'd originally written it that way

>> +
>> +  // Append the letter "C" to the sequence.
>
> No real explanation as to why "C" (I guess 'Context') but if these are
> quotes from the DWARF standard, a reference to the chapter/verse at
> the start of the sequence of comments would be nice.

I do have chapter/section above the addDIEODRSignature function. I can
probably break it down more.

>
>> +  addLetterToHash(Hash, "C");
>> +
>> +  // Followed by the DWARF tag of the construct.
>> +  addULEB128ToHash(Hash, Parent->getTag());
>> +
>> +  // Then the name, taken from the DW_AT_name attribute.
>> +  StringRef Name = getDIEStringAttr(Parent, dwarf::DW_AT_name);
>> +  if (!Name.empty())
>> +    addStringToHash(Hash, Name);
>
> Did we want to push-back any on the issue of hashing null terminators?

*shrug* The only reason I'm doing it at the moment is to test the
algorithm against the gcc output, for type units and whole compile
unit hashing there's no way we're going to match gcc anyhow (different
inputs) so it's basically up to us whether or not we want to hash them
at all.

> That's presumably going to lower the entropy of the hash input
> somewhat & not really add anything. I guess it was just a convenience
> in GCC that lead to this being the chosen thing to do?

Almost assuredly.

>
>> +}
>> +
>> +/// This is based on the type signature computation given in section 7.27 of the
>> +/// DWARF4 standard. It is the md5 hash of a flattened description of the DIE.
>> +static void addDIEODRSignature(MD5 &Hash, CompileUnit *CU, DIE *Die) {
>> +
>> +  // Add the contexts to the hash.
>> +  DIE *Parent = Die->getParent();
>> +  if (Parent)
>
> Generally we put the variable in the condition like this:
>
> if (DIE *Parent = Die->getParent())
>
> to reduce scope.

Ah yes, thanks.

>
>> +    addParentContextToHash(Hash, Parent);
>> +
>> +  // Add the current DIE information.
>> +
>> +  // Add the DWARF tag of the DIE.
>> +  addULEB128ToHash(Hash, Die->getTag());
>> +
>> +  // Add the name of the type to the hash.
>> +  addStringToHash(Hash, getDIEStringAttr(Die, dwarf::DW_AT_name));
>
> Rather than searching for names repeatedly (do we need to search for
> other attributes by name?) would it be worth caching that attribute
> specifically in the DIE? (just have a DIEString pointer in the DIE
> specifically for the name) Probably not, I guess.

Not really, and we're definitely going to be grabbing lots of
attributes with string representations as we hash whole DIEs - and
unfortunately it's a sorted list. :\

>
>> +
>> +  // Now get the result.
>> +  MD5::MD5Result Result;
>> +  Hash.final(Result);
>
> Weird name for the function - shoudln't it be a verb?

Fair question, I'll fix.

>
>> +
>> +  // ... take the least significant 8 bytes and store those as the attribute.
>> +  uint64_t Signature;
>> +  memcpy(&Signature, &Result[8], 8);
>
> Any discussion on this with the DWARF committee? I assume computing a
> big hash & then just taking some of the bytes isn't really the best
> way to use a hashing algorithm.

I got them to admit that they want the least significant 8 bytes
rather than "the lower 8 bytes". But the field is fixed now at this
point, it's pretty proof against collisions etc. That said, it bothers
me a bit as well.

>
>> +
>> +  // FIXME: This should be added onto the type unit, not the type, but this
>> +  // works as an intermediate stage.
>> +  CU->addUInt(Die, dwarf::DW_AT_GNU_odr_signature, dwarf::DW_FORM_data8,
>> +              Signature);
>> +}
>> +
>> +/// Return true if the current DIE is contained within an anonymous namespace.
>> +static bool isContainedInAnonNamespace(DIE *Die) {
>> +  DIE *Parent = Die->getParent();
>> +
>> +  while (Parent) {
>> +    if (Die->getTag() == dwarf::DW_TAG_namespace &&
>> +        getDIEStringAttr(Die, dwarf::DW_AT_name) == "")
>
> Should we be representing anonymous namespaces with no name attribute
> at all rather than a zero-length string...
>
> Oh, but maybe we are & that's how you're returning invalid from
> getDIEStringAttr. Hrm. Personally I kind of like the difference
> between "empty string" (which I rarely think should be a special case)
> and "no string at all", but I know not everyone agrees & lots of
> people are perfectly happy for "empty string" and "no string at all"
> to be the same thing. (I certainly don't like "no string at all" to be
> built in to string types like StringRef - but instead prefer it to be
> added orthogonally with Optional<StringRef> or similar)

Agreed. And you've pretty much outlined my entire thought process on it.

>> +  // For types that we'd like to move to type units or ODR check go ahead
>> +  // and either move the types out or add the ODR attribute now.
>> +  // FIXME: Do type splitting.
>> +  for (unsigned i = 0, e = TypeUnits.size(); i != e; ++i) {
>> +    MD5 Hash;
>> +    DIE *Die = TypeUnits[i];
>> +    // If we're in C++ and we want to generate the hash then go ahead and do
>> +    // that now.
>
> That phrasing leads to the question "And if we're not in C++ and want
> to generate a hash?" - my understanding is that we never want to
> generate a hash when we're not in C++ (because C has no ODR-like
> thing), but the comment reads a bit oddly if that's the intent. Not
> sure of the right wording yet.

If you come up with one... :)

>> +; struct bar {};
>> +; struct bar b;
>
> Weird to use a prefix on a variable declaration in C++ code (ie: use
> "bar b;" rather than "struct bar b;").
>

Ah, true. Old habit. I'll fix.

>> +; void foo(void) {
>> +;   struct baz {};
>> +;   baz b;
>> +; }
>> +; namespace echidna {
>> +; namespace capybara {
>> +; namespace mongoose {
>
> Got a little bored trying to find namespace names? There are lists of
> metasyntactic variables to call on you know ;) (also, sometimes we use
> a top-level namespace "test1", "test2", etc to separate test cases &
> avoid their names polluting each other - though adding namespaces
> isn't an irrelevant change in this test case)
>

Yeah, I wanted something I could recognize on its way out and figured
I'd be a bit creative. Looking at test1..N gets old after a while :)

>> +; class fluffy {
>> +;   int a;
>> +;   int b;
>> +; };
>> +; fluffy animal;
>> +; }
>> +; }
>> +; }
>> +; namespace {
>> +; struct walrus {};
>> +; }
>> +; walrus w;
>> +
>> +; Check that we generate a hash for bar, that it is a particular value and
>> +; that we don't generate a hash for baz or walrus.
>
> Mightn't hurt to briefly explain /why/ we should/shouldn't have a hash
> for each case.

Sure.

>
>> +; CHECK: DW_TAG_structure_type
>> +; CHECK-NEXT: debug_str{{.*}}"bar"
>> +; CHECK: DW_AT_GNU_odr_signature [DW_FORM_data8] (0x200520c0d5b90eff)
>> +; CHECK: DW_TAG_namespace
>> +; CHECK-NEXT: debug_str{{.*}}"echidna"
>> +; CHECK: DW_TAG_namespace
>> +; CHECK-NEXT: debug_str{{.*}}"capybara"
>> +; CHECK: DW_TAG_namespace
>> +; CHECK-NEXT: debug_str{{.*}}"mongoose"
>> +; CHECK: DW_TAG_class_type
>> +; CHECK-NEXT: debug_str{{.*}}"fluffy"
>> +; CHECK: DW_AT_GNU_odr_signature [DW_FORM_data8]   (0x9a0124d5a0c21c52)
>> +; CHECK: DW_TAG_structure_type
>> +; CHECK-NEXT: debug_str{{.*}}"baz"
>> +; CHECK-NOT: DW_AT_GNU_odr_signature
>> +; FIXME: we may want to generate debug info for walrus, but still no hash.
>
> Why?

It's in an anonymous namespace.

-eric