[lldb-dev] Handling of the ELF files missing build-ids?

Sat Aug 4 10:28:48 PDT 2018

> On Aug 3, 2018, at 6:17 PM, Leonard Mosescu <mosescu at google.com> wrote:
> 
> Greg, Mark,
> 
> Looking at the code, LLDB falls back to a full file crc32 to create the module UUID if the ELF build-id is missing. This works, in the sense that the generated UUID does indeed identify the module. 
> 
> But there are a few problems with this approach:
> 
> 1. First, runtime performance: a full file crc32 is a terribly inefficient way to generate a temporary UUID that is basically just used to match a local file to itself.
> - especially when some unstripped binaries can be very large; for example, a local Chromium build produces a 5.3 GB chrome binary
> - the crc32 implementation is decent, but single-threaded
> - to add insult to injury, it seems a small bug defeats the intention to cache the hash value, so it ends up being recalculated multiple times
> 
> 2. The fake UUID is not going to match any external UUID that may be floating around (and yet not properly embedded into the binary)
> - an example is Breakpad, which unfortunately also attempts to make up UUIDs when the build-id is missing (something we'll hopefully fix soon)
> 
> Is there a fundamental reason to calculate the full file crc32? If not I propose to improve this based on the following observations:
> 
> A. Model the reality more accurately: an ELF w/o a build-id doesn't really have a UUID. So use a zero-length UUID in LLDB.
> B. The full file name should be enough to prove the identity of a local module.
> C. If we try to match an external UUID (e.g. from a minidump) with a local file which does not have a UUID, it may help to have an option to allow it to match (off by default, and only if there's no better match)
> 
> What do you think?

I am fine with all of the above, with some reservations about case C. There is no need to calculate something if it isn't useful. For case C, it should be fine to never match: if a file has a UUID to begin with, it typically isn't something that gets removed when the binary is stripped, so we should either have it or not. If Breakpad does calculate a CRC32, then we need to know to ignore that UUID. The problem is that we probably won't be able to tell what kind of UUID it is: a real one from the build ID, one from the GNU debug info CRC, or a CRC of the entire file. So the minidump code will need to do something here. If a minidump contains the Linux auxv and memory map, then we might need to dig through the section information and deduce whether a file matches based on the sizes of the mapped program headers, to further help with the matching.
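
To make the matching rules in proposals A, B and C concrete, here is a minimal sketch. The types and names (ModuleId, MatchOptions, Matches) are hypothetical stand-ins and not LLDB's actual classes, and comparing full paths is an assumption about what "identity of a local module" would mean in practice.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for a module identity; not LLDB's real type.
struct ModuleId {
  std::string path;               // full path of the file
  std::vector<std::uint8_t> uuid; // empty when there is no build-id (proposal A)
};

struct MatchOptions {
  // Proposal C: allow an external UUID to match a local file without a UUID.
  // Off by default.
  bool allow_missing_local_uuid = false;
};

// Returns true if `local` is an acceptable match for the requested `wanted`.
inline bool Matches(const ModuleId &wanted, const ModuleId &local,
                    MatchOptions opts = MatchOptions()) {
  // Nothing to compare against: the full path alone decides (proposal B).
  if (wanted.uuid.empty())
    return wanted.path == local.path;
  // A UUID was requested but the local file has none: only match if the
  // caller opted in, and only when the paths agree.
  if (local.uuid.empty())
    return opts.allow_missing_local_uuid && wanted.path == local.path;
  // Both sides carry a real UUID: it is the only thing that counts.
  return wanted.uuid == local.uuid;
}
```

With the option left at its default, a minidump module that carries a UUID never matches a local file without one, which is the "never match" behavior described above.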

One other idea is to make a set of enumerations for the UUID type:

class UUID {
public:
  enum class Type {
    BuildID,         // A build ID from the compiler or linker
    GNUDebugInfoCRC, // GNU debug info CRC
    MD5,             // MD5 of entire file
    MD5NonDebug,     // MD5 of the non debug info related bits
    CRC32,           // CRC32 of entire file
    Other,           // Anything else
  };
};

MD5NonDebug is what Apple does: it computes an MD5 checksum over only the parts of the file that don't change with debug info or with any paths found in debug info or symbol tables. So if you build a binary in /tmp/a or in /private/var/local/foo, the UUID is the same if the binary is essentially the same (code, data, etc).
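
As a rough illustration of the "hash only the non-debug bits" idea, here is a sketch. Everything in it is an assumption made for the example: the Section struct is hypothetical, debug content is recognized purely by a ".debug_" name prefix, and 64-bit FNV-1a stands in for MD5 so the code has no external dependencies. It is not how any real linker computes its UUID.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical in-memory view of one section of an object file.
struct Section {
  std::string name;
  std::vector<std::uint8_t> bytes;
};

// Assumption for this sketch: anything named ".debug_*" is debug info.
inline bool IsDebugSection(const Section &s) {
  return s.name.compare(0, 7, ".debug_") == 0;
}

// 64-bit FNV-1a over the non-debug sections only (stand-in for MD5).
inline std::uint64_t HashNonDebug(const std::vector<Section> &sections) {
  std::uint64_t h = 14695981039346656037ULL; // FNV offset basis
  for (const Section &s : sections) {
    if (IsDebugSection(s))
      continue; // build paths and other debug info do not affect the ID
    for (std::uint8_t b : s.bytes) {
      h ^= b;
      h *= 1099511628211ULL; // FNV prime
    }
  }
  return h;
}
```

Two builds whose code and data are identical but whose debug sections embed different build directories hash to the same value here.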

Then we can make intelligent comparisons between UUID types. It might even be possible for a module to have more than one UUID, for example if a binary contains both a BuildID and a GNUDebugInfoCRC. If a tool stores its UUIDs as a CRC32 or MD5, then those can be calculated on the fly. GetUUID on lldb_private::Module might become:

const lldb_private::UUID &Module::GetUUID(UUID::Type uuid_type);
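
A minimal sketch of how that could hang together is below. The classes are simplified stand-ins, not the real lldb_private::UUID and lldb_private::Module; AddUUID, IsValid, GetType and the std::map storage are assumptions made only for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Simplified stand-in for a typed UUID.
class UUID {
public:
  enum class Type { BuildID, GNUDebugInfoCRC, MD5, MD5NonDebug, CRC32, Other };

  UUID() = default;
  UUID(Type type, std::vector<std::uint8_t> bytes)
      : m_type(type), m_bytes(std::move(bytes)) {}

  bool IsValid() const { return !m_bytes.empty(); }
  Type GetType() const { return m_type; }

  // Only valid UUIDs of the same kind compare equal; a CRC32 never
  // matches a build ID even if the bytes happen to coincide.
  bool operator==(const UUID &rhs) const {
    return IsValid() && m_type == rhs.m_type && m_bytes == rhs.m_bytes;
  }

private:
  Type m_type = Type::Other;
  std::vector<std::uint8_t> m_bytes;
};

// Simplified stand-in for a module that can carry one UUID per type.
class Module {
public:
  void AddUUID(const UUID &uuid) { m_uuids[uuid.GetType()] = uuid; }

  // Returns an invalid (empty) UUID if the module has none of this type.
  const UUID &GetUUID(UUID::Type uuid_type) const {
    static const UUID kInvalid{};
    auto it = m_uuids.find(uuid_type);
    return it == m_uuids.end() ? kInvalid : it->second;
  }

private:
  std::map<UUID::Type, UUID> m_uuids;
};
```

A caller such as the minidump code could then ask for exactly the kind of ID it has, and treat an invalid result as "this module has no UUID of that type" rather than falling back to a computed hash.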

Thoughts?

Greg

> 
> Thanks,
> Lemo.
>