[lldb-dev] Handling of the ELF files missing build-ids?

Leonard Mosescu via lldb-dev lldb-dev at lists.llvm.org
Mon Aug 6 17:30:16 PDT 2018


>
> I am fine with all the above except some reservations about case C. No
> need to calculate something if it isn't useful. For case C it should be
> fine to never match as if a file has a UUID to begin with it typically
> isn't something that gets stripped in a stripped binary. So we should
> either have it or not. If breakpad does calculate a CRC32, then we need to
> know to ignore the UUID. The problem is we probably won't be able to tell
> what the UUID is: real from build ID, or from GNU debug info CRC, or CRC of
> entire file. So the minidump code will need to do something here. If a
> minidump has the linux auxv and memory map in them, then we might need to
> dig through the section information and deduce if a file matches or not
> based off the size of mapped program headers to further help with the
> matching.
>
> One other idea is to make a set of enumerations for the UUID type:
>
> class UUID {
>   enum class Type {
>     BuildID,         // A build ID from the compiler or linker
>     GNUDebugInfoCRC, // GNU debug info CRC
>     MD5,             // MD5 of entire file
>     MD5NonDebug,     // MD5 of the non debug info related bits
>     CRC32,           // CRC32 of entire file
>     Other,           // Anything else
>   };
> };
>
> The eTypeMD5NonDebug is what apple does: it MD5 checksums only the parts
> of the file that don't change with debug info or any paths found in debug
> info or symbols tables. So if you build a binary in /tmp/a or in
> /private/var/local/foo, the UUID is the same if the binary is essentially
> the same (code, data, etc).
>
> Then we can make intelligent comparisons between UUID types. Might even be
> possible for a module to have more than 1 UUID then if a binary contains a
> eTypeBuildID and a eTypeGNUDebugInfoCRC. If a tool stores its UUIDs as a
> CRC32 or MD5, then those can be calculated on the fly. The GetUUID on
> lldb_private::Module might become:
>
> const lldb_private::UUID &Module::GetUUID(UUID::Type uuid_type);
>
> Thoughts?
>
> Greg


I like the idea of UUID sub-types! This solves the problem is a more
generic fashion and it's also extensible. Interestingly enough, for
Crashpad we're considering something similar (the fabricated UUIDs would
have a different CvRecord signature)

Case C. is a bit ugly so let me elaborate: this is specific to Breakpad
minidump + ELF binaries w/o build-id. So we'd still need to have a way to
force the match of the modules in the minidump with the local files. This
ought to be an off-by-default, sharp tool which you'd only need to deal
with Breakpad minidumps, and even then it would only be a fall-back that
must be explicitly requested.

1. .gnu_debuglink separate file pointer
> <https://sourceware.org/gdb/onlinedocs/gdb/Separate-Debug-Files.html>.
> This is where the choice of the crc algorithm comes from.


Thanks Pavel. As you noted yourself, this doesn't mean that the UUID has to
be tied to the checksum (they are exclusive options). In particular, for
performance reasons I think it's desirable to avoid calculating checksums
for every ELF module just in case.

I think we might have something already which could serve this
> purpose. Eugene added a couple of months ago a mechanism to force-load
> symbols for a given file regardless of the UUIDs
> <https://reviews.llvm.org/D35607>. It requires an explicit "target
> symbols add" command (which seems reasonable, as lldb has no way to
> tell if it's doing things right). Would something like that work for
> you?


Nice. We may have to update the C++ API, but something like this would do
the trick for case C.

To summarize the conversation so far:

1. We can fix cases A, B independent of C: if real UUIDs are missing, don't
automatically use full file CRC32 as UUID.
2. Pay attention to make sure that we don't break .gnu_debuglink or remote
debugging (thanks Pavel)
3. Multiple types/namespaces for UUIDs would be a great feature!
4. Eugene's symbol forcing trick could be extended to handle case C

Did I miss anything?

My current plan is to start with #1, then look into #4 (force symbols
match).


On Sun, Aug 5, 2018 at 12:11 PM, Pavel Labath <labath at google.com> wrote:

> Hello Leonard,
>
> while I'm in principle supportive of this idea, I think it's not going
> to be as easy as you might imagine. There are currently at least two
> mechanisms which rely on this crc UUID.
>
> 1. .gnu_debuglink separate file pointer
> <https://sourceware.org/gdb/onlinedocs/gdb/Separate-Debug-Files.html>.
> This is where the choice of the crc algorithm comes from.
>
> In short, this mechanism for debug info location works like this: The
> stripped file contains a .gnu_debuglink section.  The section contains
> a file path and a crc checksum. After reading this section the
> debugger is expected to look for the file at the given path, and then
> compute it's checksum to verify it is indeed the correct file (hasn't
> been modified).
>
> In LLDB, this is implemented somewhat differently. First we have a
> mechanism for assigning UUIDs to modules. This mechanism does one of
> two things (I'm assuming here none of the files have proper build-ids
> in them). If the file has a .gnu_debuglink section (the stripped
> file), we fetch the CRC from there, and assign that as the build-id of
> the file. If the file doesn't have this section, we **assume** it is
> going to be a target of gnu_debuglink pointer in another file, compute
> its CRC via the specified algorithm, and assign that as the build-id.
>
> Second, we have a mechanism for locating the unstripped file. This
> just fetches the path from gnu_debuglink, and compares the UUIDs of
> the two modules. This works because of the UUID algorithm in the first
> part.
>
> Now, this is not a particularly smart way of doing things and it is
> the cause of the most of your problems. However, it is completely
> avoidable. Instead of piggy-backing on the UUID mechanism, we should
> just change the search mechanism (the second part) to compute the crc
> itself, and only after it successfully finds the file referenced by
> the gnu_debuglink section. This way, we will only compute the crc only
> when absolutely necessary (i.e. for your use case, never).
>
>
> 2. Currently, for remote debugging, we assume that each module has
> some sort of a unique identifier which we can check to see whether we
> have downloaded that file already (see qModuleInfo packet). I am not
> sure what would happen if we suddenly just stopped computing a UUID
> for these files, but we at the very least lose the ability to detect
> changes to the remote files. For this item, I am not sure what would
> be the best course of action. Maybe we should just start relying on
> modification timestamp for these files?
>
> On Sat, 4 Aug 2018 at 02:17, Leonard Mosescu <mosescu at google.com> wrote:
> >
> > Greg, Mark,
> >
> > Looking at the code, LLDB falls back to a full file crc32 to create the
> module UUID if the ELF build-id is missing. This works, in the sense that
> the generated UUID does indeed identify the module.
> >
> > But there are a few problems with this approach:
> >
> > 1. First, runtime performance: a full file crc32 is a terribly
> inefficient way to generate a temporary UUID that is basically just used to
> match a local file to itself.
> > - especially when some unstripped binaries can be very large. for
> example a local chromium build produces a 5.3Gb chrome binary
> > - the crc32 implementation is decent, but single-threaded
> > - to add insult to the injury, it seems a small bug defeats the
> intention to cache the hash value so it ends up being recalculated multiple
> times
> >
> > 2. The fake UUID is not going to match any external UUID that may be
> floating around (and yet not properly embedded into the binary)
> > - an example is Breakpad, which unfortunately also attempts to make up
> UUIDs when the build-id is missing (something we'll hopefully fix soon)
> >
> > Is there a fundamental reason to calculate the full file crc32? If not I
> propose to improve this based on the following observations:
> >
> > A. Model the reality more accurately: an ELF w/o a build-id doesn't
> really have an UUID. So use a zero-length UUID in LLDB.
> > B. The full file name should be enough to prove the identity of a local
> module.
> > C. If we try to match an external UUID (ex. from a minidump) with a
> local file which does not have an UUID it may help to have an option to
> allow it to match (off by default, and only if there's no better match)
>
> I think we might have something already which could serve this
> purpose. Eugene added a couple of months ago a mechanism to force-load
> symbols for a given file regardless of the UUIDs
> <https://reviews.llvm.org/D35607>. It requires an explicit "target
> symbols add" command (which seems reasonable, as lldb has no way to
> tell if it's doing things right). Would something like that work for
> you?
>
> cheers,
> pl
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/lldb-dev/attachments/20180806/0c4b6de8/attachment-0001.html>


More information about the lldb-dev mailing list