[PATCH] llvm-cov: Updated file checksum to be timestamp.

Fri Nov 15 18:04:12 PST 2013

On 15 November 2013 16:58, Yuchen Wu <yuchenericwu at hotmail.com> wrote:

> >> 2. Using the output file itself to seed hash function, which makes it
> >> deterministic. I've tried implementing this using the size of the
> >> output buffer and it was pretty simple. The problem with it, however,
> >> is that there's a lot more chance for a change to the GCNO file to go
> >> unnoticed. I also think that even if the source hadn't changed between
> >> compiles, the new binary files shouldn't be compatible with the old.
> >
> > This is obviously the correct approach. In general, it's important to
> > be able to have reproducible builds so that we can reproduce the same
> > binaries from source, builds where outputs can be cached (for instance
> > by modern non-make build systems that use the md5 of the output files),
> > etc. GCC's behaviour is silly and there's no need to replicate it.
> >
> >> "The problem with it, however, is that there's a lot more chance for a
> >> change to the GCNO file to go unnoticed."
> >
> > What do you mean by this? Are you worried that things could go into the
> > GCNO file without being an input to the hash function? The checksum is
> > a safety measure to help people avoid accidentally putting mismatching
> > GCNO and GCDA files together. Not having something be input to the hash
> > is the safe failure. We don't want the checksum to change if other
> > parts of the GCNO file weren't modified.
>
> What I meant by the last statement was that if you are doing something
> like hashing the size of the file to compute a checksum, there is a much
> higher chance that you may be using a GCNO file generated from a different
> source that just happens to be the same size. Obviously that was just an
> example, so if you guys came across a better way to seed the hash for
> Google's gcc checksum, I'd be happy to hear it :)
>

Heh. I was just looking up the answer your question, and found this comment:

   "The checksum is calculated carefully so that
   source code changes that doesn't affect the control flow graph
   won't change the checksum.
   This is to make the profile data useable across source code change."

The CFG checksum appears to be a crc32 over the index numbers of the
destination blocks for each basic block edge. I guess the blocks are
numbered consistently run to run.

The line number checksum is a checksum over the *assembler* name of the
function (C++ mangled or including __asm__ annotations), and the first line
number of the function.

The function checksum appears to be a counter.

Anyway, I've taken your arguments to heart and here is a new patch that
> uses a deterministic approach. Feedback is greatly welcome.
>


Thanks!

@@ -45,7 +45,9 @@ bool GCOVFile::read(GCOVBuffer &Buffer) {
   if (Format == GCOV::InvalidGCOV)
     return false;

+  static uint32_t Checksum;

This should be a member on GCOVFile instead? Imagine if we read a .gcda
file first, or if we create ten GCOVFiles for ten .gcno files?

-; RUN: head -c12 %T/version.gcno | grep '^oncg\*204MVLL$'
+; RUN: head -c12 %T/version.gcno | grep '^oncg\*204....$'

Use head -c8 and removing the trailing "...."?

Nick
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20131115/80971d82/attachment.html>