[PATCH] llvm-cov: Updated file checksum to be timestamp.

Mon Nov 18 11:22:55 PST 2013

> -----Original Message-----
> From: Nick Lewycky [mailto:nicholas at mxc.ca]
> Sent: Monday, November 18, 2013 2:19 AM
> To: Robinson, Paul
> Cc: Eric Christopher; Dave Blaikie; llvm-commits at cs.uiuc.edu
> Subject: Re: [PATCH] llvm-cov: Updated file checksum to be timestamp.
> 
> Eric Christopher wrote:
> >
> > On Nov 16, 2013 8:37 PM, "David Blaikie" <dblaikie at gmail.com
> > <mailto:dblaikie at gmail.com>> wrote:
> >  >
> >  >
> >  >
> >  >
> >  > On Fri, Nov 15, 2013 at 11:03 PM, Eric Christopher
> > <echristo at gmail.com <mailto:echristo at gmail.com>> wrote:
> >  >>
> >  >> On Fri, Nov 15, 2013 at 6:08 PM, Nick Lewycky <nlewycky at google.com
> > <mailto:nlewycky at google.com>> wrote:
> >  >> > On 15 November 2013 17:38, Robinson, Paul
> >  >> > <Paul_Robinson at playstation.sony.com
> > <mailto:Paul_Robinson at playstation.sony.com>> wrote:
> >  >> >>
> >  >> >> > -----Original Message-----
> >  >> >> > From: llvm-commits-bounces at cs.uiuc.edu
> > <mailto:llvm-commits-bounces at cs.uiuc.edu> [mailto:llvm-commits-
> > <mailto:llvm-commits->
> >  >> >> > bounces at cs.uiuc.edu <mailto:bounces at cs.uiuc.edu>] On Behalf
> Of
> > Yuchen Wu
> >  >> >> > Sent: Friday, November 15, 2013 4:59 PM
> >  >> >> > To: Nick Lewycky; Bob Wilson
> >  >> >> > Cc: llvm-commits at cs.uiuc.edu <mailto:llvm-
> commits at cs.uiuc.edu>
> >  >> >> > Subject: RE: [PATCH] llvm-cov: Updated file checksum to be
> > timestamp.
> >  >> >> >
> >  >> >> > >> 2. Using the output file itself to seed hash function,
> which
> > makes
> >  >> >> > it
> >  >> >> > >> deterministic. I've tried implementing this using the size
> > of the
> >  >> >> > >> output buffer and it was pretty simple. The problem with
> it,
> > however,
> >  >> >> > >> is that there's a lot more chance for a change to the GCNO
> > file to go
> >  >> >> > >> unnoticed. I also think that even if the source hadn't
> changed
> >  >> >> > between
> >  >> >> > >> compiles, the new binary files shouldn't be compatible
> with
> > the old.
> >  >> >> > >
> >  >> >> > > This is obviously the correct approach. In general, it's
> > important to
> >  >> >> > > be able to have reproducible builds so that we can
> reproduce
> > the same
> >  >> >> > > binaries from source, builds where outputs can be cached
> (for
> > instance
> >  >> >> > > by modern non-make build systems that use the md5 of the
> output
> >  >> >> > files),
> >  >> >> > > etc. GCC's behaviour is silly and there's no need to
> > replicate it.
> >  >> >> > >
> >  >> >> > >> "The problem with it, however, is that there's a lot more
> > chance for
> >  >> >> > a
> >  >> >> > >> change to the GCNO file to go unnoticed."
> >  >> >> > >
> >  >> >> > > What do you mean by this? Are you worried that things could
> > go into
> >  >> >> > the
> >  >> >> > > GCNO file without being an input to the hash function? The
> > checksum is
> >  >> >> > > a safety measure to help people avoid accidentally putting
> > mismatching
> >  >> >> > > GCNO and GCDA files together. Not having something be input
> > to the
> >  >> >> > hash
> >  >> >> > > is the safe failure. We don't want the checksum to change
> if
> > other
> >  >> >> > > parts of the GCNO file weren't modified.
> >  >> >> >
> >  >> >> > What I meant by the last statement was that if you are doing
> > something
> >  >> >> > like hashing the size of the file to compute a checksum,
> there
> > is a much
> >  >> >> > higher chance that you may be using a GCNO file generated
> from a
> >  >> >> > different source that just happens to be the same size.
> > Obviously that
> >  >> >> > was just an example, so if you guys came across a better way
> to
> > seed the
> >  >> >> > hash for Google's gcc checksum, I'd be happy to hear it :)
> >  >> >>
> >  >> >> Can we use an MD5 of the source file here? (Not having looked
> at the
> >  >> >> patch, sorry...) The only reason I ask is that there's a DWARF
> 5
> > feature
> >  >> >> to use MD5 instead of timestamps in the debug-line info, so
> > computing an
> >  >> >> MD5 of the source files is something we'll want to do anyway,
> > eventually.
> >  >> >
> >  >> >
> >  >> > That's reasonable, but I'd prefer to have the .gcno's checksum
> not
> > depend on
> >  >> > things which aren't in the .gcno. Fixing a typo in a comment for
> > instance
> >  >> > produces the same .o file and I'd like it to produce the same
> > .gcno file.
> >  >> >
> >  >> > That applies to DWARF too. I hope we haven't standardized
> > something that
> >  >> > requires us to emit a different .o file just because of a typo
> fix
> > in a
> >  >> > comment.
> >  >> >
> >  >>
> >  >> It's not finalized yet :)
> >  >
> >  >
> >  > But the current spec (defining package hashing in terms of type
> > hashing) doesn't include line numbers and a bunch of other stuff in
> the
> > hash - so no, fixing a typo in a comment won't change the package
> hash.
> >  >
> >
> > Yeah, this is a dwarf 5 proposal from Paul that he was bringing up.
> >
> > http://www.dwarfstd.org/ShowIssue.php?issue=130701.1
> >
> > Which looks like it is going to have this issue. Paul?

That's an md5 hash (of the non-preprocessed source, i.e. of the file
as it exists when the debugger goes to look for it) and so yes it
would change if you fix a typo.  The existing size/timestamp data
would also change, unless you wizardly keep the timestamp from changing,
so this is not a new defect of the debug-line md5 proposal.

(Note that not all typo corrections are benign from a line-info
standpoint; your this-edit-doesn't-affect-anything detector has to
be smarter than that, if you wanted to invent such a thing.)

My Monday Morning opinion is that I shouldn't have said anything;
clearly any hash or validation stamp relative to profile data should
be based on what's relevant to the profile data, which is somewhat
related to the original source file but not that tightly.

--paulr

> 
> Thanks for the link Eric. The issue includes a rationale which is
> sufficient to continue discussion. Fundamentally they need to integrate
> with an immovable object that wants to know the MD5 of the source. That
> part's not negotiable.
> 
> The reason I want to minimize changes to the .o files is that I'm
> optimizing for build systems which look at the md5's of the .o's before
> performing a link, and skip the link if the inputs haven't changed. Make
> doesn't do this, but cmake, ninja and others do. (Also, the rebuilt
> binary might be 'tblgen' which in turn produces other source files. We
> want to cut this tree if at all possible.)
> 
> VS decided to put the md5 of each source file in the debug info, and
> Paul's proposal is to do the same with llvm-produced DWARF. Bad news,
> this would break minimal rebuilding build systems. Good news, fission.
> With fission we don't need to relink the .text objects if only the debug
> objects changed. Is that sufficient?
> 
> If not, we may need to include it just to make VS integration folks
> happy, but I will insist that it stay off for my builds.
> 
> Nick
> 
> >  > (I'm not sure it's right not to include line info, though - that
> > seems sort of important to regenerate the debug info)
> >
> >
> >
> > _______________________________________________
> > llvm-commits mailing list
> > llvm-commits at cs.uiuc.edu
> > http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
>