[cfe-dev] Getting involved with Clang refactoring

Tue May 29 09:40:30 PDT 2012

On Fri, 25 May 2012 17:43:13 +0200
Manuel Klimek <klimek at google.com> wrote:

> On Fri, May 25, 2012 at 5:19 PM, James K. Lowden
> <jklowden at schemamania.org>wrote:
> 
> > On Fri, 25 May 2012 07:42:46 +0200
> > Manuel Klimek <klimek at google.com> wrote:
> >
> > > > To solve the problem you describe -- to have Clang
> > > > recompile "all files that use a specific symbol"  -- one could
> > > > touch(1) all such files and then run make.
> > >
> > > Not all build systems work on system time. Some work on content
> > > digests.
> >
> > That's clearly insufficient if a header file changes.
> 
> I don't understand how this is different to timestamps when it comes
> to header files - the headers need to be in the dependency graph
> anyways.

I see.  There are two orthogonal issues: the dependency graph, and
determining which files in that graph have changed.  No matter how
"change" is measured/defined, anything affected by it -- anything
further down the graph -- needs recompilation or relinking.  

With make, we express the dependency graph directly in make's
target:source syntax, and "change" is defined in terms of the files'
mtimes.  make is quite good at converting its source language into a
DAG, but to determine what's changed it must rely on the underlying
system.  

> > > Even if this worked, the amount of stats that build systems
> > > do in ReallyLargeCodeBases to figure out what to build can be a
> > > huge cost, and can actually dwarf the time it takes to reparse 3
> > > or 4 files you care about.
> >
> > I doubt that.  Could you point me to research supporting that claim?
> 
> No. I can point you to my anecdotal experience with networked file
> systems and code bases that are so large that you don't want to check
> out everything at once.

I think I see where you're going.  

If the source control system (or, say, compilation database) maintains
fingerprints, it's possible to determine what's changed a priori,
without checking out the files to a local working directory.  If it
furthermore represents the dependency graph, it's also possible to
check out only "what's changed" prior to rebuilding, and to federate
that rebuilding. In effect, the fingerprints represent change directly,
whereas inode mtime is only a proxy, and not one that scales well along
a network.  

Each time make runs, it must stat the whole tree, meaning that its
evaluation time is O(n) with the number of files.  A database must also
make as many comparisons, but those comparisons orders of magnitude
faster than a system call.  If N is 100 or even 1000, none of that
matters very much, but if N is 1,000,000 it will be measurable.  

It's interesting to think of make divorced from the filesystem, to
imagine a make implementation that used a database of fingerprints to
answer the "is changed" question.  I'm guessing that is, in a way, what
you've done or are doing, except that it isn't as general as make.  

> > > One more reason: dependency order introduces artificial
> > > serialization.
> >
> > Are you saying the dependencies of static analysis are lesser than
> > for compilation?  I would think that's ultimately untrue, given that
> > eventually static analysis would have to be applied to the whole
> > program, even if that's not the case today.
> 
> But that doesn't mean you can't apply it in parallel. We've had great
> success using a MapReduce over ~100MLOC code to do very fast parallel
> static analysis.

Yes, but part of that "win" is due to the fact that you're not
linking.  Static analysis today is comparable to compilation, the
evaluation of a single translation unit.  

> > Nothing against you trying; I'm hoping at least one of us will learn
> > something.  :-)

If I'm on target above, then our experiment yielded unexpected but
useful results!  

--jkl