r220493 - Add a "signature" to AST files to verify that they haven't changed

Douglas Gregor dgregor at apple.com
Fri Oct 24 15:00:41 PDT 2014


> On Oct 24, 2014, at 1:13 PM, Richard Smith <richard at metafoo.co.uk> wrote:
> 
> On Fri, Oct 24, 2014 at 8:47 AM, Ben Langmuir <blangmuir at apple.com <mailto:blangmuir at apple.com>> wrote:
> 
>> On Oct 24, 2014, at 8:19 AM, Richard Smith <richard at metafoo.co.uk <mailto:richard at metafoo.co.uk>> wrote:
>> 
>> On Fri, Oct 24, 2014 at 7:18 AM, Ben Langmuir <blangmuir at apple.com <mailto:blangmuir at apple.com>> wrote:
>> 
>>> On Oct 23, 2014, at 6:10 PM, Richard Smith <richard at metafoo.co.uk <mailto:richard at metafoo.co.uk>> wrote:
>>> 
>>> On Thu, Oct 23, 2014 at 5:36 PM, Ben Langmuir <blangmuir at apple.com <mailto:blangmuir at apple.com>> wrote:
>>> 
>>>> On Oct 23, 2014, at 4:52 PM, Richard Smith <richard at metafoo.co.uk <mailto:richard at metafoo.co.uk>> wrote:
>>>> 
>>>> On Thu, Oct 23, 2014 at 3:54 PM, Ben Langmuir <blangmuir at apple.com <mailto:blangmuir at apple.com>> wrote:
>>>> 
>>>>> On Oct 23, 2014, at 3:34 PM, Richard Smith <richard at metafoo.co.uk <mailto:richard at metafoo.co.uk>> wrote:
>>>>> 
>>>>> On Thu, Oct 23, 2014 at 11:41 AM, Ben Langmuir <blangmuir at apple.com <mailto:blangmuir at apple.com>> wrote:
>>>>> 
>>>>>> On Oct 23, 2014, at 11:24 AM, David Blaikie <dblaikie at gmail.com <mailto:dblaikie at gmail.com>> wrote:
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Thu, Oct 23, 2014 at 11:05 AM, Ben Langmuir <blangmuir at apple.com <mailto:blangmuir at apple.com>> wrote:
>>>>>> Author: benlangmuir
>>>>>> Date: Thu Oct 23 13:05:36 2014
>>>>>> New Revision: 220493
>>>>>> 
>>>>>> URL: http://llvm.org/viewvc/llvm-project?rev=220493&view=rev <http://llvm.org/viewvc/llvm-project?rev=220493&view=rev>
>>>>>> Log:
>>>>>> Add a "signature" to AST files to verify that they haven't changed
>>>>>> 
>>>>>> Since the order of the IDs in the AST file (e.g. DeclIDs, SelectorIDs)
>>>>>> is not stable,
>>>>>> 
>>>>>> Um... when you say "not stable" what do you mean? (& when you say "nothing changed" what do you mean)
>>>>> 
>>>>> Identifiers, decls, macros, etc. are referred to by ID numbers, which depend on the order the entity is looked at during serialization.  This in turn often depends on hash-table iteration order and is not stable across runs.  When a module refers to a name in one of its imports, it will find the wrong name if the order has changed since it was first built.
>>>>> 
>>>>> By “not changed” I mean that the sources have not changed.
>>>>> 
>>>>>> 
>>>>>> It's pretty important that the compiler produces identical bytes given identical input. We're pretty clear about fixing any cases where it doesn't (but maybe we haven't been as rigorous about that with AST serialization? I don't know - I'b be a bit scared if we haven’t)
>>>>> 
>>>>> Maybe someone more familiar with the history of the ASTWriter can comment? +rsmith, +dgregor, +akyrtzidis
>>>>> 
>>>>> Non-deterministic output is a bug, but it's been lower priority than other stuff I've been looking at up until now. It's likely to get near the top of my priority stack fairly soon. So... this patch is /technically/ making that existing problem worse by adding more nondeterminism to the output, but since it's also fixing an immediate problem, I suggest we keep this change in place until we've fixed the other nondeterminism, then revert it.
>>>>> 
>>>>> Does that seem reasonable?
>>>> 
>>>> How do you intend to maintain the deterministic output?  My first concern is that if it later regresses, the kind of bug report we will get is ‘I used modules and this name lookup failed’.
>>>> 
>>>> I was intending on adding a test that builds some module twice and checks that it gives the same output.
>>> 
>>> 
>>>> 
>>>> Also, we can rebuild a module and it legitimately changes (you rename ‘A’ to ‘B’ inside one of the headers), but still the resulting pcm doesn’t have a different size or mod-time from the previous version.
>>>> 
>>>> Hmm, aren't we already checking that the mtimes of the input files are the same as when we built the module before we use a module from the cache?
>>> 
>>> Yes, but a pcm may be built apparently "at the same time” as its header file is modified, if the precision of mtime is too low. So we can still write a pcm “at the same time” as we last wrote it, leading to this problem of loading an out-of-date module.
>>> 
>>> Only if the header file is written twice, both writes have the same mtime, and we pick up the file between those two writes. All build systems that use mtimes to determine whether they should rebuild are vulnerable to this problem. (Most are also vulnerable to the input changing between being read and the output changing, too, since they simply check mtime(output) < mtime(inputs) to determine if a rebuild is necessary.)
>>> 
>>> I’m not really concerned by a header being modified more than once in a time period less than the mtime precision, but having the pcm built immediately before and after the header changes is more worrying.
>>> 
>>> If we are checking that the mtimes of the input files are the same as they were when we built the module, the only problems seem to be in your "not really concerned" case, right?
>> 
>> I don’t think so.  A.pcm imports C.pcm, and B.pcm imports C.pcm.
>> 
>> * At the beginning C.h has mtime T1.
>> * We build A and C: mtime for A.pcm, C.pcm is T2; A.pcm expects the mtime of C.pcm to be T2 and C.pcm expects the mtime of C.h to be T1.
>> * immediately change C.h, it’s mtime is T2
>> * Now we build B, and because C.h has changed we rebuild C.pcm.  This can happen quickly enough that the mtime of B.pcm and C.pcm are both T2.
>> * If we later import A, we will not know to rebuild it, because C.pcm and C.h both have the expected mtime T2.
>> 
>> Thanks, I understand the disconnect now. I was imagining we'd also store the mtimes of C.pcm's inputs in A.pcm. I agree that storing them in C.pcm doesn't let A.pcm tell if it's got the right C.pcm. Could we address this by using the hash of that mtime data plus the signatures of C's imports as the signature of C?
> 
> I think that would be sufficient if we can guarantee that the pcm is deterministically created.  I admit that I’m still worried about that:
> 
>>>> I was intending on adding a test that builds some module twice and checks that it gives the same output.
> 
> Can we exhaustively check that the output is deterministic? We can improve the testing as we find more issues, but I’m concerned at how mysterious the failure mode is.  When I discovered that this was a problem it was after several days trying to debug a name lookup failure.  I was lucky that it showed up in the method pool, which I eventually dumped out and saw that there were selectors that had mismatched identifiers in them.  I don’t know what other crazy things can happen or if I would recognize them as symptoms of non-deterministic output.
> 
> We should be able to set up some larger determinism tests (building, say, all of Clang twice and checking the modules come out the same);

It's basically the same thing we do to test that the rest of the compiler is deterministic as part of a multi-stage bootstrap, extended to check the .pcm files. 

> I expect you could do something similar across Objective-C code. That should at least give us confidence that we've found the majority of the issues.

Just running this experiment with the Cocoa module (and everything it imports) should give us fairly decent coverage for Objective-C.

	- Doug


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-commits/attachments/20141024/6c2b4fc2/attachment.html>


More information about the cfe-commits mailing list