r228604 - Diagnose timeouts in the LockFileManager and delete the dead lock file

Ben Langmuir blangmuir at apple.com
Thu Feb 26 15:57:00 PST 2015


> On Feb 26, 2015, at 3:38 PM, Richard Smith <richard at metafoo.co.uk> wrote:
> 
> On Thu, Feb 26, 2015 at 3:17 PM, Ben Langmuir <blangmuir at apple.com> wrote:
>> On Feb 26, 2015, at 2:49 PM, Richard Smith <richard at metafoo.co.uk> wrote:
>> 
>> On Thu, Feb 26, 2015 at 8:30 AM, Ben Langmuir <blangmuir at apple.com> wrote:
>> 
>>> On Feb 25, 2015, at 6:20 PM, Richard Smith <richard at metafoo.co.uk> wrote:
>>> 
>>> On Tue, Feb 24, 2015 at 7:55 PM, Ben Langmuir <blangmuir at apple.com> wrote:
>>> 
>>>> On Feb 24, 2015, at 6:57 PM, Richard Smith <richard at metafoo.co.uk> wrote:
>>>> 
>>>> On Mon, Feb 9, 2015 at 12:35 PM, Ben Langmuir <blangmuir at apple.com> wrote:
>>>> Author: benlangmuir
>>>> Date: Mon Feb  9 14:35:13 2015
>>>> New Revision: 228604
>>>> 
>>>> URL: http://llvm.org/viewvc/llvm-project?rev=228604&view=rev
>>>> Log:
>>>> Diagnose timeouts in the LockFileManager and delete the dead lock file
>>>> 
>>>> If the lock file manager times out, we should give an error rather than
>>>> silently trying to load the existing module.  And delete the
>>>> (presumably) dead lock file, since it will otherwise prevent progress in
>>>> future invocations. This is unsound since we have no way to prove that
>>>> the lock file we are deleting is the same one we timed out on, but since
>>>> the lock is only to avoid excessive rebuilding anyway it should be okay.
>>>> Depends on llvm r228603.
>>>> 
>>>> This diagnostic fires a *lot* when, say, bootstrapping Clang with modules enabled at -j16. Is the timeout perhaps too short?
>>> 
>>> I lowered the timeout to 1 minute in my llvm commit,
>>> 
>>> Well, technically, it went from 8m 44s to 65.5s due to the exponential backoff... :-/
>> 
>> Heh, good point.
>> 
>>> 
>>> so if it helps, feel free to bump it back up.  I’m not particularly concerned about the actual number now that we will actually honour the limit, but the old value of 5 minutes seemed really long to me.
>>> 
>>> Really? Consider a Debug+Asserts clang, building a large module (such as the Clang_AST module). Requiring that to succeed in 65.5 seconds doesn't seem particularly reasonable (and, indeed, some such modules do not build in that environment in 65.5s on my machine, at least when it's heavily loaded).
>> 
>> That sucks.
>> 
>>> 
>>> But I'm not really sure that a timeout is the right approach here at all. What we want to track is whether the creating process is still alive. Perhaps we should have that process periodically bump the timestamp on the lock file, and assume it's died if the timestamp is too far in the past? This'd mean that in the (scary) case of sharing a module cache across a build farm, you'd need your machines to not have too much clock skew, but that's already necessary for build systems that use timestamps to determine what to rebuild, so it doesn't seem like a ridiculous restriction.
>>> 
>>> Thoughts?
>> 
>> I think what you’re proposing is a better way to track whether the creator is alive than what we currently do (looking at the hostname+pid).  We could eliminate the clock-sync dependence if the waiting process set up its own local timer and then reset it if it sees the timestamp of the lock file has changed (i.e., we don’t care what the value is, only whether it changes).  But your suggestion is simpler and automatically handles a new process waiting on a long-dead file in a good way.
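A minimal standalone sketch of this timestamp-watching idea (not the actual LockFileManager code; all names here are hypothetical): the waiter keeps its own deadline and resets it whenever the lock file's mtime changes, so only *changes* matter, not the absolute clock value.

```cpp
#include <cassert>
#include <chrono>
#include <filesystem>
#include <fstream>
#include <thread>

namespace fs = std::filesystem;
using Clock = std::chrono::steady_clock;

// Returns true if the lock was released (file removed), false if the owner
// appears dead (no mtime change within Timeout).
bool waitForUnlock(const fs::path &LockPath, std::chrono::seconds Timeout) {
  std::error_code EC;
  auto LastSeen = fs::last_write_time(LockPath, EC);
  auto Deadline = Clock::now() + Timeout;
  while (fs::exists(LockPath, EC)) {
    if (Clock::now() > Deadline)
      return false; // Owner stopped touching the file: presume it died.
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
    auto Now = fs::last_write_time(LockPath, EC);
    if (!EC && Now != LastSeen) {
      LastSeen = Now;                      // Owner is alive;
      Deadline = Clock::now() + Timeout;   // restart our local timer.
    }
  }
  return true; // Lock file removed: the module build finished.
}
```

Because the deadline is local to the waiting process, no clock synchronization between machines is needed for this variant.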
>> 
>> At some level I like that the timeout gives you feedback when a compilation takes much longer than expected.  How about this: in addition to tracking process liveness, we also use a timeout.  When the timeout is reached, we emit a warning that the module build is taking a long time.  We then forcibly delete the lock file and try again to acquire the lock ourselves. This has the added advantage of combating priority inversion.  If we reach the timeout again (maybe several times?), we emit an error and fail.  What do you think?
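The warn-then-retry-then-fail flow described above might look roughly like this (a simplified sketch with hypothetical names, not the real CompilerInstance logic; the wait, diagnostic, and lock-removal steps are injected as callables):

```cpp
// Hypothetical sketch of the proposed flow: warn after a soft timeout,
// forcibly delete the (presumed dead) lock and retry, and give up after
// repeated timeouts.
enum class WaitResult { Success, OwnerDied, Timeout };

template <typename WaitFn, typename WarnFn, typename RemoveFn>
bool acquireOrWait(WaitFn Wait, WarnFn Warn, RemoveFn RemoveLock,
                   int MaxTimeouts = 2) {
  for (int Attempt = 0; Attempt < MaxTimeouts; ++Attempt) {
    switch (Wait()) {
    case WaitResult::Success:
      return true;  // Owner finished; go load the module.
    case WaitResult::OwnerDied:
      continue;     // Owner crashed; loop and try to take the lock ourselves.
    case WaitResult::Timeout:
      Warn();       // "module build is taking a long time"
      RemoveLock(); // Presume the lock is dead, then retry.
      continue;
    }
  }
  return false; // Repeated timeouts: fail with an error.
}
```

The real logic would also need to distinguish "we now own the lock" from "the module appeared while we waited", which this sketch glosses over.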
>> 
>> For the most part, this sounds great to me. I'm not entirely convinced that it makes sense to remove the lock and retry if we can see that another process is periodically touching it.
> 
> Okay, but right now any time we delete a lock file we may accidentally delete a valid lock because of races.  I haven’t found a good way to avoid the window between checking a lock file is invalid and deleting that file (which is done by name).
> 
> So long as we're conservatively correct in the case of a race, and the probability is low enough, I don't think this is a huge problem (IIUC, we should get a signature mismatch in any downstream modules if this goes wrong, and end up triggering a bunch of additional rebuilds, but we should converge to a stable set of module files in the end). The proposed direction reduces the probability of a race (we only delete the file and retry if the producing end appears to not be updating it), so it seems like an improvement.

Right.

> 
> I think this is just part of the cost we pay by providing an implicit module building system and a transparent upgrade path; we do also provide the ability to explicitly build modules, which avoids all these issues. (That said, we don't yet have a non-cc1 way of building a module explicitly, and we really should add one.)
>> I think issuing a remark after some timeout (1 minute, maybe?), and possibly failing after a longer timeout, may be reasonable.
> 
> I was thinking a warning, but I guess -Werror makes that a bad idea.  Failing instead of clearing the lock file is okay with me.
> 
>> 
>> We'd need a mechanism to turn off the second timeout (for cases where modules simply take a long time to build for whatever reason) and possibly to turn off the first timeout (to ensure we provide deterministic output).
> 
> -Rmodule-timeout-soft
> -Wmodule-timeout (DefaultError)
> 
> We could theoretically make a -W group that included both for convenience, although that might be weird with a remark...
> 
> We want to bail out of compilation when we hit the timeout, so a -W flag doesn't seem like the right model to me (it would need to control more than merely whether we emit the diagnostic). How about -fimplicit-modules-build-timeout=<time in s>, following the usual convention that 0 means "no limit"?

Sounds good.  If we’re going to customize the timeout, it would be nice if the remark timeout is relative to the error timeout.  Maybe a 1:5 ratio, with a default of 1 or 2 minutes for the remark (so 5 or 10 minutes for the error).  I’m not sure how well that would come across in the -help text for this option though...
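Deriving the remark timeout from the error timeout at the suggested 1:5 ratio is trivial; a sketch (the function name and the choice to preserve the 0 = "no limit" convention are assumptions, not anything implemented):

```cpp
// Hypothetical: derive the soft (remark) timeout from the hard (error)
// timeout with a fixed 1:5 ratio.  0 means "no limit" for both.
unsigned remarkTimeoutSec(unsigned ErrorTimeoutSec) {
  return ErrorTimeoutSec == 0 ? 0 : ErrorTimeoutSec / 5;
}
```

So a 600s error timeout would give a 120s remark timeout, matching the "2 minutes for the remark, 10 for the error" default floated above.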

> 
> Another question: where do you think we should touch the file? In a consumer of top-level decls?
> 
> An ASTConsumer could be a reasonable place. That should also let us watch for template instantiations performed at end of TU, which are sometimes a sizeable portion of the parsing time.

Sounds good.
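On the producing side, the callback could be cheap to invoke per top-level decl by throttling the actual filesystem touch. A standalone sketch (not Clang code; the class and its interface are invented for illustration) of what an ASTConsumer might call:

```cpp
#include <cassert>
#include <chrono>
#include <filesystem>
#include <fstream>
#include <thread>

namespace fs = std::filesystem;

// Hypothetical helper: the building process calls ping() frequently
// (e.g. from an ASTConsumer per top-level decl); it only bumps the lock
// file's mtime once per Interval, so waiters see a periodic heartbeat.
class LockKeepAlive {
  fs::path LockPath;
  std::chrono::steady_clock::time_point LastTouch; // epoch => touch at once
  std::chrono::seconds Interval;

public:
  LockKeepAlive(fs::path P, std::chrono::seconds I)
      : LockPath(std::move(P)), LastTouch(), Interval(I) {}

  void ping() {
    auto Now = std::chrono::steady_clock::now();
    if (Now - LastTouch < Interval)
      return; // Throttled: skip the filesystem call.
    LastTouch = Now;
    std::error_code EC;
    fs::last_write_time(LockPath, fs::file_time_type::clock::now(), EC);
  }
};
```

Hooking this into an end-of-TU callback as well would cover the template-instantiation phase mentioned above.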

>  
>>>> Modified:
>>>>     cfe/trunk/include/clang/Basic/DiagnosticCommonKinds.td
>>>>     cfe/trunk/lib/Frontend/CompilerInstance.cpp
>>>> 
>>>> Modified: cfe/trunk/include/clang/Basic/DiagnosticCommonKinds.td
>>>> URL: http://llvm.org/viewvc/llvm-project/cfe/trunk/include/clang/Basic/DiagnosticCommonKinds.td?rev=228604&r1=228603&r2=228604&view=diff
>>>> ==============================================================================
>>>> --- cfe/trunk/include/clang/Basic/DiagnosticCommonKinds.td (original)
>>>> +++ cfe/trunk/include/clang/Basic/DiagnosticCommonKinds.td Mon Feb  9 14:35:13 2015
>>>> @@ -83,6 +83,8 @@ def err_module_not_found : Error<"module
>>>>  def err_module_not_built : Error<"could not build module '%0'">, DefaultFatal;
>>>>  def err_module_lock_failure : Error<
>>>>    "could not acquire lock file for module '%0'">, DefaultFatal;
>>>> +def err_module_lock_timeout : Error<
>>>> +  "timed out waiting to acquire lock file for module '%0'">, DefaultFatal;
>>>>  def err_module_cycle : Error<"cyclic dependency in module '%0': %1">,
>>>>    DefaultFatal;
>>>>  def note_pragma_entered_here : Note<"#pragma entered here">;
>>>> 
>>>> Modified: cfe/trunk/lib/Frontend/CompilerInstance.cpp
>>>> URL: http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/Frontend/CompilerInstance.cpp?rev=228604&r1=228603&r2=228604&view=diff
>>>> ==============================================================================
>>>> --- cfe/trunk/lib/Frontend/CompilerInstance.cpp (original)
>>>> +++ cfe/trunk/lib/Frontend/CompilerInstance.cpp Mon Feb  9 14:35:13 2015
>>>> @@ -1022,9 +1022,19 @@ static bool compileAndLoadModule(Compile
>>>>      case llvm::LockFileManager::LFS_Shared:
>>>>        // Someone else is responsible for building the module. Wait for them to
>>>>        // finish.
>>>> -      if (Locked.waitForUnlock() == llvm::LockFileManager::Res_OwnerDied)
>>>> +      switch (Locked.waitForUnlock()) {
>>>> +      case llvm::LockFileManager::Res_Success:
>>>> +        ModuleLoadCapabilities |= ASTReader::ARR_OutOfDate;
>>>> +        break;
>>>> +      case llvm::LockFileManager::Res_OwnerDied:
>>>>          continue; // try again to get the lock.
>>>> -      ModuleLoadCapabilities |= ASTReader::ARR_OutOfDate;
>>>> +      case llvm::LockFileManager::Res_Timeout:
>>>> +        Diags.Report(ModuleNameLoc, diag::err_module_lock_timeout)
>>>> +            << Module->Name;
>>>> +        // Clear the lock file so that future invocations can make progress.
>>>> +        Locked.unsafeRemoveLockFile();
>>>> +        return false;
>>>> +      }
>>>>        break;
>>>>      }
>>>> 
>>>> 
>>>> 