r228604 - Diagnose timeouts in the LockFileManager and delete the dead lock file

Richard Smith richard at metafoo.co.uk
Thu Feb 26 15:38:29 PST 2015


On Thu, Feb 26, 2015 at 3:17 PM, Ben Langmuir <blangmuir at apple.com> wrote:

> On Feb 26, 2015, at 2:49 PM, Richard Smith <richard at metafoo.co.uk> wrote:
>
> On Thu, Feb 26, 2015 at 8:30 AM, Ben Langmuir <blangmuir at apple.com> wrote:
>
>>
>> On Feb 25, 2015, at 6:20 PM, Richard Smith <richard at metafoo.co.uk> wrote:
>>
>> On Tue, Feb 24, 2015 at 7:55 PM, Ben Langmuir <blangmuir at apple.com>
>> wrote:
>>
>>>
>>> On Feb 24, 2015, at 6:57 PM, Richard Smith <richard at metafoo.co.uk>
>>> wrote:
>>>
>>> On Mon, Feb 9, 2015 at 12:35 PM, Ben Langmuir <blangmuir at apple.com>
>>> wrote:
>>>
>>>> Author: benlangmuir
>>>> Date: Mon Feb  9 14:35:13 2015
>>>> New Revision: 228604
>>>>
>>>> URL: http://llvm.org/viewvc/llvm-project?rev=228604&view=rev
>>>> Log:
>>>> Diagnose timeouts in the LockFileManager and delete the dead lock file
>>>>
>>>> If the lock file manager times out, we should give an error rather than
>>>> silently trying to load the existing module.  And delete the
>>>> (presumably) dead lock file, since it will otherwise prevent progress in
>>>> future invocations. This is unsound since we have no way to prove that
>>>> the lock file we are deleting is the same one we timed out on, but since
>>>> the lock is only to avoid excessive rebuilding anyway it should be okay.
>>>> Depends on llvm r228603.
>>>>
>>>
>>> This diagnostic fires a *lot* when, say, bootstrapping Clang with
>>> modules enabled at -j16. Is the timeout perhaps too short?
>>>
>>>
>>> I lowered the timeout to 1 minute in my llvm commit,
>>>
>>
>> Well, technically, it went from 8m 44s to 65.5s due to the exponential
>> backoff... :-/
>>
>>
>> Heh, good point.
>>
>>
>> so if it helps, feel free to bump it back up.  I’m not particularly
>>> concerned about the actual number now that we will actually honour the
>>> limit, but the old value of 5 minutes seemed really long to me.
>>>
>>
>> Really? Consider a Debug+Asserts clang, building a large module (such as
>> the Clang_AST module). Requiring that to succeed in 65.5 seconds doesn't
>> seem particularly reasonable (and, indeed, some such modules do not build
>> in that environment in 65.5s on my machine, at least when it's heavily
>> loaded).
>>
>>
>> That sucks.
>>
>>
>> But I'm not really sure that a timeout is the right approach here at all.
>> What we want to track is whether the creating process is still alive.
>> Perhaps we should have that process periodically bump the timestamp on the
>> lock file, and assume it's died if the timestamp is too far in the past?
>> This'd mean that in the (scary) case of sharing a module cache across a
>> build farm, you'd need your machines to not have too much clock skew, but
>> that's already necessary for build systems that use timestamps to determine
>> what to rebuild, so it doesn't seem like a ridiculous restriction.
>>
>> Thoughts?
>>
>>
>> I think what you’re proposing is a better way to track whether the
>> creator is alive than what we currently do (looking at the hostname+pid).
>> We could eliminate the clock-sync dependence if the waiting process set up
>> its own local timer and then reset it if it sees the timestamp of the lock
>> file has changed (i.e. we don’t care what the value is, only whether it
>> changes).  But your suggestion is simpler and automatically handles a new
>> process waiting on a long-dead file in a good way.
>>
>> At some level I like that the timeout gives you feedback when a
>> compilation takes much longer than expected.  How about this: in addition
>> to tracking process liveness, we also use a timeout.  When the timeout is
>> reached, we emit a warning that the module build is taking a long time.  We
>> then forcibly delete the lock file and try again to acquire the lock
>> ourselves. This has the added advantage of combating priority inversion.
>> If we reach the timeout again (maybe several times?), we emit an error and
>> fail.  What do you think?
>>
>
> For the most part, this sounds great to me. I'm not entirely convinced
> that it makes sense to remove the lock and retry if we can see that another
> process is periodically touching it.
>
>
> Okay, but right now any time we delete a lock file we may accidentally
> delete a valid lock because of races.  I haven’t found a good way to avoid
> the window between checking that a lock file is invalid and deleting that
> (which is done by name).
>

So long as we're conservatively correct in the case of a race, and the
probability is low enough, I don't think this is a huge problem (IIUC, we
should get a signature mismatch in any downstream modules if this goes
wrong, and end up triggering a bunch of additional rebuilds, but we should
converge to a stable set of module files in the end). The proposed
direction reduces the probability of a race (we only delete the file and
retry if the producing end appears to not be updating it), so it seems like
an improvement.

I think this is just part of the cost we pay by providing an implicit
module building system and a transparent upgrade path; we do also provide
the ability to explicitly build modules, which avoids all these issues.
(That said, we don't yet have a non-cc1 way of building a module
explicitly, and we really should add one.)

> I think issuing a remark after some timeout (1 minute, maybe?), and
> possibly failing after a longer timeout, may be reasonable.
>
>
> I was thinking a warning, but I guess -Werror makes that a bad idea.
> Failing instead of clearing the lock file is okay with me.
>
>
> We'd need a mechanism to turn off the second timeout (for cases where
> modules simply take a long time to build for whatever reason) and possibly
> to turn off the first timeout (to ensure we provide deterministic output).
>
>
> -Rmodule-timeout-soft
> -Wmodule-timeout (DefaultError)
>
> We could theoretically make a -W group that included both for convenience,
> although that might be weird with a remark...
>

We want to bail out of compilation when we hit the timeout, so a -W flag
doesn't seem like the right model to me (it would need to control more than
merely whether we emit the diagnostic). How about
-fimplicit-modules-build-timeout=<time in s>, following the usual
convention that 0 means "no limit"?
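The semantics of the proposed flag (the flag name and any plumbing here are hypothetical, not an existing option) would amount to a wait loop like the following, where a timeout of 0 disables the limit:

```cpp
#include <chrono>
#include <functional>

enum class WaitResult { Success, Timeout };

// Sketch of a waitForUnlock-style loop honouring the "0 means no limit"
// convention. LockStillHeld stands in for whatever probe the real
// LockFileManager uses to see whether the lock file is still present.
inline WaitResult waitForUnlock(const std::function<bool()> &LockStillHeld,
                                std::chrono::seconds Timeout) {
  using Clock = std::chrono::steady_clock;
  auto Deadline = Clock::now() + Timeout;
  while (LockStillHeld()) {
    if (Timeout.count() != 0 && Clock::now() >= Deadline)
      return WaitResult::Timeout;
    // A real implementation would sleep (with backoff) here rather than
    // spin; omitted to keep the sketch short.
  }
  return WaitResult::Success;
}
```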

> Another question: where do you think we should touch the file? In a
> consumer of top-level decls?
>

An ASTConsumer could be a reasonable place. That should also let us watch
for template instantiations performed at end of TU, which are sometimes a
sizeable portion of the parsing time.


> Modified:
>>>>     cfe/trunk/include/clang/Basic/DiagnosticCommonKinds.td
>>>>     cfe/trunk/lib/Frontend/CompilerInstance.cpp
>>>>
>>>> Modified: cfe/trunk/include/clang/Basic/DiagnosticCommonKinds.td
>>>> URL:
>>>> http://llvm.org/viewvc/llvm-project/cfe/trunk/include/clang/Basic/DiagnosticCommonKinds.td?rev=228604&r1=228603&r2=228604&view=diff
>>>>
>>>> ==============================================================================
>>>> --- cfe/trunk/include/clang/Basic/DiagnosticCommonKinds.td (original)
>>>> +++ cfe/trunk/include/clang/Basic/DiagnosticCommonKinds.td Mon Feb  9
>>>> 14:35:13 2015
>>>> @@ -83,6 +83,8 @@ def err_module_not_found : Error<"module
>>>>  def err_module_not_built : Error<"could not build module '%0'">,
>>>> DefaultFatal;
>>>>  def err_module_lock_failure : Error<
>>>>    "could not acquire lock file for module '%0'">, DefaultFatal;
>>>> +def err_module_lock_timeout : Error<
>>>> +  "timed out waiting to acquire lock file for module '%0'">,
>>>> DefaultFatal;
>>>>  def err_module_cycle : Error<"cyclic dependency in module '%0': %1">,
>>>>    DefaultFatal;
>>>>  def note_pragma_entered_here : Note<"#pragma entered here">;
>>>>
>>>> Modified: cfe/trunk/lib/Frontend/CompilerInstance.cpp
>>>> URL:
>>>> http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/Frontend/CompilerInstance.cpp?rev=228604&r1=228603&r2=228604&view=diff
>>>>
>>>> ==============================================================================
>>>> --- cfe/trunk/lib/Frontend/CompilerInstance.cpp (original)
>>>> +++ cfe/trunk/lib/Frontend/CompilerInstance.cpp Mon Feb  9 14:35:13 2015
>>>> @@ -1022,9 +1022,19 @@ static bool compileAndLoadModule(Compile
>>>>      case llvm::LockFileManager::LFS_Shared:
>>>>        // Someone else is responsible for building the module. Wait for
>>>> them to
>>>>        // finish.
>>>> -      if (Locked.waitForUnlock() ==
>>>> llvm::LockFileManager::Res_OwnerDied)
>>>> +      switch (Locked.waitForUnlock()) {
>>>> +      case llvm::LockFileManager::Res_Success:
>>>> +        ModuleLoadCapabilities |= ASTReader::ARR_OutOfDate;
>>>> +        break;
>>>> +      case llvm::LockFileManager::Res_OwnerDied:
>>>>          continue; // try again to get the lock.
>>>> -      ModuleLoadCapabilities |= ASTReader::ARR_OutOfDate;
>>>> +      case llvm::LockFileManager::Res_Timeout:
>>>> +        Diags.Report(ModuleNameLoc, diag::err_module_lock_timeout)
>>>> +            << Module->Name;
>>>> +        // Clear the lock file so that future invocations can make
>>>> progress.
>>>> +        Locked.unsafeRemoveLockFile();
>>>> +        return false;
>>>> +      }
>>>>        break;
>>>>      }
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> cfe-commits mailing list
>>>> cfe-commits at cs.uiuc.edu
>>>> http://lists.cs.uiuc.edu/mailman/listinfo/cfe-commits
>>>
>>>
>>
>
>