r228604 - Diagnose timeouts in the LockFileManager and delete the dead lock file

Ben Langmuir blangmuir at apple.com
Thu Feb 26 15:17:00 PST 2015


> On Feb 26, 2015, at 2:49 PM, Richard Smith <richard at metafoo.co.uk> wrote:
> 
> On Thu, Feb 26, 2015 at 8:30 AM, Ben Langmuir <blangmuir at apple.com <mailto:blangmuir at apple.com>> wrote:
> 
>> On Feb 25, 2015, at 6:20 PM, Richard Smith <richard at metafoo.co.uk <mailto:richard at metafoo.co.uk>> wrote:
>> 
>> On Tue, Feb 24, 2015 at 7:55 PM, Ben Langmuir <blangmuir at apple.com <mailto:blangmuir at apple.com>> wrote:
>> 
>>> On Feb 24, 2015, at 6:57 PM, Richard Smith <richard at metafoo.co.uk <mailto:richard at metafoo.co.uk>> wrote:
>>> 
>>> On Mon, Feb 9, 2015 at 12:35 PM, Ben Langmuir <blangmuir at apple.com <mailto:blangmuir at apple.com>> wrote:
>>> Author: benlangmuir
>>> Date: Mon Feb  9 14:35:13 2015
>>> New Revision: 228604
>>> 
>>> URL: http://llvm.org/viewvc/llvm-project?rev=228604&view=rev <http://llvm.org/viewvc/llvm-project?rev=228604&view=rev>
>>> Log:
>>> Diagnose timeouts in the LockFileManager and delete the dead lock file
>>> 
>>> If the lock file manager times out, we should give an error rather than
>>> silently trying to load the existing module.  And delete the
>>> (presumably) dead lock file, since it will otherwise prevent progress in
>>> future invokations. This is unsound since we have no way to prove that
>>> the lock file we are deleting is the same one we timed out on, but since
>>> the lock is only to avoid excessive rebuilding anyway it should be okay.
>>> Depends on llvm r228603.
>>> 
>>> This diagnostic fires a *lot* when, say, bootstrapping Clang with modules enabled at -j16. Is the timeout perhaps too short?
>> 
>> I lowered the timeout to 1 minute in my llvm commit,
>> 
>> Well, technically, it went from 8m 44s to 65.5s due to the exponential backoff... :-/
> 
> Heh, good point.
> 
>> 
>> so if it helps, feel free to bump it back up.  I’m not particularly concerned about the actual number now that we will actually honour the limit, but the old value of 5 minutes seemed really long to me.
>> 
>> Really? Consider a Debug+Asserts clang, building a large module (such as the Clang_AST module). Requiring that to succeed in 65.5 seconds doesn't seem particularly reasonable (and, indeed, some such modules do not build in that environment in 65.5s on my machine, at least when it's heavily loaded).
> 
> That sucks.
> 
>> 
>> But I'm not really sure that a timeout is the right approach here at all. What we want to track is whether the creating process is still alive. Perhaps we should have that process periodically bump the timestamp on the lock file, and assume it's died if the timestamp is too far in the past? This'd mean that in the (scary) case of sharing a module cache across a build farm, you'd need your machines to not have too much clock skew, but that's already necessary for build systems that use timestamps to determine what to rebuild, so it doesn't seem like a ridiculous restriction.
>> 
>> Thoughts?
> 
> I think what you’re proposing is a better way to track whether the creator is alive than what we currently do (looking at the hostname+pid).  We could eliminate the clock-sync dependence if the waiting process set up its own local timer and then reset it if it sees the timestamp of the lock file has changed (ie. we don’t care what the value is, only whether it changes).  But your suggestion is simpler and automatically handles a new process waiting on a long-dead file in a good way.
> 
> At some level I like that the timeout gives you feedback when a compilation takes much longer than expected.  How about this: in addition to tracking process liveness, we also use a timeout.  When the timeout is reached, we emit a warning that the module build is taking a long time.  We then forcibly delete the lock file and try again to acquire the lock ourselves. This has the added advantage of combatting priority inversion.  If we reach the timeout again (maybe several times?), we emit an error and fail.  What do you think? 
> 
> For the most part, this sounds great to me. I'm not entirely convinced that it makes sense to remove the lock and retry if we can see that another process is periodically touching it.

Okay, but right now any time we delete a lock file we may accidentally delete a valid lock because of races.  I haven’t found a good way to avoid the window between checking a lock file is invalid and deleting that file (which is done by name).

> I think issuing a remark after some timeout (1 minute, maybe?), and possibly failing after a longer timeout, may be reasonable.

I was thinking a warning, but I guess -Werror makes that a bad idea.  Failing instead of clearing the lock file is okay with me.

> 
> We'd need a mechanism to turn off the second timeout (for cases where modules simply take a long time to build for whatever reason) and possibly to turn off the first timeout (to ensure we provide deterministic output).

-Rmodule-timeout-soft
-Wmodule-timeout (DefaultError)

We could theoretically make a -W group that included both for convenience, although that might be weird with a remark...

Another question: where do you think we should touch the file? In a consumer of top-level decls?

> 
>>> Modified:
>>>     cfe/trunk/include/clang/Basic/DiagnosticCommonKinds.td
>>>     cfe/trunk/lib/Frontend/CompilerInstance.cpp
>>> 
>>> Modified: cfe/trunk/include/clang/Basic/DiagnosticCommonKinds.td
>>> URL: http://llvm.org/viewvc/llvm-project/cfe/trunk/include/clang/Basic/DiagnosticCommonKinds.td?rev=228604&r1=228603&r2=228604&view=diff <http://llvm.org/viewvc/llvm-project/cfe/trunk/include/clang/Basic/DiagnosticCommonKinds.td?rev=228604&r1=228603&r2=228604&view=diff>
>>> ==============================================================================
>>> --- cfe/trunk/include/clang/Basic/DiagnosticCommonKinds.td (original)
>>> +++ cfe/trunk/include/clang/Basic/DiagnosticCommonKinds.td Mon Feb  9 14:35:13 2015
>>> @@ -83,6 +83,8 @@ def err_module_not_found : Error<"module
>>>  def err_module_not_built : Error<"could not build module '%0'">, DefaultFatal;
>>>  def err_module_lock_failure : Error<
>>>    "could not acquire lock file for module '%0'">, DefaultFatal;
>>> +def err_module_lock_timeout : Error<
>>> +  "timed out waiting to acquire lock file for module '%0'">, DefaultFatal;
>>>  def err_module_cycle : Error<"cyclic dependency in module '%0': %1">,
>>>    DefaultFatal;
>>>  def note_pragma_entered_here : Note<"#pragma entered here">;
>>> 
>>> Modified: cfe/trunk/lib/Frontend/CompilerInstance.cpp
>>> URL: http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/Frontend/CompilerInstance.cpp?rev=228604&r1=228603&r2=228604&view=diff <http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/Frontend/CompilerInstance.cpp?rev=228604&r1=228603&r2=228604&view=diff>
>>> ==============================================================================
>>> --- cfe/trunk/lib/Frontend/CompilerInstance.cpp (original)
>>> +++ cfe/trunk/lib/Frontend/CompilerInstance.cpp Mon Feb  9 14:35:13 2015
>>> @@ -1022,9 +1022,19 @@ static bool compileAndLoadModule(Compile
>>>      case llvm::LockFileManager::LFS_Shared:
>>>        // Someone else is responsible for building the module. Wait for them to
>>>        // finish.
>>> -      if (Locked.waitForUnlock() == llvm::LockFileManager::Res_OwnerDied)
>>> +      switch (Locked.waitForUnlock()) {
>>> +      case llvm::LockFileManager::Res_Success:
>>> +        ModuleLoadCapabilities |= ASTReader::ARR_OutOfDate;
>>> +        break;
>>> +      case llvm::LockFileManager::Res_OwnerDied:
>>>          continue; // try again to get the lock.
>>> -      ModuleLoadCapabilities |= ASTReader::ARR_OutOfDate;
>>> +      case llvm::LockFileManager::Res_Timeout:
>>> +        Diags.Report(ModuleNameLoc, diag::err_module_lock_timeout)
>>> +            << Module->Name;
>>> +        // Clear the lock file so that future invokations can make progress.
>>> +        Locked.unsafeRemoveLockFile();
>>> +        return false;
>>> +      }
>>>        break;
>>>      }
>>> 
>>> 
>>> 
>>> _______________________________________________
>>> cfe-commits mailing list
>>> cfe-commits at cs.uiuc.edu <mailto:cfe-commits at cs.uiuc.edu>
>>> http://lists.cs.uiuc.edu/mailman/listinfo/cfe-commits <http://lists.cs.uiuc.edu/mailman/listinfo/cfe-commits>
> 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-commits/attachments/20150226/2349a65d/attachment.html>


More information about the cfe-commits mailing list