[PATCH] D109506: [RFC] Print current request context along with the stack trance in clangd

Fri Sep 10 13:12:32 PDT 2021

sammccall added a comment.

Well, thanks for such a well-thought-out patch :-) I have to admit my first reaction was terror, but this seems solid to me.

================
Comment at: clang-tools-extra/clangd/TUScheduler.cpp:1381
+  auto &OS = llvm::errs();
+  OS << "Signalled during AST action:\n";
+  auto &Command = FileInputs.CompileCommand;
----------------
0x1eaf wrote:
> sammccall wrote:
> > the name of the action is available as CurrentRequest->Action, which I think is safe to either pass into this function or read directly here.
> I've noticed the the action name is always hardcoded as “Build AST ({version})” and just logged the version directly below, but I guess it's not forward compatible, so I'm going to log both…
> the action name is always hardcoded as “Build AST ({version})”

This shouldn't *always* be the case, there should be entries like "Hover" etc too, corresponding to `WorkScheduler->runWithAST("Hover", ...)` in ClangdServer.cpp.

However the "build AST" actions are the ones that run the clang parser, so they're the slowest and crashiest for sure!

================
Comment at: clang-tools-extra/clangd/TUScheduler.cpp:1392
+  OS << "Contents:\n";
+  OS << FileInputs.Contents << "\n";
+}
----------------
0x1eaf wrote:
> sammccall wrote:
> > sammccall wrote:
> > > hmm, this is potentially really useful (could gather full reproducers) but also is going to dominate the output and make it annoying to see the stack traces/other details. I'm worried this may make this feature net-negative for a majority of users, who won't find the right parts of the output to report...
> > > 
> > > How critical do you think this is for debugging crashes in practice? (I suspect fairly important!)
> > > Since we're taking liberties with async-signal-safety anyway, it might be possible to get this sent to a tempfile, which might be more useful. In any case, we might want full file dumps to be off by default outside of environments where anyone is likely to actually try to get the reproducers.
> > > 
> > > I'd probably suggest leaving this out of the initial patch entirely so we can focus on getting the rest of the mechanism right.
> > another thought along the lines of reproducers (but nothing for this patch)...
> > 
> > If we're working on a file and have a preamble, then the headers that make up the preamble are relevant. In practice it's hard to reproduce bug reports we get because we need all their headers. If we're using a preamble then just dumping all the filenames it includes would make it possible for some tool to take the crash report and zip up a reproducer.
> > 
> > (This probably needs the ability to run all of the handlers for the current thread, not just the top of the stack, since we acquire the relevant preamble later)
> > How critical do you think this is for debugging crashes in practice?
> 
> For us it would likely be of utmost importance as our VSCode extension routes requests to [Glean](https://glean.software) until local clangd finishes the compilation, so clangd ends up mostly being used for modified files, and the crashes might be unreproducible on the original source…
> 
> > I'd probably suggest leaving this out of the initial patch entirely so we can focus on getting the rest of the mechanism right.
> 
> Alright, will remove the file contents with the next update.
> 
> > In any case, we might want full file dumps to be off by default outside of environments where anyone is likely to actually try to get the reproducers.
> 
> What would be the best way to re-add it in a subsequent revision: make it configurable via a command line option, or via an environment variable? And if it's just an env var, maybe we could include it conditionally in this revision?
> 
> > If we're using a preamble then just dumping all the filenames it includes would make it possible for some tool to take the crash report and zip up a reproducer.
> 
> We had discussions within the team about potentially implementing a similar tool, but the rough idea was to rely on VCS to collect the changes as the delta would normally be smaller than bundling all the headers that the TU includes. But if the VCS-agnostic approach would be simpler and more portable — I'm all for it :)
> What would be the best way to re-add it in a subsequent revision: make it configurable via a command line option, or via an environment variable? And if it's just an env var, maybe we could include it conditionally in this revision?

Yeah an env var seems tempting to me. I can't put my finger on *why* it feels like an appropriate way to configure this vs a flag, but at least not having to plumb it around is nice.
I'd be happy with `if (getenv("CLANGD_CRASH_DUMP_SOURCE"))` or something like that in this patch.

> the rough idea was to rely on VCS to collect the changes

Right, this is probably why we don't have such a thing yet - at work we can just dump the VCS info and be done. Being able to get self-contained bug reports from more diverse environments would be nice to have, but hasn't been urgent.

================
Comment at: clang-tools-extra/clangd/support/ThreadSignalHandler.cpp:15
+
+static llvm::sys::ThreadLocal<ThreadSignalHandlerCallback> CurrentCallback;
+
----------------
0x1eaf wrote:
> sammccall wrote:
> > sammccall wrote:
> > > sammccall wrote:
> > > > we assume `thread_local` in clangd (see support/Context.ccp)
> > > > 
> > > > It's possible that llvm::sys::ThreadLocal happens to play nicer with signal handlers in practice (none of these are completely safe!), but unless we've seen some concrete evidence I'd rather just have `static thread_local ThreadSignalHandlerCallback*` here
> > > we can quite easily support a stack of handlers, such that we dump *all* living handlers on the crashing thread.
> > > 
> > > just give each ThreadSignalHandler a `ThreadSignalHandler* Next = nullptr` member, and both the constructor and destructor `swap(Next, CurrentCallback)`. Then CurrentCallback points to the head of a linked list of handlers.
> > > 
> > > Not sure if you want to have that in this patch, but it's pretty simple so I wouldn't object.
> > Sorry, just realized you already suggested the stacking in the patch description!
> Having `Next` pointer chains makes me worried about out of order destruction, but if we are removing the move constructor and making the crash handler object scope-bound always, that could actually work. But if we want to keep the possibility of storing the handler in an object to preserve it across method calls, I guess we could just make this a doubly-linked list and cut out the link gracefully.
I think simplicity is probably more important than being totally flexible here, so I'd just forbid moving and out-of-order destruction.
If we end up wanting to move these crash handlers across scopes I'd suggest we just wrap them in `unique_ptr` in those cases. (And make it the creator's job to ensure their lifetimes nest properly).

================
Comment at: clang-tools-extra/clangd/support/ThreadSignalHandler.cpp:22
+      ThreadSignalHandlerCallback *Callback = CurrentCallback.get();
+      // TODO: ignore non thread local signals like SIGTERM
+      if (Callback) {
----------------
0x1eaf wrote:
> sammccall wrote:
> > I think we should do this already, probably just starting with an allowlist of `SIGBUS, SIGFPE, SIGILL, SIGSEGV` (these are signals that are delivered to the thread that triggered them per the linux docs).
> > 
> > I'm also wondering how portable this assumption/mechanism is, e.g. whether we should enable it for certain platforms only. I guess it's probably true for unixy things, and apparently mostly true for windows too[1]? I'm a little concerned *something* may go wrong on windows, though.
> > 
> > [1] https://stackoverflow.com/questions/15879342/which-thread-are-signal-handlers-e-g-signalsigint-crtl-c-called-on
> I've just looked into adding filtering by signal number and there doesn't seem to be a way to do this portably in LLVM. But it looks like it might be safe enough to skip the filtering at all:
> * From a cursory look at Windows `Signals.inc` it looks like the “signal” handling is implemented via [SetUnhandledExceptionFilter()](https://docs.microsoft.com/en-us/windows/win32/api/errhandlingapi/nf-errhandlingapi-setunhandledexceptionfilter) which is thread-directed, and the `SetConsoleCtrlHandler()` callback [does not call](https://reviews.llvm.org/source/llvm-github/browse/main/llvm/lib/Support/Windows/Signals.inc$843-846) the registered handlers.
> * And on Unix platforms the `AddSignalHandler()` callbacks are called [only](https://reviews.llvm.org/source/llvm-github/browse/main/llvm/include/llvm/Support/Signals.h$62-64) for [KillSigs](https://reviews.llvm.org/source/llvm-github/browse/main/llvm/lib/Support/Unix/Signals.inc$217) out of which only `SIGQUIT` seem to be process-directed and would be delivered to [any thread](https://man7.org/linux/man-pages/man7/signal.7.html) that is not blocking it, but if the thread gets delivered to has a `ThreadCrashReporter` installed during the interrupt — it seems to make sense to run the handler and let it print the thread-local context information (at least there seems to be no harm in doing so).
> 
> What do you think?
This sounds fine to me.

In a perfect world I'd rather SIGQUIT didn't dump the state of an essentially-random thread (we always have lots). But SIGQUIT isn't the common case, it's confusing but not otherwise harmful, and it seems like it'd be hard to fix.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D109506/new/

https://reviews.llvm.org/D109506