[PATCH] D93393: [clangd] Ignore the static index refs from the dynamic index files.

Wed Dec 16 09:00:48 PST 2020

sammccall added a comment.

TL;DR: Cool! I think we should complicate the signature of `hasFile()` a bit to avoid calling SwapIndex::snapshot() for each file. There's also some quadratic behavior due to calling index ops from MergeIndex when children are MergeIndexes, but N is small and I think we can solve it later.

---

I like the simplicity/elegance of this solution, and the fact that it avoids the scaling pitfall of trying to return a list of all the indexed files.

I am concerned about performance though. At a high level, we try to make index operations fairly coarse/batched and only do a handful of them. This was originally to allow these operations to be RPCs, but of course this assumption gets baked in in other ways too. Introducing a fine-grained operation like this causes some of the assumptions to be violated.

Concretely, SwapIndex (and ProjectAwareIndex) take a lock to obtain the index snapshot to use, and so here we do that for each call to `hasFile()`. Most indexes are wrapped in SwapIndex for either lazy-loading or dynamic updating.

And we hammer it reasonably hard here, MergeIndex::refs() will call it in a loop for ~100 times, and because there are multiple MergeIndexes in the index stack, we end up with duplication: the first index never gets hasFile(), the second gets it 100 times, the third 200 times, the fourth 300 times. (Currently we have at least 4 indexes in the stack).

Taking 600 locks to call refs() seems... less than ideal. For now they'll probably be uncontended as we only call refs() for interactive actions (find-references and rename), but fundamentally this applies to all requests, not just refs, so I'm not sure this will be true in the future (a fast pseudoparser index will want to consult the previous index to resolve symbols).

This all seems related to the idea that `hasFile()` isn't implemented for RemoteIndex - it's a pragmatic shortcut, but in principle it'd be nice to be able to implement it, and there's some minimum per-RPC overhead.

---

So, can we do anything about all this?

Locking once for each ref: fundamentally we want to lock once to get the snapshot, then query the snapshot for each file. This changing hasFile to return an oracle that can be cheaply queried, e.g. a signature like `unique_function<bool(StringRef)> SymbolIndex::indexedFiles()`.

The quadratic nature of calling any index ops in MergedIndex: I guess probably not without flattening our binary-tree/linked-list of MergedIndex into an N-way merge. Back in the day, we basically had one static index which was huge, and one dynamic index which was tiny, and the no-copy aspect of MergedIndex was appealing (and it was easy to implement!). But we're probably not coming out ahead anymore vs simply committing to copy everything once. The N-way merge would have other advantages too: we could query large/slow indexes in parallel to improve latency.

Implementing a pseudo-batch form of hasFiles as an RPC for remote-index: we'd need to give up on getting answers for files one-by-one. That would be OK for an N-way merge (since we'd fetch all results before merging, we could build a list). But servers typically index *everything* under a certain directory, so fetching the list of directories and evaluating per-file queries client-side is a cheap and reasonable alternative (that's more accurate than `return false`!)

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D93393/new/

https://reviews.llvm.org/D93393