[PATCH] D93873: [clangd] Cache preambles of closed files

Mon Jan 11 08:02:09 PST 2021

sammccall added a reviewer: kadircet.
sammccall added a comment.

Thanks! This is a really good idea, but not without its risks :-) Sorry for not getting to this for a while!

My only *really* high-level design question is: I wonder what the tradeoffs are in keeping just the preamble vs the whole ASTWorker. I'd *expect* this approach gets almost all the benefit without complicating the basic assumptions around ASTWorker, so it's probably the right way to go. Haven't thought about it much though...

---

I've got a bit of cognitive load because there's a few preamble-related ideas we've had, and I'm trying to work out how they interact. Let's start by writing them down...

1. a *persistent* cache so closing+reopening clangd loses less state. (This is complicated because only the PCH is easily serializable, the rest of the PreambleData struct isn't)
2. building caches of preambles while background-indexing (this would be great for modules but is probably way too big for whole preambles)
3. reusing the "wrong" preamble initially when you open a new file, to give some basic functionality (using existing preamble patching logic, just in a more aggressive scenario)
4. having the disk-based storage unlink the file preemptively, to eliminate any chance of leaking the *.pch

1&2 are both big projects, so I'm not really worried about the overlap or at least not going to try to predict how it'd be resolved.

3 is actually made *easier* by this patch I think: having a central repository of preambles makes it more natural to grab some "other" preamble.

4 I think doesn't really interact at all: the PreambleData object would own a file descriptor instead of a file, but it's all much the same.

So I think in summary there's not really any conflict to resolve with these ideas. cc @kadircet though who's done more thinking about #1 and #3 than I have.

---

I think we need to be fairly careful about policies around this cache. Preambles are large (we frequently see several hundred MB). Some workflows involve opening many files at a time. Some workflows end up running multiple copies of clangd on the same files. Some configurations keep them in memory rather than on disk. So a too-large cache could waste quite a lot of resources.

So, some pointy questions:

- landing so close to the 12 release, should we conservatively default this to 0, and require opt-in?
- is MB or #preambles a better limit to the cache size?
- should we take size into account when deciding what to evict? (my guess is no, cost scales with size, and value scales with size * probability of reuse, so we should purely optimize for probability of reuse)
- can we do better than LRU? The cache is accessed so infrequently and misses are so horrendously expensive that we could certainly affort to e.g. track usage history of every file ever opened, if it would help performance and not add too much complexity.
- not a question, but I can say for sure that 1000 with no size limit isn't a safe default for disk :-(

================
Comment at: clang-tools-extra/clangd/TUScheduler.cpp:193
+  /// Get the preamble associated with a \p Key, removing
+  /// it from the cache
+  std::shared_ptr<const PreambleData> take(llvm::StringRef Key) {
----------------
we might instead consider keeping "active" preambles in the cache, and simply considering their cost to be 0/ineligible for eviction if the shared_ptr::usage_count > 1. 
This allows this cache to be a "registry" so we can try using a preamble from a different TU as mentioned above.

(but this could also be done later, or could be done with a separate table of weak_ptrs. No change needed for this patch, just thinking out loud)

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D93873/new/

https://reviews.llvm.org/D93873