[PATCH] D55224: [clangd] Introduce loading of shards within auto-index

Tue Jan 8 09:32:33 PST 2019

ilya-biryukov added a comment.

Most comments are NITs, the major one that I'd suggest paying most attention to is about rewriting newer results with an older ones.

================
Comment at: clangd/index/Background.cpp:259
+      // if this thread sees the older version but finishes later. This should
+      // be rare in practice.
+      DigestIt.first->second = Hash;
----------------
kadircet wrote:
> ilya-biryukov wrote:
> > > "should be rare in practice"
> > And yet, can we avoid this altogether?
> > 
> > Also, I believe it won't be rare. When processing multiple different TUs, we can easily run into the same header multiple times, possibly with the different contents.
> > E.g. imagine the index for the whole of clang is being built and the user is editing `Sema.h` at the same time. We'll definitely be running into this case over and over again.
> Well I am open to ideas, but currently there is no way of knowing whether this is the newer version for the file. Because only information we have is the digest is different than what we currently have and this doesn't yield any chronological information.
Do we know which request is "newer" when scheduling it?
If so, we could keep the map of the **latest** hashes of the files we've seen (in addition to the `IndexedFileDigests`, which correspond to the digest for which we built the index IIUC).

This would give us a simple invariant of the final state we want to bring the index to: `IndexedFileDigests` should correspond to the latest hashes seen so far. If not, we have to rebuild the index for the corresponding files. That, in turn, gives us a way to resolve conflicts: we should never replace with symbols built for the latest version (hash) of the file we've seen.

Would that work?

================
Comment at: clangd/index/Background.cpp:338
     std::lock_guard<std::mutex> Lock(DigestsMu);
-    if (IndexedFileDigests.lookup(AbsolutePath) == Hash) {
-      vlog("No need to index {0}, already up to date", AbsolutePath);
----------------
kadircet wrote:
> ilya-biryukov wrote:
> > This looks like an important optimization. Did we move it to some other place? Why should we remove it?
> We kind of moved it into loadShards logic by not running indexing for same file multiple times. I needed to delete this check since we might wanna run indexing on a TU, even if it is up-to-date, to get new symbols coming from one of it's includes.
Thanks for the explanation, removing the optimization makes sense, we should check for dependency hashes in addition to the original file now.

================
Comment at: clangd/index/Background.cpp:481
+          if (!Buf)
+            continue;
+          // If digests match then dependency doesn't need re-indexing.
----------------
kadircet wrote:
> ilya-biryukov wrote:
> > maybe log the error? would be nice to somehow recover too, but not sure what we can do here.
> Well, if this happens we don't have access to file's contents for some reason. I don't think we can do anything. We might actually want to skip loading these files into index, since they are most likely gone and symbols coming from them won't be accessible. WDYT?
Logging the error is enough, thanks!

I think the ideal recovery is tracking whenever the files we were missing were added to the filesystem and rebuilding the index when that happens.
That probably requires more work than we'd like to put into it, though, so I don't think investing in this now makes any sense.

================
Comment at: clangd/index/Background.cpp:261
+      auto Hash = I.second.Digest;
+      std::lock_guard<std::mutex> Lock(DigestsMu);
+      // Skip if file is already up to date.
----------------
This means lock/unlock on every iteration of the loop.
Could we move collecting the files we need to process into a separate loop that would only require a single lock over the whole loop?

================
Comment at: clangd/index/Background.cpp:414
+                           BackgroundIndexStorage *IndexStorage,
+                           llvm::StringSet<> &VisitedShards) {
+  // Adds the given shard's contents into BackgroundIndex.
----------------
NIT: s/VisitedShards/LoadedShards?
To better align with the name of the local var used in `loadShards()`

================
Comment at: clangd/index/Background.cpp:416
+  // Adds the given shard's contents into BackgroundIndex.
+  // Consumes Symbols and Refs in Shard.
+  auto AddShardToIndex = [this](llvm::StringRef AbsolutePath,
----------------
NIT: Instead of writing this comment, we could accept `IndexFileIn &&` as a parameter, which clearly states we're consuming it (this would require handling the `nullptr` case in the call sites, but that should be ok).

================
Comment at: clangd/index/Background.cpp:425
+    if (Shard->Symbols)
+      SS = llvm::make_unique<SymbolSlab>(std::move(*Shard->Symbols));
+    if (Shard->Refs)
----------------
NIT: maybe init at the declaration site? i.e. `unique_ptr<Slab> SS = Shared->Symbols ? llvm::make_unique<>() : nullptr;`
Same for `RS`.

================
Comment at: clangd/index/Background.cpp:434
+      // if this thread sees the older version but finishes later. This
+      // should be rare in practice.
+      IndexedFileDigests[AbsolutePath] = IGN.Digest;
----------------
Would it make sense to collect all symbols into a local variable and only update under a lock once the whole loading process is finished?
I guess that means ~the same memory requirements, but much less locks and unlocks.

================
Comment at: clangd/index/Background.cpp:468
+    if (!Shard || !Shard->Sources) {
+      vlog("Failed to load shard: {0}", CurDependencyPath);
+      continue;
----------------
Should we mark the file as requiring re-indexing at this point?
Not sure how that might happen in practice, but still...

================
Comment at: clangd/index/Background.cpp:473
+    for (const auto &I : *Shard->Sources) {
+      if (auto U = URI::parse(I.getKey())) {
+        if (auto AbsolutePath = URI::resolve(*U, CurDependencyPath)) {
----------------
NIT: use early exits to reduce nesting. See [[ https://llvm.org/docs/CodingStandards.html#use-early-exits-and-continue-to-simplify-code | LLVM Style Guide ]].

```
auto U = URI::parse(...);
if (!U) 
  continue;
auto AbsolutePath = URI::resolve(...);
if (!AbsolutePath)
  continue;
...
```

================
Comment at: clangd/index/Background.cpp:475
+        if (auto AbsolutePath = URI::resolve(*U, CurDependencyPath)) {
+          if (*AbsolutePath != CurDependencyPath) {
+            if (InQueue.try_emplace(*AbsolutePath).second)
----------------
Why do we need an extra check for self-edges here? Shouldn't `InQueue.try_emplace` handle this too?

================
Comment at: clangd/index/Background.cpp:503
+// Goes over each changed file and loads them from index, in the case of a
+// missing or out-dated shard enqueues an indexing operation.
+llvm::Error BackgroundIndex::loadShards(
----------------
NIT: s/enqueues an indexing operation/add to a list of files requiring reindexing
(or any other wording with the same meaning, the important part is that it does not enqueue any indexing operations anymore)

================
Comment at: clangd/index/Background.cpp:510
+  // re-index same dependencies more than once. Keys are AbsolutePaths.
+  llvm::StringSet<> BeingReindexed;
+  // Keeps track of the loaded shards to make sure we don't perform redundant
----------------
Now that we don't actually schedule the indexing, maybe rename this to something like `FilesToIndex`? The current name suggests there's some processing on the files going somewhere in the background, which is not the case.

================
Comment at: clangd/index/Background.cpp:525
+      // out a minimal set of TUs that will cover all the stale dependencies.
+      if (Dependency.NeedsReIndexing &&
+          !BeingReindexed.count(Dependency.Path)) {
----------------
NIT: use early exits to reduce nesting.

```
if (!Dependency.NeedsReIndexing || BeingReindexed.count(...))
  continue;
...
```

================
Comment at: clangd/index/Background.cpp:527
+          !BeingReindexed.count(Dependency.Path)) {
+        vlog("Enqueueing {0} for {1}", Cmd->Filename, Dependency.Path);
+        NeedsReIndexing.push_back({std::move(*Cmd), IndexStorage});
----------------
Maybe be a bit more verbose in the message to clearly state which file is a dependency? Something like `Enqueuing TU {0} because its dependency {1} needs re-indexing`.

================
Comment at: clangd/index/Background.h:110
+    std::string Path;
+    bool NeedsReIndexing;
+  };
----------------
NIT: Provide the default value to avoid UB on uninitialized values.

================
Comment at: clangd/index/Background.h:112
+  };
+  // Loads the shards for all the sources that was touched by this Cmd when the
+  // index was build. Returns the list of sources and whether they need to be
----------------
NIT: `all the sources that was touched by this Cmd` is a bit confusing (e.g. what does "touched by compile command" mean?) Maybe rephrase to something like `Loads the shards for a single TU and all of its dependencies`

================
Comment at: clangd/index/Background.h:121
+      std::vector<std::pair<tooling::CompileCommand, BackgroundIndexStorage *>>
+          &NeedsReIndexing);
+  void enqueue(tooling::CompileCommand Cmd, BackgroundIndexStorage *Storage);
----------------
Maybe return `Expected<vector<..>>` instead of using an out parameter?

Repository:
  rCTE Clang Tools Extra

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D55224/new/

https://reviews.llvm.org/D55224