[PATCH] D55224: [clangd] Introduce loading of shards within auto-index

Ilya Biryukov via Phabricator via cfe-commits cfe-commits at lists.llvm.org
Mon Dec 17 06:26:45 PST 2018


ilya-biryukov added a comment.

Probably the most important suggestion from my side is trying to clearly separate the `enqueue` calls (which actually schedule rebuilds of TUs) using clang and the `loadShards` function.
The index loading should be fast, so I assume we won't win much latency by scheduling the indexing actions in the middle of the shard-loading.
OTOH, separating those two concerns should make the code way more readable (I think, at least).

The concrete proposal is to make `loadShards` actually **only** the shards and return a list of TUs that still need to be rebuilt. As the second step, we can `enqueue` the rebuild of those TUs.

Another important thing that's worth doing is documenting the correctness trade-offs we have in the current model, i.e. when following the include edges of a file (say, `foo.h`) that had its digest changed, we can potentially:

1. load stale symbols for some other file (say, `bar.h`) previously included by it,
2. never update those symbols to fresh ones if the include edge was dropped, i.e. `foo.h` does not include `bar.h` anymore.

The described situation can lead to stale symbols for `bar.h` being kept indefinitely in some cases, but has the advantage of reparsing less TUs when rebuilding the index. Overall it's definitely fine to have this trade-off for the time-being, but would be nice if we could document it.

The rest is mostly NITs and code style comments.

And thanks for the change, it's an impressive piece of work!



================
Comment at: clangd/SourceCode.h:96
+
+// Handle the symbolic link path case where the current working directory
+// (getCurrentWorkingDirectory) is a symlink./ We always want to the real
----------------
We might want to have a guidance on when this function should be used, similar to `getRealPath`. Also mention why it's different from the above.

My understanding is that it's only used for canonicalizing the file path for storing them in the index and nowhere else.
Should we discourage usages outside the index? Maybe we can give a more obscure name for this purpose, e.g. `canonicalizeForIndex` or similar?



================
Comment at: clangd/SourceCode.h:107
+// "/tmp/build/foo.h"
+std::string makeCanonicalPath(llvm::StringRef AbsolutePath,
+                              const SourceManager &SM);
----------------
NIT: maybe name it `canonicalizePath`? A bit shorter. But up to you


================
Comment at: clangd/SourceCode.h:107
+// "/tmp/build/foo.h"
+std::string makeCanonicalPath(llvm::StringRef AbsolutePath,
+                              const SourceManager &SM);
----------------
ilya-biryukov wrote:
> NIT: maybe name it `canonicalizePath`? A bit shorter. But up to you
Maybe put this function right after `getRealPath`? They both "canonicalize" paths, so it makes sense for them to live together.


================
Comment at: clangd/index/Background.cpp:90
+
+llvm::SmallString<128> getAbsolutePath(const tooling::CompileCommand &Cmd) {
+  llvm::SmallString<128> AbsolutePath;
----------------
Why not `vfs->makeAbsolute` instead of this function? Maybe worth a comment?


================
Comment at: clangd/index/Background.cpp:189
+      [this, Storage](tooling::CompileCommand Cmd) {
+        Cmd.CommandLine.push_back("-resource-dir=" + ResourceDir);
+        const std::string FileName = Cmd.Filename;
----------------
Ah, it feels we should move this into a helper function and reuse everywhere.
It's a small detail, but it's so easy to forget about it...

Just complaining, no need to do anything about it...


================
Comment at: clangd/index/Background.cpp:259
+      // if this thread sees the older version but finishes later. This should
+      // be rare in practice.
+      DigestIt.first->second = Hash;
----------------
> "should be rare in practice"
And yet, can we avoid this altogether?

Also, I believe it won't be rare. When processing multiple different TUs, we can easily run into the same header multiple times, possibly with the different contents.
E.g. imagine the index for the whole of clang is being built and the user is editing `Sema.h` at the same time. We'll definitely be running into this case over and over again.


================
Comment at: clangd/index/Background.cpp:309
     if (std::error_code EC =
             SM.getFileManager().getVirtualFileSystem()->makeAbsolute(AbsPath)) {
       elog("Warning: could not make absolute file: {0}", EC.message());
----------------
Does it make sense for it to be part of `makeCanonicalPath`?
Both call sites call `vfs->getAbsolutePath` before calling the function?


================
Comment at: clangd/index/Background.cpp:338
     std::lock_guard<std::mutex> Lock(DigestsMu);
-    if (IndexedFileDigests.lookup(AbsolutePath) == Hash) {
-      vlog("No need to index {0}, already up to date", AbsolutePath);
----------------
This looks like an important optimization. Did we move it to some other place? Why should we remove it?


================
Comment at: clangd/index/Background.cpp:365
   StringMap<FileDigest> FilesToUpdate;
+  // Main file always needs to be updated even if it has no symbols.
+  FilesToUpdate[AbsolutePath] = Hash;
----------------
The comment does not mention **why** should we update it. Is it because of the dependencies?


================
Comment at: clangd/index/Background.cpp:386
   Action->EndSourceFile();
+  if (!Index.Symbols)
+    return createStringError(inconvertibleErrorCode(), "Uncompilable errors");
----------------
Was this error possible before this change too?


================
Comment at: clangd/index/Background.cpp:389
 
-  log("Indexed {0} ({1} symbols, {2} refs, {3} files)",
-      Inputs.CompileCommand.Filename, Index.Symbols->size(),
-      Index.Refs->numRefs(), Index.Sources->size());
+  vlog("Indexed {0} ({1} symbols, {2} refs, {3} files)",
+       Inputs.CompileCommand.Filename, Index.Symbols->size(),
----------------
This log is not getting called more often after this change, right?
Could you send the `log -> vlog` replacement as a separate patch to keep this change more focused? 


================
Comment at: clangd/index/Background.cpp:412
+  // Consumes Symbols and Refs in Shard.
+  auto LoadShard = [this](llvm::StringRef AbsolutePath, IndexFileIn *Shard,
+                          const IncludeGraphNode &IGN) {
----------------
NIT: maybe rename to `AddShardToIndex` or `AddShard`? `LoadShard` is too similar to `IndexStorage::loadShard`, which confused me a little when going through the code for the first time.


================
Comment at: clangd/index/Background.cpp:481
+          if (!Buf)
+            continue;
+          // If digests match then dependency doesn't need re-indexing.
----------------
maybe log the error? would be nice to somehow recover too, but not sure what we can do here.


================
Comment at: clangd/index/Background.cpp:500
+  // Keeps track of the loaded shards to make sure we don't perform redundant
+  // disk io. Keys are AbsolutePaths.
+  llvm::StringSet<> LoadedShards;
----------------
NIT: `s/io/IO`
NIT: `s/AbsolutePaths/absolute paths` (otherwise looks like we're referring to a local variable or a field with this name)


================
Comment at: clangd/index/Background.cpp:502
+  llvm::StringSet<> LoadedShards;
+  for (const auto &File : ChangedFiles) {
+    ProjectInfo PI;
----------------
We used to shuffle the `ChangedFiles` before processing them. I believe the reason for doing that are still relevant. Should we restore this? (Or are we doing it somewhere else now and I missed it?)


================
Comment at: clangd/index/Background.cpp:508
+    BackgroundIndexStorage *IndexStorage = IndexStorageFactory(PI.SourceRoot);
+    auto Dependencies = loadShard(*Cmd, IndexStorage, LoadedShards);
+    for (const auto &Dependency : Dependencies) {
----------------
Maybe return a `vector<string>` with dependencies that need reindexing?  We ignore the dependencies that don't need to be reindexed anyway.


================
Comment at: clangd/index/Background.cpp:511
+      // FIXME: Currently, we simply schedule indexing on a TU whenever any of
+      // its dependencies needs re-indexing. We might do it more smartly by
+      // figuring out a minimal set of TUs that will cover all the stale
----------------
NIT: `s/more smartly/smarter`


================
Comment at: clangd/index/Background.cpp:514
+      // dependencies.
+      if (Dependency.second && !BeingReindexed.count(Dependency.first)) {
+        vlog("Enqueueing {0} for {1}", Cmd->Filename, Dependency.first);
----------------
What is `Dependency.second` and `Dependency.first`? Maybe use a named struct instead of a pair here or deconstruct into named variables?


================
Comment at: clangd/index/Background.cpp:519
+        // try to re-index those.
+        std::for_each(Dependencies.begin(), Dependencies.end(),
+                      [&BeingReindexed](decltype(Dependency) Dependency) {
----------------
Maybe use range-based-for instead? Looks simpler:
```
for (auto Dep : Dependencies)
  BeingReindexed.insert(Dep.first);
```


================
Comment at: clangd/index/Background.h:108
   BackgroundIndexStorage::Factory IndexStorageFactory;
+  // Loads the shards for all the sources and headers of the Cmd. Returns the
+  // list of dependencies for this Cmd and whether they need to be re-indexed.
----------------
Could you elaborate what are the "sources and headers of the Cmd"? The compile command is typically created for a **single** source file, so this comment looks a bit confusing.


================
Comment at: clangd/index/Background.h:111
+  std::vector<
+      std::pair</*Path to dependency*/ std::string, /*NeedsReIndexing*/ bool>>
+  loadShard(const tooling::CompileCommand &Cmd,
----------------
Consider creating a simple struct for this pair. The named field access is much more readable than `.first` and `.second`


================
Comment at: clangd/index/Background.h:114
+            BackgroundIndexStorage *IndexStorage,
+            llvm::StringSet<> &FailedShards);
+  llvm::Error loadShards(std::vector<std::string> ChangedFiles);
----------------
Should this be named `VisitedShards`?


================
Comment at: clangd/index/BackgroundIndexStorage.cpp:80
     OS.close();
+    vlog("Written shard {0} - {1}", ShardPath, ShardIdentifier);
     return llvm::errorCodeToError(OS.error());
----------------
NIT: maybe send in as a separate patch to keep this more focused? (No need to send for review, just commit it)


Repository:
  rCTE Clang Tools Extra

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D55224/new/

https://reviews.llvm.org/D55224





More information about the cfe-commits mailing list