[PATCH] D53433: [clangd] auto-index stores symbols per-file instead of per-TU.

Sam McCall via Phabricator via cfe-commits cfe-commits at lists.llvm.org
Mon Oct 22 09:59:04 PDT 2018


sammccall added inline comments.


================
Comment at: clangd/index/Background.h:68
+  FileSymbols IndexedSymbols;
+  FileDigests IndexedFileDigests; // Keyed by file URIs.
 
----------------
Keying this with file URIs seems suspicious to me - why this change?

It seems to be motivated by the current implementation where the partitioning is done post-hoc by looking at strings that contain file URIs.
But this seems likely to change in the long term, and producing data that's sharded in the first place. At that point it'll be just as natural and semantically cleaner just to use file paths, I think.

If performance of converting URI -> path is a concern in the meantime, can we just add a cache to that loop?


================
Comment at: clangd/index/FileIndex.cpp:121
+  if (MergeSymbols) {
+    // Merge slabs into a single slab and only keep alive the merged.
+    SymbolSlab::Builder Merged;
----------------
I don't think you want to build a new slab and copy all the strings here.

This seems like it's saving data (you're not keeping everything alive!) but in practice that only allows GC of data that changes, and most data doesn't change. So mostly this just introduces another copy of the data.

So I think you just want a `vector<Symbol> MergedSymbols`, and either push all the pointers in there or use `concat(make_pointee_range(SymbolPointers), MergedSymbols)`


================
Comment at: clangd/index/FileIndex.h:49
+/// rename the existing FileSymbols to something else e.g. TUSymbols?
+class SymbolsGroupedByFiles {
+public:
----------------
ioeric wrote:
> sammccall wrote:
> > `FileSymbols` isn't actually that opinionated about the data it manages:
> >  - the keys ("filenames") just identify shards so we can tell whether a new shard is an addition or replaces an existing one.
> >  - the values (tu symbols/refs) are merged using pretty generic logic, we don't look at the details.
> > 
> > I think we should be able to use `FileSymbols` mostly unmodified, and store digests in parallel, probably in `BackgroundIndexer`. Is there a strong reason to store "main file" digests separately to header digests?
> > 
> > There are a couple of weak points:
> >  - The merging makes subtle assumptions: for symbols it picks one copy arbitrarily, assuming there are many copies (merging would be expensive) and they're mostly interchangeable duplicates. We could add a constructor or `buildIndex()` param to FileSymbols to control this, similar to the IndexType param. (For refs it assumes no duplicates and simply concatenates, which is also fine for our purposes).
> >  - `FileSymbols` locks internally to be threadsafe while keeping critical sections small. To keep digests in sync with FileSymbols contents we probably have to lock externally with a second mutex. This is a little ugly but not a real problem I think.
> Good idea! Thanks!
> 
> > FileSymbols locks internally to be threadsafe while keeping critical sections small. To keep digests in sync with FileSymbols contents we probably have to lock externally with a second mutex. This is a little ugly but not a real problem I think.
> This doesn't seem to be a problem as `Background::index()` currently assumes single-thread? I can either make it thread-safe now or add a `FIXME`. WDYT?
Right, currently we're only reading/writing with a single thread. The mutex thing is more for the future. Don't bother with a FIXME, when we add multiple threads we'll need to work out what to lock.


================
Comment at: clangd/index/FileIndex.h:41
+using FileDigest = decltype(llvm::SHA1::hash({}));
+using FileDigests = llvm::StringMap<FileDigest>;
+
----------------
as noted elsewhere I think this is the wrong place for this decl.
The values are somewhat self-explanatory, but the keys need to be documented.


================
Comment at: clangd/index/FileIndex.h:68
   // The index keeps the symbols alive.
+  // If \p MergeSymbols is true, this will merge symbols from different files;
+  // otherwise, a random one is picked, which is less accurate but faster to
----------------
can we pull out something like `enum class DuplicateHandling { PickOne, Merge }`?

I think that's much clearer at the callsite than `/*MergeSymbols=*/true`, and not really longer.


================
Comment at: clangd/index/IndexAction.h:12
 #define LLVM_CLANG_TOOLS_EXTRA_CLANGD_INDEX_INDEX_ACTION_H
+#include "FileIndex.h"
 #include "SymbolCollector.h"
----------------
This looks like an inverted dependency. The FileDigests struct should be part of symbolcollector, no?


================
Comment at: clangd/index/IndexAction.h:33
+    std::function<void(RefSlab)> RefsCallback,
+    std::function<void(FileDigests)> FileDigestsCallback);
 
----------------
thinking about what we eventually want here:
 - the index action needs to tell the auto-indexer what the file content or digests are
 - the auto-indexer needs to tell the index action which files to index symbols from
 - this decision is in turn based on the digests

What about this design:
 - `createStaticIndexingAction` accepts a filter-function: "should we index symbols from file X"?
 - `SymbolCollector` invokes this once per file and caches the result.
 - the signature is something like `bool(SourceManager&, FileEntry*)`
 - auto-index passes a function that reads the contents from the FileEntry, checks whether the hash is up-to-date to return true/false, and **stores the resulting hashes** for later

That way the caching/hash stuff never actually leaks outside auto-index that needs it, and we get the interface we need to precisely avoid generating Symbols we don't want.

(That function could just always return true in this patch, if the filtering isn't trivial).

It would mean that auto-index would need to redundantly compute the URI once per file, I don't think that matters.


Repository:
  rCTE Clang Tools Extra

https://reviews.llvm.org/D53433





More information about the cfe-commits mailing list