[PATCH] D105177: [clangd] Implemented indexing of standard library

Sam McCall via Phabricator via cfe-commits cfe-commits at lists.llvm.org
Thu Aug 5 05:15:39 PDT 2021


sammccall added a comment.

Sorry about the delay here, and welcome back!

The core design looks sensible to me - we talked about making this part of the background index (which could reuse more code) but we don't actually want the data mixed together.

Main points in the implementation are:

- simplify the exposed interface
- i don't think we need to mess with VFS at all actually
- we should think a little about which index data we actually want to keep/use

Next design questions seem to be about lifetime/triggering:

- how many configurations of stdlib index to have
- when do we build the indexes, and who owns them
- what logic governs whether/which stdlib index is triggered, and where do we put it



================
Comment at: clang-tools-extra/clangd/index/StdLib.cpp:30
+
+const std::string
+    StandardLibraryIndex::StdLibHeaderFileName("/stdlibheaders.cpp");
----------------
nit: use char[] or llvm::stringliteral for constants to avoid global constructors/destructors

(see "static initialization order fiasco" and related destructor issues)


================
Comment at: clang-tools-extra/clangd/index/StdLib.cpp:31
+const std::string
+    StandardLibraryIndex::StdLibHeaderFileName("/stdlibheaders.cpp");
+
----------------
this path should probably use native conventions: "/stdlibheaders.cpp" isn't a good filename on windows.

Since it doesn't really need to exist, maybe #ifdef _WIN32 like we do for testRoot in unittests/TestFS would be nice. Such a virtualRoot() could go in FS.h if you like!

Would be nice to have "virtual" in the path so it's clear in e.g. error messages that it's not a real path.


================
Comment at: clang-tools-extra/clangd/index/StdLib.cpp:33
+
+std::string StandardLibraryIndex::generateIncludeHeader() {
+  std::string Includes;
----------------
This generates the same result every time, if there's any chance we'll call it multiple times (i think there is) maybe rather:

```
llvm::StringRef getIncludeHeader() {
  // construct on first use, never destroy
  static std::string *once = []{
     ...
     return new std::string(std::move(result));
  };
  return *once;
};
```


================
Comment at: clang-tools-extra/clangd/index/StdLib.cpp:36
+  // gather all the known STL headers in a set for depulication
+  std::set<std::string> Headers;
+
----------------
std::set is a poor (wasteful) data structure.

here a simple reasonably-efficient thing would be
vector<StringLiteral> Headers = { ... };
llvm::sort(Headers);
llvm::unique(Headers);

Same big-O, still simple, smaller-faster in practice.


The slightly silly thing is doing this at runtime at all, but that would be a fair amount of work to avoid.


================
Comment at: clang-tools-extra/clangd/index/StdLib.cpp:58
+
+  llvm::IntrusiveRefCntPtr<llvm::vfs::FileSystem> viewImpl() const override {
+    return FS;
----------------
This isn't actually threadsafe though?
Calls to viewImpl() are supposed to return filesystems with independent state (workdir).

Maybe it doesn't actually bite us here but I don't see much cost to actually building the FS on demand. (The threadsafeFS should own the string and use a non-copying buffer)


================
Comment at: clang-tools-extra/clangd/index/StdLib.cpp:66
+StandardLibraryIndex::buildFilesystem(std::string HeaderSources) {
+  auto FS = llvm::makeIntrusiveRefCnt<llvm::vfs::InMemoryFileSystem>();
+  auto Now =
----------------
this contains only the umbrella headers, and doesn't seem to be overlaid on the real filesystem. So how does this work?


================
Comment at: clang-tools-extra/clangd/index/StdLib.cpp:82
+  // TODO: can we get a real compile command from somewhere?
+  Inputs.CompileCommand = tooling::CompileCommand();
+  Inputs.CompileCommand.Directory = "/";
----------------
this is a no-op


================
Comment at: clang-tools-extra/clangd/index/StdLib.cpp:83
+  Inputs.CompileCommand = tooling::CompileCommand();
+  Inputs.CompileCommand.Directory = "/";
+  Inputs.CompileCommand.Filename = StdLibHeaderFileName;
----------------
/ may not be what we want on windows right?


================
Comment at: clang-tools-extra/clangd/index/StdLib.cpp:85
+  Inputs.CompileCommand.Filename = StdLibHeaderFileName;
+  Inputs.CompileCommand.Output = StdLibHeaderFileName + ".o";
+  Inputs.CompileCommand.CommandLine.push_back("clang++");
----------------
this is not used, no need to set it


================
Comment at: clang-tools-extra/clangd/index/StdLib.cpp:88
+  Inputs.CompileCommand.CommandLine.push_back(Inputs.CompileCommand.Filename);
+  Inputs.CompileCommand.CommandLine.push_back("-o");
+  Inputs.CompileCommand.CommandLine.push_back(Inputs.CompileCommand.Output);
----------------
no need for the -o either


================
Comment at: clang-tools-extra/clangd/index/StdLib.cpp:98
+  auto Buffer = FS->getBufferForFile(StdLibHeaderFileName);
+  if (!Buffer)
+    return error("Could not read file from InMemoryFileSystem");
----------------
CreateMemBuffer instead to avoid expressing this impossible error condition?


================
Comment at: clang-tools-extra/clangd/index/StdLib.cpp:101
+  auto Clang = prepareCompilerInstance(std::move(CI), /*Preamble=*/nullptr,
+                                       std::move(*Buffer), FS, IgnoreDiags);
+  if (!Clang)
----------------
hang on, if we're providing the overridden buffer to prepareCompilerInstance (like we do for edited files in clangd that may or may not be saved on disk yet) then why do we need the VFS at all?


================
Comment at: clang-tools-extra/clangd/index/StdLib.cpp:109
+      IndexOpts, [&](SymbolSlab S) { IndexSlabs.Symbols = std::move(S); },
+      [&](RefSlab R) { IndexSlabs.Refs = std::move(R); },
+      [&](RelationSlab R) { IndexSlabs.Relations = std::move(R); },
----------------
I don't think storing refs originating in the standard library is profitable overall.
Functionally I can think of cases it makes better & worse, but it's not key to our core use cases and it's the majority of index size IIRC.


================
Comment at: clang-tools-extra/clangd/index/StdLib.cpp:110
+      [&](RefSlab R) { IndexSlabs.Refs = std::move(R); },
+      [&](RelationSlab R) { IndexSlabs.Relations = std::move(R); },
+      [&](IncludeGraph IG) { IndexSlabs.Sources = std::move(IG); });
----------------
similarly we don't need relations for this index i believe


================
Comment at: clang-tools-extra/clangd/index/StdLib.cpp:111
+      [&](RelationSlab R) { IndexSlabs.Relations = std::move(R); },
+      [&](IncludeGraph IG) { IndexSlabs.Sources = std::move(IG); });
+
----------------
we don't actually use the include graph afaics


================
Comment at: clang-tools-extra/clangd/index/StdLib.cpp:113
+
+  // We're going to run clang here, and it could potentially crash.
+  // We could use CrashRecoveryContext to try to make indexing crashes nonfatal,
----------------
This comment is copied from background index, we don't need it everywhere and it's less relevant here (we're not really compiling arbitrary broken code)


================
Comment at: clang-tools-extra/clangd/index/StdLib.cpp:126
+
+  IndexSlabs.Cmd = Inputs.CompileCommand;
+  assert(IndexSlabs.Symbols && IndexSlabs.Refs && IndexSlabs.Sources &&
----------------
This never gets used, why are we setting it?


================
Comment at: clang-tools-extra/clangd/index/StdLib.cpp:130
+
+  log("Indexed {0} ({1} symbols, {2} refs, {3} files)",
+      Inputs.CompileCommand.Filename, IndexSlabs.Symbols->size(),
----------------
This should mention the standard library


================
Comment at: clang-tools-extra/clangd/index/StdLib.cpp:133
+      IndexSlabs.Refs->numRefs(), IndexSlabs.Sources->size());
+
+  trace::Span Tracer("StandardLibraryIndex");
----------------
we could filter Symbols to only include those in our list, rather than all the private implementation cruft. (I expect cruft is the majority by weight)
Not in this patch, but maybe a fixme?


================
Comment at: clang-tools-extra/clangd/index/StdLib.cpp:134
+
+  trace::Span Tracer("StandardLibraryIndex");
+  SPAN_ATTACH(Tracer, "symbols", int(IndexSlabs.Symbols->size()));
----------------
this tracer has the wrong lifetime and won't measure the actual indexing time, hoist it to the top?


================
Comment at: clang-tools-extra/clangd/index/StdLib.cpp:142
+  if (HadErrors) {
+    log("Failed to compile standard librarx index, index may be incomplete");
+  }
----------------
nit: librarx->library


================
Comment at: clang-tools-extra/clangd/index/StdLib.cpp:145
+
+  auto Index = MemIndex::build(std::move(IndexSlabs.Symbols.getValue()),
+                               std::move(IndexSlabs.Refs.getValue()),
----------------
this should probably be Dex instead of MemIndex unless it's tiny


================
Comment at: clang-tools-extra/clangd/index/StdLib.h:7
+//
+//===----------------------------------------------------------------------===//
+
----------------
This is where an overview of the feature would go :-)


================
Comment at: clang-tools-extra/clangd/index/StdLib.h:19
+namespace clangd {
+
+class StandardLibraryIndex {
----------------
the interface here will probably want to grow to include some sort of enum/config struct for the language/version/whatever we decide to support (at minimum I'd think C vs C++). Fine to just hardcode c++ for now, maybe leave a comment.


================
Comment at: clang-tools-extra/clangd/index/StdLib.h:21
+class StandardLibraryIndex {
+  // TODO: do we really need a class? Or would plain functions be good enough?
+  // TODO: do we really need to make verything public just for unittesting?
----------------
whether you want to use a class for implementation or not, the *interface* exposed by the header file seems like it's just a function (or possibly several if you want to test some details).


================
Comment at: clang-tools-extra/clangd/index/StdLib.h:25
+  /* virtual file name for indexing */
+  static const std::string StdLibHeaderFileName;
+
----------------
nit: i'd call this "entry point" or something, because "std lib header file" already means something.

"umbrella header" is a common term for an entry point header designed to include a lot of other headers


================
Comment at: clang-tools-extra/clangd/unittests/StdLibIndexTests.cpp:33
+/// test building the virtual file system
+TEST(StdLibIndexTests, buildFilesystem) {
+  auto Sli = StandardLibraryIndex();
----------------
I don't think this is a very useful test - it's mostly asserting implementation details, and if the FS is broken the e2e will fail in a really obvious way.


================
Comment at: clang-tools-extra/clangd/unittests/StdLibIndexTests.cpp:44
+TEST(StdLibIndexTests, buildIndex) {
+  auto Sli = StandardLibraryIndex();
+  // TODO: maybe find a way to use a local libcxx for testing if that is
----------------
nit: llvm/clangd don't generally use this auto-everywhere style, rather `StandardLibraryIndex Sli`.
And this would be SLI rather than Sli.


================
Comment at: clang-tools-extra/clangd/unittests/StdLibIndexTests.cpp:47
+  //       available on the machine
+  std::string HeaderMock = R"CPP(
+    int myfunc(int a);
----------------
I'm not sure we're testing what we want to be testing here.

In real life, the symbols are not going to be in the entrypoint, but a file included from it.
And clangd's indexer *does* treat the main file differently from others.

It's pretty awkward to actually fix but maybe worth a comment.


================
Comment at: clang-tools-extra/clangd/unittests/StdLibIndexTests.cpp:55
+  FuzzyFindRequest Req;
+  Req.Query = "myfunc";
+  Req.AnyScope = true;
----------------
not sure why we're setting a filter here - we don't need to test that fuzzyfind works


================
Comment at: clang-tools-extra/clangd/unittests/StdLibIndexTests.cpp:56
+  Req.Query = "myfunc";
+  Req.AnyScope = true;
+  Req.Limit = 100;
----------------
AnyScope and Limit are not needed


================
Comment at: clang-tools-extra/clangd/unittests/StdLibIndexTests.cpp:58
+  Req.Limit = 100;
+  std::vector<std::string> Matches;
+  Index.get()->fuzzyFind(
----------------
```
#include "TestIndex.h"
MATCHER_P(Named, N, "") { return arg.Name == N; }
...
EXPECT_THAT(match(*Index, FuzzyFindRequest{}), ElementsAre(Named("myfunc"), Named("otherfunc")));
```

EXPECT_THAT/ElementsAre give much better error messages (e.g. if there are extra elements, it'll tell you what they are)
If you don't care if there are extra elements, just drop the assertion on size


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D105177/new/

https://reviews.llvm.org/D105177



More information about the cfe-commits mailing list