[PATCH] D153114: [clangd] [C++20] [Modules] Support C++20 modules for clangd

Thu Aug 17 06:57:29 PDT 2023

sammccall added a comment.

Sorry for the long radio silence here. There's a lot to chew on, and I put it off too long. Thanks for your patience!

I agree we should get experimental modules support landed in some form on the LLVM 18 timeline.
It's fine if this doesn't have extension points for large projects, or for header modules, and if it'll take some refactoring later to get there. Still, I think we should consider them in the architecture (partly because a second impl helps reason about interfaces).

I do think it's important we play nicely with the rest of clangd infrastructure: things like threading/context, VFS clean-ness, ability to easily write gtests for modules-enabled ASTs.

This initial pass probably mostly comes across as a list of complaints... I'd like to be more constructive & provide ideas, but this is long already so I'll follow up with that.

I've tried to limit to mostly high-level comments (or at least details that require a bit of thought!)
A couple more that don't fit anywhere:

---

Indexing
--------

Much of clangd's functionality relies on having an up-to-date index of the transitively included headers (and now, modules).
While we're more relaxed about having a full-project index, this preamble index really is essential.

It's infeasible to build this index from the PCM files themselves: this means fully deserializing them and is too expensive (we tried this with PCH).
Traditionally we indexed the preamble the ASTContext before serializing the PCH, though recently this is optionally async.
(You can see this in `onPreambleAST` in ClangdServer.cpp and trace from there.)

I think indexing can be left out of this initial patch, but we should work out where/how it fits and leave a comment!

---

Scope and incremental development
---------------------------------

There are a lot of moving pieces to this. That's useful to understand how they all fit together (which is necessary whether they're in the same patch or not).
However it makes difficult to review, and to drive a review to a conclusion. I think we should change the scope of the initial patch to establish a small subset that works and we understand well:

**Don't attempt any cross-file or cross-version coordination**: i.e. don't try to reuse BMIs between different files, don't try to reuse BMIs between (preamble) reparses of the same file, don't try to persist the module graph. Instead, when building a preamble, synchronously scan for the module graph, build the required PCMs on the single preamble thread with filenames private to that preamble, and then proceed to build the preamble.

Obviously this no longer provides any performance benefits from modules! But it gets us to a modules-based build that works, and these aspects can be added on top in a relatively clear way.
It avoids a lot of tricky problems:

- scheduling
- (lack of) transactionality of reads/writes from the module graph
- the possibility of BMI version conflicts
- various events that can invalidate the module graph
- rebuilding of BMIs when source files change (the current patch deliberately doesn't do this to keep the scope down, but that means the result is both complex and incorrect, which is a difficult situation to get out of)

I think this also encourages us to unbundle some of the "ModulesManager"'s responsibilities and will lead to a better API there.

**Do consider the lifecycle/namespacing of BMI files on disk**

While we may eventually want to reuse BMIs between (sequential or concurrent) clangd instances, this is complicated and IMO won't happen soon (needs to deal with source versioning, I believe).

Therefore for now we should treat the BMIs as transient, conceptually like in-memory objects. In practice we want to store them on disk for size reasons, but these should have transient tempfile names that will never conflict, should be deleted on shutdown etc. (It's fine to do basic delete-on-destruction and leave crash-handling etc out for now).

================
Comment at: clang-tools-extra/clangd/ModulesManager.cpp:50
+
+  DependencyScanningTool ScanningTool(Service);
+
----------------
the docs suggest this tool is going to spawn threads.

clangd needs to retain control of thread spawning, e.g. to ensure clangd::Context is propagated (required for config, tracing and others), thread priorities are set appropriately etc.

================
Comment at: clang-tools-extra/clangd/ModulesManager.cpp:55
+  llvm::Expected<P1689Rule> P1689Rule =
+      ScanningTool.getP1689ModuleDependencyFile(
+          *Cmd, Cmd->Directory, MakeformatOutput, MakeformatOutputPath);
----------------
It looks like this performs IO for the dependency scanning.
The files may not be on a real filesystem, IO on source files should go through the configured VFS (see ThreadsafeFS from which you can obtain a concrete FS for the desired operation).

================
Comment at: clang-tools-extra/clangd/ModulesManager.cpp:56
+      ScanningTool.getP1689ModuleDependencyFile(
+          *Cmd, Cmd->Directory, MakeformatOutput, MakeformatOutputPath);
+
----------------
it seems like the library under some circumstances expects to be able to write output to physical disk, and choose the path to do so (MakeformatOutputPath?)

What purpose does this serve here, and is it possible to avoid it?

While potentially it's OK to write to a temporary directory, we need to ensure concurrent clangd instances don't conflict, the files get eventually cleaned up after crashes, we're not writing too much in environments without physical disks etc.

================
Comment at: clang-tools-extra/clangd/ModulesManager.cpp:106
+// TODO: Try to use the existing BMIs, which is created in last invocation of
+// clangd.
+void ModulesManager::InitializeGraph(PathRef Filename) {
----------------
note that clangd invocations are not necessarily sequential, there may be multiple clangd instances operating on the same codebase at the same time.

So we either need to ensure this store can be shared across instances without races, or treat it as transient (separate dir for each clangd process, clean up on shutdown and on crash)

================
Comment at: clang-tools-extra/clangd/ModulesManager.cpp:320
+    llvm::StringRef Filename = Worklist.pop_back_val();
+    Nodes[Filename.str()]->BMIPath.reset();
+
----------------
This seems to be the only place that BMI path is cleared, which will cause the BMI to be rebuilt next time it is needed, correct?

We need to rebuild the BMI every time the content of a module has changed, though: we need the exposed APIs to be up-to-date (and the SourceLocations etc).
Nothing seems to be doing this, so it seems we're going to keep using stale copies of headers forever.

(CDB.watch isn't anywhere close to sufficient here: it's best effort, it's async, and it only tells when compile flags changed, not the file content).

================
Comment at: clang-tools-extra/clangd/ModulesManager.cpp:525
+
+  assert(!HasInvalidDependencies(Path) &&
+         "It is meaningless to require to generate module interface if there "
----------------
this doesn't seem like a valid thing to assert here, or anywhere

we don't hold the lock at this point, so the graph may have changed since whatever precondition the caller established.
In particular, it may now have invalid dependencies.

(This seems like a manifestation of the idea that we need some idea of snapshots in the graph and versioning of the BMIs we create)

================
Comment at: clang-tools-extra/clangd/ModulesManager.cpp:561
+
+  for (const auto &Name : Names)
+    GenerateModuleInterface(Name);
----------------
How do we handle the following scenario?

1. we get a call to GenerateModuleInterfacesFor("a.cpp"), where a.cpp => X => Z
2. we recursively build Z.pcm#1 and X.pcm and we're just about to return...
3. now somehow we find out that Z changed (and clear its BMI path)
4. and we get a call to GenerateModuleInferfacesFor("b.cpp"), where b.cpp => Y => Z
5. now we recursively build Z.pcm#2 and Y.pcm
6. finally the original call to GenerateModuleInterfacesFor("a.cpp") returns

---

- In this end state, we have modules for X and Z, but they can't be used together: X.pcm was build against Z.pcm#1 but we have Z.pcm#2. So a.cpp produces a mysterious PCM-related error.
- If we solve this by invalidating X's BMI path when we invalidate Z's, then we just end up with a.cpp failing to compile with a missing PCM.
- if we solve this by having GenerateModuleInterfacesFor basically loop until everything is up to date, then we stop guaranteeing forward progress when the system is busy, and *still* race because things can get out of date as soon as we return

I believe solving this class of problem probably means explicitly accounting for versions, and keeping refcounted versioned PCMs that are pinned while we're using them.

================
Comment at: clang-tools-extra/clangd/ModulesManager.h:28
+
+namespace detail {
+/// A simple wrapper for AsyncTaskRunner and Semaphore to run modules related
----------------
these details seem unused outside the cpp file (and untested), so should be in the cpp file - this would make it easier to understand which part of this file is the interface

================
Comment at: clang-tools-extra/clangd/ModulesManager.h:56
+/// its owner.
+class ModulesDependencyGraph {
+  struct ModulesDependencyNode {
----------------
Concretely, for the implementation where we scan the whole project up front, the data structure here can be an in-memory graph.

However rather than having fine-grained methods to query attributes of the graph, I think we should work out which high-level/coarse-grained queries we should support.

The current API leads to locking/unlocking the graph many separate times to obtain/update different information, and it's very difficult to reason about whether the behavior is correct if the graph changes in between. Coarse-grained single queries can be made atomic and queried once.

This will also result in an interface much more suitable for other implementation strategies, such as querying a build system that understands modules. (We will need to have this available as an extension point, as scanning the whole project does not scale to larger codebases)

================
Comment at: clang-tools-extra/clangd/ModulesManager.h:116
+  ///
+  /// Note that this doesn't include third party modules for which the source
+  /// codes can't be found.
----------------
I'd like to choose a different name than "third-party", which often (at least inside LLVM and inside google) refers to "vendored" libraries where we *do* always have headers and often build entirely from source.

Moreover, I think at least initially we should not support these in any way, and ensure we produce an error for the import statement.
In the majority and baseline case, the compiler may not be clang, and is probably at least a different version with incompatible BMI files.
(There may eventually be cases where we want to try making use of externally provided BMIs, but this is mostly orgthogonal to third-party-ness, I tihnk)

This means the signature here shouldn't need to call out third-party modules as special - these should be modeled the same way as if there was a dependency we couldn't build a BMI for for some other reason.

================
Comment at: clang-tools-extra/clangd/ModulesManager.h:204
+/// ModulesManager to decide whether or not it is related to modules.
+class ModulesManager {
+public:
----------------
Generally "ModulesManager" is worryingly vague as a set of responsibilities, and this class looks like it does too much: provides threadsafe access to the module graph, performs actual scanning and BMI-building logic, and also acts as the scheduler.

I think we'll need to separate these concerns out into separate APIs, but it's probably best if we think about this later after understanding how to address more concrete questions.

================
Comment at: clang-tools-extra/clangd/TUScheduler.cpp:867
+    if (auto *ModuleMgr = CDB.getModulesManager();
+        ModuleMgr && !ModuleMgr->IsReadyToCompile(FileName) &&
+        !ModuleMgr->HasInvalidDependencies(FileName)) {
----------------
This call to IsReadyToCompile seems to have a logic race:
 - we've previously queried the module manager as part of preparing the compile command
 - now we're querying it again based on the filename
 - what happens if the graph has changed significantly in the meantime?

================
Comment at: clang-tools-extra/clangd/TUScheduler.cpp:871
+
+      ModuleMgr->addCallbackAfterReady(FileName,
+                                       [this](bool) { notifyModulesBuilt(); });
----------------
making ModulesCV a member etc, exposing waitForModulesBuilt etc suggests we're going to wait on this for multiple threads. But I don't see that here.

If this is merely local, can we have:
```
Notification N;
addCallbackAfterReady([] { N.Notify(); });
N.wait();
```
or even just give ModuleMgr a blocking API?

================
Comment at: clang-tools-extra/clangd/TUScheduler.cpp:874
+      ModuleMgr->GenerateModuleInterfacesFor(FileName);
+      waitForModulesBuilt();
+    }
----------------
we need to be able to handle shutdown while waiting for modules

================
Comment at: clang-tools-extra/clangd/TUScheduler.cpp:874
+      ModuleMgr->GenerateModuleInterfacesFor(FileName);
+      waitForModulesBuilt();
+    }
----------------
sammccall wrote:
> we need to be able to handle shutdown while waiting for modules
here the ASTWorker is apparently blocking on all the modules being up-to-date (though the impl doesn't seem to actually achieve up-to-date-ness, I assume that's the intent).

This creates a big performance cliff where modifying a low-level header causes all features to stop working until everything that transitively depends on it is rebuilt. We used to have this cliff with preambles, though extensive effort we've eliminated it through the use of a separate preamble thread, tolerance for stale preambles, preamble patching etc.

I think the fix is roughly:
 - we assume imports are in the preamble (I think the language guarantees this idea works, though clang's actual preamble implementation may or may not support this right now)
 - the preambleworker should be blocking on modules being ready, not the astworker. This (usually) takes it off the critical interactive path
 - typically the preamble will now be a fairly small thing that's mostly references to PCMs
 - we need to ensure the PCMs survive (and aren't overwritten by newer versions) for as long as the preamble is alive and used. Some sort of owner class stored in PreambleData can achieve this.

(one day it may be possible to eliminate the use of preamble altogether in favor of some modules-based solution, with header modules inferred for non-modularized code. But obviously we can't do that until this is non-experimental, so the preamble is the best place for modules to live for now)

================
Comment at: clang-tools-extra/clangd/test/modules.test:1
+# A smoke test to check the modules can work basically.
+#
----------------
A smoke lit test is great, it's not realistic to achieve good test coverage of features this way though (and generally we don't try).

We'll need to work out how to get TestTU-based tests to work with modular builds.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D153114/new/

https://reviews.llvm.org/D153114