[PATCH] D112915: WIP: [clang][modules] Granular tracking of includes

Duncan P. N. Exon Smith via Phabricator via cfe-commits cfe-commits at lists.llvm.org
Wed Nov 3 13:37:18 PDT 2021


dexonsmith added a comment.

Just a few comments on implementation details. The only high-level piece to call out is that I wonder if `NumIncludes` could/should be simplified (semantically) to a Boolean in a prep commit.



================
Comment at: clang/include/clang/Lex/Preprocessor.h:720-723
+  /// Efficient map with FileEntry keys.
+  template <typename T> class FileEntryMap {
+    /// The underlying storage. Entries are indexed by FileEntry UID.
+    std::vector<T> UIDToValue;
----------------
I'm not sure about this choice. UIDs are unlikely to be adjacent and in SubmoduleIncludeState, since a given submodule is unlikely to have "most" files. Also, memory usage characteristics could be "weird" if the FileManager is being reused between different compilations.

`DenseMap<unsigned, unsigned>` will have the same number of allocations (i.e., just 1). If UIDs really are dense and near each other in one of the submodules then that map will be a little bigger than necessary (~2x), but it should be better for the rest of them.


================
Comment at: clang/lib/Lex/Preprocessor.cpp:1323
+        auto &ModNumIncludes = SubmoduleIncludeStates[M].NumIncludes;
+        for (unsigned UID = 0; UID <= ModNumIncludes.maxUID(); ++UID) {
+          CurSubmoduleIncludeState->NumIncludes[UID] += ModNumIncludes[UID];
----------------
jansvoboda11 wrote:
> Iterating over all FileEntries is probably not very efficient, as Volodymyr mentioned. Thinking about how to make this more efficient...
My suggestion above to drop FileEntryMap in favour of a simple DenseMap would help a bit, just iterating through the files actually included by the submodules.

Further, I wonder if "num-includes"/file/submodule (`unsigned`) is actually useful, vs. "was-included"/file/submodule (`bool`). The only observer I see is `HeaderSearch::PrintStats()` but maybe I missed something?

If I'm right and we can switch to `bool`, then NumIncludes becomes a `DenseSet<FileEntry *> IncludedFiles` (or `DenseSet<unsigned>` for UIDs). (BTW, I also wondered if you could rotate the map, using File as the outer key, and then use bitsets for the sbumodules, but I doubt this is better, since most files are only included by a few submodules, right?)

Then you can just do a set union here. Also simplifies bitcode serialization.

(If a `bool`/set is sufficient, then I'd suggest landing that first/separately, before adding the per-submodule granularity in this patch.)


================
Comment at: clang/lib/Serialization/ASTWriter.cpp:2392-2393
+
+  // TODO: Use plain vector, submodule IDs are small and dense.
+  std::map<unsigned, std::map<StringRef, unsigned>> ToBeSerialized;
+
----------------
A vector of maps would be an improvement, but that'll still be a lot of allocations.
- Since insertion/lookup/deletion aren't intermingled, the simplest way to avoid adding unnecessary overhead is a sorted vector (https://llvm.org/docs/ProgrammersManual.html#dss-sortedvectormap).
- With no lookups (at all), there's no benefit to a tiered data structure (vs flat).

Leading me toward a simple flat vector + sort.
```
lang=c++
struct IncludeToSerialize { // Probably more straightforward than a std::tuple...
  StringRef Filename;
  unsigned SMID;
  unsigned NumIncludes;

  bool operator<(const IncludeToSerialize &RHS) const {
    if (SMID != RHS.SMID)
      return SMID < RHS.SMID;
    int Diff = Filename.compare(RHS.Filename);
    assert(Diff && "Expected unique SMID / Filename pairs");
    return Diff < 0;
  }
};
SmallVector<IncludeToSerialize> IncludesToSerialize;
// ...
    IncludesToSerialize.push_back({Filename, LocalSMID, NumIncludes});
// ...
llvm::sort(IncludesToSerialize);
for (const IncludeToSerialize &SI : IncludesToSerialize) {
  // emit record
}
```
(Or if there are duplicates expected to be encountered and ignored, you can remove the assertion and use stable_sort + unique + erase.)


================
Comment at: clang/lib/Serialization/ASTWriter.cpp:2418
+                                         NumIncludes};
+      Stream.EmitRecordWithBlob(IncludesAbbrev, Record, Filename);
+    }
----------------
I wonder, will the `Filename` already be serialized elsewhere? Could an ID from that be reused here, rather than writing the filename again? (Maybe that'd need a bigger refactor of some sort to create a filename table?)

Stepping back, it looks like this is always eagerly loaded.
- Could it be lazily-loaded by submodule?
- Could it be lazily-loaded by filename?

In the former case, seems like a single record per submodule makes sense, with a single blob that can be decoded on-demand.

In the latter case, maybe it should be rotated, and stored a single record per filename as a blob that can be lazily decoded:
```
<filename-size> <filename> <num-submodules> (<smid> <num-includes>)+
```


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D112915/new/

https://reviews.llvm.org/D112915



More information about the cfe-commits mailing list