[PATCH] D112915: [clang][modules] Track number of includes per submodule

Tue Nov 9 01:53:21 PST 2021

jansvoboda11 marked 2 inline comments as done.
jansvoboda11 added a comment.

Thanks for your feedback @dexonsmith. I reworked the patch to use more sensible data structures as suggested, and made the AST deserialization lazy (on (sub)module import).

I think the only thing to figure out is the exact structure of the serialized information - whether we're fine with serializing all transitively included files in each AST file, or whether we'd like to fetch that information from other AST files instead.

I removed the WIP tag and would be happy to gather more feedback.

================
Comment at: clang/include/clang/Lex/Preprocessor.h:720-723
+  /// Efficient map with FileEntry keys.
+  template <typename T> class FileEntryMap {
+    /// The underlying storage. Entries are indexed by FileEntry UID.
+    std::vector<T> UIDToValue;
----------------
dexonsmith wrote:
> I'm not sure about this choice. UIDs are unlikely to be adjacent and in SubmoduleIncludeState, since a given submodule is unlikely to have "most" files. Also, memory usage characteristics could be "weird" if the FileManager is being reused between different compilations.
> 
> `DenseMap<unsigned, unsigned>` will have the same number of allocations (i.e., just 1). If UIDs really are dense and near each other in one of the submodules then that map will be a little bigger than necessary (~2x), but it should be better for the rest of them.
You're right that UIDs of files included in a single submodule are unlikely to have "most" files. Thanks for pointing that out.

In the latest revision, I switched to `llvm::DenseMap<const FileEntry *, unsigned>`.

================
Comment at: clang/lib/Lex/Preprocessor.cpp:1323
+        auto &ModNumIncludes = SubmoduleIncludeStates[M].NumIncludes;
+        for (unsigned UID = 0; UID <= ModNumIncludes.maxUID(); ++UID) {
+          CurSubmoduleIncludeState->NumIncludes[UID] += ModNumIncludes[UID];
----------------
dexonsmith wrote:
> jansvoboda11 wrote:
> > Iterating over all FileEntries is probably not very efficient, as Volodymyr mentioned. Thinking about how to make this more efficient...
> My suggestion above to drop FileEntryMap in favour of a simple DenseMap would help a bit, just iterating through the files actually included by the submodules.
> 
> Further, I wonder if "num-includes"/file/submodule (`unsigned`) is actually useful, vs. "was-included"/file/submodule (`bool`). The only observer I see is `HeaderSearch::PrintStats()` but maybe I missed something?
> 
> If I'm right and we can switch to `bool`, then NumIncludes becomes a `DenseSet<FileEntry *> IncludedFiles` (or `DenseSet<unsigned>` for UIDs). (BTW, I also wondered if you could rotate the map, using File as the outer key, and then use bitsets for the sbumodules, but I doubt this is better, since most files are only included by a few submodules, right?)
> 
> Then you can just do a set union here. Also simplifies bitcode serialization.
> 
> (If a `bool`/set is sufficient, then I'd suggest landing that first/separately, before adding the per-submodule granularity in this patch.)
For each file, we need to have three distinct states: not included at all, included exactly once (`firstTimeLexingFile`), included more than once. This means we can't use a single `DenseSet`.

But we could use a `DenseMap<Key, bool>`, where "not included at all" can be expressed as being absent from the map, exactly once as having `true` in the map and more than once as having `false` in the map. Alternatively, we could use two `DenseSet` instances to encode the same, but I don't think having two lookups per file to determine stuff is efficient.

I can look into this in a follow-up patch.

================
Comment at: clang/lib/Serialization/ASTWriter.cpp:2392-2393
+
+  // TODO: Use plain vector, submodule IDs are small and dense.
+  std::map<unsigned, std::map<StringRef, unsigned>> ToBeSerialized;
+
----------------
dexonsmith wrote:
> A vector of maps would be an improvement, but that'll still be a lot of allocations.
> - Since insertion/lookup/deletion aren't intermingled, the simplest way to avoid adding unnecessary overhead is a sorted vector (https://llvm.org/docs/ProgrammersManual.html#dss-sortedvectormap).
> - With no lookups (at all), there's no benefit to a tiered data structure (vs flat).
> 
> Leading me toward a simple flat vector + sort.
> ```
> lang=c++
> struct IncludeToSerialize { // Probably more straightforward than a std::tuple...
>   StringRef Filename;
>   unsigned SMID;
>   unsigned NumIncludes;
> 
>   bool operator<(const IncludeToSerialize &RHS) const {
>     if (SMID != RHS.SMID)
>       return SMID < RHS.SMID;
>     int Diff = Filename.compare(RHS.Filename);
>     assert(Diff && "Expected unique SMID / Filename pairs");
>     return Diff < 0;
>   }
> };
> SmallVector<IncludeToSerialize> IncludesToSerialize;
> // ...
>     IncludesToSerialize.push_back({Filename, LocalSMID, NumIncludes});
> // ...
> llvm::sort(IncludesToSerialize);
> for (const IncludeToSerialize &SI : IncludesToSerialize) {
>   // emit record
> }
> ```
> (Or if there are duplicates expected to be encountered and ignored, you can remove the assertion and use stable_sort + unique + erase.)
In the latest revision, I ended up sorting just based on `Filename`, since this is now explicitly stored per submodule.

================
Comment at: clang/lib/Serialization/ASTWriter.cpp:2418
+                                         NumIncludes};
+      Stream.EmitRecordWithBlob(IncludesAbbrev, Record, Filename);
+    }
----------------
dexonsmith wrote:
> I wonder, will the `Filename` already be serialized elsewhere? Could an ID from that be reused here, rather than writing the filename again? (Maybe that'd need a bigger refactor of some sort to create a filename table?)
> 
> Stepping back, it looks like this is always eagerly loaded.
> - Could it be lazily-loaded by submodule?
> - Could it be lazily-loaded by filename?
> 
> In the former case, seems like a single record per submodule makes sense, with a single blob that can be decoded on-demand.
> 
> In the latter case, maybe it should be rotated, and stored a single record per filename as a blob that can be lazily decoded:
> ```
> <filename-size> <filename> <num-submodules> (<smid> <num-includes>)+
> ```
We already serialize `Filename` elsewhere, but only for local input files. Here we need transitive closure of all included input files.

I'm still unsure whether it's fine to store transitively included files here or if we should look that information up in the respective AST files.

The current solution looks like it will bloat sizes of the AST files, but I think the transitive closure is already being stored for `HeaderFileInfo` anyways, so it shouldn't be that big of a deal?

================
Comment at: clang/lib/Serialization/ASTWriter.cpp:2418
+                                         NumIncludes};
+      Stream.EmitRecordWithBlob(IncludesAbbrev, Record, Filename);
+    }
----------------
jansvoboda11 wrote:
> dexonsmith wrote:
> > I wonder, will the `Filename` already be serialized elsewhere? Could an ID from that be reused here, rather than writing the filename again? (Maybe that'd need a bigger refactor of some sort to create a filename table?)
> > 
> > Stepping back, it looks like this is always eagerly loaded.
> > - Could it be lazily-loaded by submodule?
> > - Could it be lazily-loaded by filename?
> > 
> > In the former case, seems like a single record per submodule makes sense, with a single blob that can be decoded on-demand.
> > 
> > In the latter case, maybe it should be rotated, and stored a single record per filename as a blob that can be lazily decoded:
> > ```
> > <filename-size> <filename> <num-submodules> (<smid> <num-includes>)+
> > ```
> We already serialize `Filename` elsewhere, but only for local input files. Here we need transitive closure of all included input files.
> 
> I'm still unsure whether it's fine to store transitively included files here or if we should look that information up in the respective AST files.
> 
> The current solution looks like it will bloat sizes of the AST files, but I think the transitive closure is already being stored for `HeaderFileInfo` anyways, so it shouldn't be that big of a deal?
Thinking about it some more, the current implementation will be duplicating a lot of filenames between submodules of the same module. We might need to extract that to some common storage that we can refer to with simple integer offsets...

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D112915/new/

https://reviews.llvm.org/D112915