[PATCH] D114966: [clang][deps] Split stat and file content caches

Tue Dec 21 07:55:54 PST 2021

jansvoboda11 added inline comments.

================
Comment at: clang/include/clang/Tooling/DependencyScanning/DependencyScanningFilesystem.h:359-360
+  findSharedEntryByUID(llvm::vfs::Status Stat) const {
+    return SharedCache.getShard(Stat.getName())
+        .findEntryByUID(Stat.getUniqueID());
+  }
----------------
dexonsmith wrote:
> This doesn't look right to me. UIDs should be sharded independently of the filename they happen to have been reached by; otherwise each filename shard is developing its own idea of what each UID means. Since UID distribution is not uniform, probably the UID shard should be chosen by `hash_value(Stat.getUniqueID()) % NumShards`.
> 
> You could use the same sets of shards for UIDMap and FilenameMap, but since they're independent I'd probably do:
> - UIDCache: sharded by UID: UIDMap and BumpPtrAllocator for entries (and likely anything else tied to content)
> - FilenameCache: sharded by filename: FilenameMap (and perhaps other things tied to filename?)
Hmm, skimming through `RealFileSystem::status`, I saw that it's calling `sys::fs::status` with "follow symlinks" enabled. It made sense to me that the name stored in `llvm::vfs::Status` would match that and refer the fully resolved target entry, not the symlink itself. Seeing as this is not the case, I agree the UID itself should be used for choosing the shard.

================
Comment at: clang/lib/Tooling/DependencyScanning/DependencyScanningFilesystem.cpp:23-24
+  auto MaybeFile = getUnderlyingFS().openFileForRead(Filename);
   if (!MaybeFile)
     return MaybeFile.getError();
   auto File = std::move(*MaybeFile);
----------------
dexonsmith wrote:
> In what circumstances should this return a cached-error TentativeEntry? Any?
I don't think the distinction matters at this level. Whether failures should be cached is a decision that's being made one level up.

I personally prefer having `TentativeEntry` to be "non-fallible" and explicitly wrapping the whole thing in `ErrorOr`. That makes it easier for others to know what they are working with (i.e. this object cannot represent an error state). Eventually, I think this would make sense for the caches and `CachedFileSystemEntry` too.

================
Comment at: clang/lib/Tooling/DependencyScanning/DependencyScanningFilesystem.cpp:28-29
   auto MaybeStat = File->status();
   if (!MaybeStat)
     return MaybeStat.getError();
   auto Stat = std::move(*MaybeStat);
----------------
dexonsmith wrote:
> Since the file was opened, should we return cached-error TentativeEntry here, rather than an error?
Let's return an error here and create error `CachedFileSystemEntry` one level up.

================
Comment at: clang/lib/Tooling/DependencyScanning/DependencyScanningFilesystem.cpp:275
+  const auto &Entry1 = getOrEmplaceSharedEntryForUID(std::move(TEntry));
+  const auto &Entry2 = getOrInsertSharedEntryForFilename(Filename, Entry1);
+  return insertLocalEntryForFilename(Filename, Entry2);
----------------
dexonsmith wrote:
> jansvoboda11 wrote:
> > I'm not sure these should be separate. We could end up in situation where the Filename map contains different entry than the UID map for the same directory entry. I'm tempted to merge these functions into one and perform the updates in a single critical section...
> > I'm not sure these should be separate. We could end up in situation where the Filename map contains different entry than the UID map for the same directory entry.
> 
> I'm also sure precisely what you mean by "for the same directory entry" in this context; and I don't see what's wrong with the situation I think you're outlining.
> 
> > I'm tempted to merge these functions into one and perform the updates in a single critical section...
> 
> A single critical section for setting UID and filename at the same time would be hard to get right (and efficient), since UIDs have aliases through other filenames due to different directory paths (dir/../x.h vs x.h) and filesystem links (hard and symbolic).
> 
> Here's the race that I think(?) you're worried about:
> 
> - Worker1 does a tentative stat of "x.h", finds a UID that isn't mapped (UIDX1, but it's ignored...).
> - Worker2 does a tentative stat of "x.h", finds a UID that isn't mapped (UIDX1, but it's ignored...).
> - Worker1 opens "x.h", finds ContentX1+StatX1 (with UIDX1), saves mapping UIDX1 -> ContentX1+StatX1.
> - "x.h" changes.
> - Worker2 opens "x.h", finds ContentX2+StatX2 (with UIDX2), saves mapping UIDX2 -> ContentX2+StatX2.
> - Worker2 saves mapping "x.h" -> ContentX2+StatX2.
> - Both workers move forward with "x.h" -> ContentX2+StatX2.
> 
> IIUC, you're concerned that the mapping UIDX1 -> ContentX1+StatX1 was saved. The side effect is that if a future tentative stat of (e.g.) "y.h" returns UIDX1, then "y.h" will be mapped to ContentX1+StatX1. Is this what concerns you? Why? (Is it something else?)
> 
> The concern I have is that some filesystems recycle UIDs (maybe "x.h" *was* a symbolic link to "y.h" and then became its own file... or maybe "x.h" and "y.h" were hard links... or maybe "y.h" is just a new file!). But that's a problem with using UIDs to detect equivalent filesystem links / content in general. I don't see any reason to be more concerned here than elsewhere, and to avoid depending on UID we'd need a pretty different design (e.g., lazily detect and model directory structure and symbolic links).
Yes, that's the kind of scenario I was thinking about. I'm not concerned about consequences of that side effect, I just don't like storing garbage that will most likely never be used/referenced again and might be confusing during debugging.

I agree with you on UID recycling...

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D114966/new/

https://reviews.llvm.org/D114966