[PATCH] D114966: [clang][deps] Split stat and file content caches

Mon Dec 20 10:31:42 PST 2021

dexonsmith requested changes to this revision.
dexonsmith added a comment.
This revision now requires changes to proceed.

I'm liking the new direction here; requesting changes since it looks like the filename is being used to pick the shard for UIDMap, which will lead to multiple opinions of what each UID means.

================
Comment at: clang/include/clang/Tooling/DependencyScanning/DependencyScanningFilesystem.h:332
+
+    TentativeEntry(llvm::vfs::Status Status,
+                   std::unique_ptr<llvm::MemoryBuffer> Contents)
----------------
I'd use `const&` to avoid copying the string on the way in... see below.

================
Comment at: clang/include/clang/Tooling/DependencyScanning/DependencyScanningFilesystem.h:334
+                   std::unique_ptr<llvm::MemoryBuffer> Contents)
+        : Status(std::move(Status)), Contents(std::move(Contents)) {}
+  };
----------------
I think, rather than move/copy the status name, the name should be wiped out to ensure no one relies on it. Every access should use `copyWithNewName()` since this is shared across all things that point to the same UID... so let's use `copyWithNewName()` here to drop the ignored name.

================
Comment at: clang/include/clang/Tooling/DependencyScanning/DependencyScanningFilesystem.h:359-360
+  findSharedEntryByUID(llvm::vfs::Status Stat) const {
+    return SharedCache.getShard(Stat.getName())
+        .findEntryByUID(Stat.getUniqueID());
+  }
----------------
This doesn't look right to me. UIDs should be sharded independently of the filename they happen to have been reached by; otherwise each filename shard is developing its own idea of what each UID means. Since UID distribution is not uniform, probably the UID shard should be chosen by `hash_value(Stat.getUniqueID()) % NumShards`.

You could use the same sets of shards for UIDMap and FilenameMap, but since they're independent I'd probably do:
- UIDCache: sharded by UID: UIDMap and BumpPtrAllocator for entries (and likely anything else tied to content)
- FilenameCache: sharded by filename: FilenameMap (and perhaps other things tied to filename?)

================
Comment at: clang/lib/Tooling/DependencyScanning/DependencyScanningFilesystem.cpp:23-24
+  auto MaybeFile = getUnderlyingFS().openFileForRead(Filename);
   if (!MaybeFile)
     return MaybeFile.getError();
   auto File = std::move(*MaybeFile);
----------------
In what circumstances should this return a cached-error TentativeEntry? Any?

================
Comment at: clang/lib/Tooling/DependencyScanning/DependencyScanningFilesystem.cpp:28-29
   auto MaybeStat = File->status();
   if (!MaybeStat)
     return MaybeStat.getError();
   auto Stat = std::move(*MaybeStat);
----------------
Since the file was opened, should we return cached-error TentativeEntry here, rather than an error?

================
Comment at: clang/lib/Tooling/DependencyScanning/DependencyScanningFilesystem.cpp:33-34
   auto MaybeBuffer = File->getBuffer(Stat.getName());
   if (!MaybeBuffer)
     return MaybeBuffer.getError();
   auto Buffer = std::move(*MaybeBuffer);
----------------
After a successful stat on the same file descriptor, it definitely feels like this is an error that should be cached, and a TentativeEntry that is in an error state should be returned.

================
Comment at: clang/lib/Tooling/DependencyScanning/DependencyScanningFilesystem.cpp:275
+  const auto &Entry1 = getOrEmplaceSharedEntryForUID(std::move(TEntry));
+  const auto &Entry2 = getOrInsertSharedEntryForFilename(Filename, Entry1);
+  return insertLocalEntryForFilename(Filename, Entry2);
----------------
jansvoboda11 wrote:
> I'm not sure these should be separate. We could end up in situation where the Filename map contains different entry than the UID map for the same directory entry. I'm tempted to merge these functions into one and perform the updates in a single critical section...
> I'm not sure these should be separate. We could end up in situation where the Filename map contains different entry than the UID map for the same directory entry.

I'm also sure precisely what you mean by "for the same directory entry" in this context; and I don't see what's wrong with the situation I think you're outlining.

> I'm tempted to merge these functions into one and perform the updates in a single critical section...

A single critical section for setting UID and filename at the same time would be hard to get right (and efficient), since UIDs have aliases through other filenames due to different directory paths (dir/../x.h vs x.h) and filesystem links (hard and symbolic).

Here's the race that I think(?) you're worried about:

- Worker1 does a tentative stat of "x.h", finds a UID that isn't mapped (UIDX1, but it's ignored...).
- Worker2 does a tentative stat of "x.h", finds a UID that isn't mapped (UIDX1, but it's ignored...).
- Worker1 opens "x.h", finds ContentX1+StatX1 (with UIDX1), saves mapping UIDX1 -> ContentX1+StatX1.
- "x.h" changes.
- Worker2 opens "x.h", finds ContentX2+StatX2 (with UIDX2), saves mapping UIDX2 -> ContentX2+StatX2.
- Worker2 saves mapping "x.h" -> ContentX2+StatX2.
- Both workers move forward with "x.h" -> ContentX2+StatX2.

IIUC, you're concerned that the mapping UIDX1 -> ContentX1+StatX1 was saved. The side effect is that if a future tentative stat of (e.g.) "y.h" returns UIDX1, then "y.h" will be mapped to ContentX1+StatX1. Is this what concerns you? Why? (Is it something else?)

The concern I have is that some filesystems recycle UIDs (maybe "x.h" *was* a symbolic link to "y.h" and then became its own file... or maybe "x.h" and "y.h" were hard links... or maybe "y.h" is just a new file!). But that's a problem with using UIDs to detect equivalent filesystem links / content in general. I don't see any reason to be more concerned here than elsewhere, and to avoid depending on UID we'd need a pretty different design (e.g., lazily detect and model directory structure and symbolic links).

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D114966/new/

https://reviews.llvm.org/D114966