[PATCH] D106064: [clang][deps] Normalize paths in minimizing file system

Mon Jul 19 11:50:08 PDT 2021

dexonsmith accepted this revision.
dexonsmith added a comment.
This revision is now accepted and ready to land.

In D106064#2887753 <https://reviews.llvm.org/D106064#2887753>, @jansvoboda11 wrote:

> With the call to `llvm::sys::path::native` scoped only to `IgnoredFiles`, would this patch LGTY?

Yes, this LGTM once you update that.

================
Comment at: clang/lib/Tooling/DependencyScanning/DependencyScanningFilesystem.cpp:171-172

   bool KeepOriginalSource = IgnoredFiles.count(Filename) ||
                             !shouldMinimize(Filename);
   DependencyScanningFilesystemSharedCache::SharedFileSystemEntry
----------------
jansvoboda11 wrote:
> dexonsmith wrote:
> > jansvoboda11 wrote:
> > > dexonsmith wrote:
> > > > Looking at this, makes me wonder if this is just fixing a specific instance of a more general problem.
> > > > 
> > > > Maybe `IgnoredFiles` should be a set of `FileEntry`s instead of `StringRef`s... but that'd create a different performance bottleneck when the set is big, since creating the FileEntrys would be expensive. We'd want the FileEntry lookup to be globally cached / etc. -- and FileManager isn't quite safe to use globally.
> > > > 
> > > > Do you think IgnoredFiles as-is will work well enough for where it'll be used for PCH? Or do we need to catch headers referenced in two different ways somehow?
> > > I think we could use `llvm::sys::fs::UniqueID` instead of the filename to refer to files. Since the VFS layer resolves symlinks when stat-ing a file, that should be a canonical file identifier. I can tackle that in a follow up patch.
> > Yup, a unique ID should work for a file identifier.
> > 
> > I'm concerned about the cost of looking up the unique ID — avoiding stat traffic was measured to be an important performance benefit in the dependency scanner model.
> > 
> > To avoid a perf regression, I think you could use caches like:
> > - ids: filename -> unique-id
> > - originals: unique-id -> original file content
> > - minimized: unique-id -> minimized file content
> > 
> > Where "ids" and "originals" are read/cached in lock-step when accessing a filename, additionally computing "minimized" if not in the ignore-list. (Adding a file to the ignore-list would put content in "ids" and "originals".)
> > 
> > The goal is to amortize the `stat` cost across the lifetime of the service while ensuring a consistent view of the file content.
> > 
> > WDYT?
> > 
> > ... regardless I think all of this is out of scope for the current patch, which is still useful for unblocking adding tests to the subsequent patches in the stack.
> Yes, this is the cache structure I had in mind.
> 
> I agree that this should be tackled in a follow-up patch. I'm going to create a patch with xfailing test case that demonstrates how one file with two different names (e.g. symlink) can cause issues with the current approach.
Might be nice to include that `XFAIL`'ed test in this patch, as well as a FIXME in the code, documenting the general problem. But if you'd rather land that separately/after it's fine with me.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D106064/new/

https://reviews.llvm.org/D106064