[PATCH] D63907: [clang-scan-deps] Implementation of dependency scanner over minimized sources

Thu Aug 1 09:26:46 PDT 2019

aganea added inline comments.

================
Comment at: clang/lib/Tooling/DependencyScanning/DependencyScanningFilesystem.cpp:84
+    DependencyScanningFilesystemSharedCache() {
+  NumShards = 8;
+  CacheShards = llvm::make_unique<CacheShard[]>(NumShards);
----------------
arphaman wrote:
> aganea wrote:
> > This line needs a comment. Is this value based on empirical results across different hardwares? (I can't recall your answer at the LLVM conf) I am wondering what would be the good strategy here? The more cores/HT in your PC, the more chances that you'll hit the same shard, thus locking. But the bigger we make this value, the more `StringMaps` we will have, and more cache misses possibly.
> > Should it be something like `llvm::hardware_concurrency() / some_fudge`? It'd be interesting to subsequently profile on high core count machines (or maybe you have done that).
> I rewrote it to use a heuristic we settled on after doing empirical testing on an 18 core / 36 thread machine ( max(2, thread_count / 4) ), and added a comment. This was the number `9` (36 / 4) I mentioned at the talk, so we only got to it because of a heuristic. I think now after some SDK/OS updates the effect from adding more shards is less pronounced, so it mostly flatlines past some number between 5-10. A better heuristic would probably be OS specific to take the cost of lock contention into account.
> 
> Note that the increased number of shards does not increase the number of cache misses, because the shard # is determined by the filename (we don't actually have global cache misses, as the cache is designed to have only one miss per file when it's first accessed)! It's just that an increased number of shards doesn't improve performance after hitting some specific limit, so we want to find a point where we can taper it off.
> 
> It would still be definitely interesting to profile it on other high core machines with different OSes to see if it's a reasonable heuristic for those scenarios too.
I'll give it a try on Windows 10, one project here has a large codebase and some wild machines to test on.

================
Comment at: clang/lib/Tooling/DependencyScanning/DependencyScanningFilesystem.cpp:156
+/// this subclass.
+class MinimizedVFSFile final : public llvm::vfs::File {
+public:
----------------
This makes only a short-lived objects, is that right? Just during the call to `CachedFileSystemEntry::createFileEntry`?

================
Comment at: clang/lib/Tooling/DependencyScanning/DependencyScanningWorker.cpp:94
+      // Add any filenames that were explicity passed in the build settings and
+      // that might be might be opened, as we want to ensure we don't run source
+      // minimization on them.
----------------
"might be" twice.

================
Comment at: clang/lib/Tooling/DependencyScanning/DependencyScanningWorker.cpp:148
+  RealFS = new ProxyFileSystemWithoutChdir(llvm::vfs::getRealFileSystem());
+  if (Service.getMode() == ScanningMode::MinimizedSourcePreprocessing)
+    DepFS = new DependencyScanningFilesystem(Service.getSharedCache(), RealFS);
----------------
arphaman wrote:
> aganea wrote:
> > Can we not use caching all the time?
> We want to have a mode where it's as close to the regular `clang -E` invocation as possible for correctness CI and testing. We also haven't evaluated if the cost of keeping non-minimized sources in memory, so it might be too expensive for practical use? I can add a third option that caches but doesn't minimize though as a follow-up patch though 
> 
Yes that would be nice. As for the size taken in RAM, it would be only .H files, right? For Clang+LLVM+LLD I'm counting about 40 MB. But indeed with a large project, that would be about 1.5 GB of .H. Although I doubt all these files will be loaded at once in memory (I'll check)

As for the usage: Fastbuild works like distcc (plain mode, not pump) so we were also planning on extracting the fully preprocessed output, not only the dependencies. There's one use-case where we need to preprocess locally, then send the preprocessed output remotely for compilation. And another use-case where we only want to extract the dependency list, compute a digest, to retrieve the OBJ from a network cache.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D63907/new/

https://reviews.llvm.org/D63907