[PATCH] D51314: Parse compile commands lazily in InterpolatingCompilationDatabase

Tue Aug 28 01:36:22 PDT 2018

sammccall added a comment.

Thanks for finding this problem, this fix *mostly* looks good (though I think we can probably drop memoization).

I'm a bit uncomfortable about the places where we need the type, because this is the only thing forcing us to parse before we've picked a command to transfer, and the number of commands we need to parse is data-dependent and hard to reason about.

Let me think about this a little - I suspect slightly more invasive changes (change the concept of type, tweak the heuristics, or do a "lightweight parse" to get the type) might make this cleaner and performance more predictable.

================
Comment at: lib/Tooling/InterpolatingCompilationDatabase.cpp:440
       Result.emplace_back(std::move(Command));
-      if (Result.back().Type == types::TY_INVALID)
-        Result.pop_back();
----------------
ilya-biryukov wrote:
> We can't look at 'Type' at this point anymore, because it needs parsing of TranserableCommands. Not sure what's the best way to replace it. @sammccall, any ideas?
> 
So filtering out this has a couple of effects
 - it's a performance optimization (don't bother indexing filenames for useless files). We don't need this
 - it prevents a TY_INVALID command being chosen for transfer. I'm not sure whether this would occur often enough to be a problem in practice.

================
Comment at: lib/Tooling/InterpolatingCompilationDatabase.cpp:133
+    assert(TraitsComputed && "calling transferTo on moved-from object");
+    const CommandTraits &T = getTraits();
+    CompileCommand Result = T.Cmd;
----------------
I think you're overthinking things with the memoization here (of course I say this as the person who underthought it!)

AIUI, the problem is that *eagerly* parsing all the compile commands takes 3x as long as reading them, which hurts startup time with big `compile_commands.json`.

But I think we can afford to just parse them when `transferTo` is called, without memoization. (Remember we only hit this code path when querying a file *not in the CDB*, so it should never get called in a tight loop).

The benefit of slightly reducing the latency of`getCompileCommand` for unknown files when we happen to pick the same template file for the second time... it's unclear to me, and the code would be easier to follow without it.

================
Comment at: lib/Tooling/InterpolatingCompilationDatabase.cpp:159
+  llvm::StringRef filename() const {
+    assert(TraitsComputed && "calling filename on moved-from object");
+    return OriginalCmd.Filename;
----------------
Is this so important to dynamically check? Most types don't.

================
Comment at: lib/Tooling/InterpolatingCompilationDatabase.cpp:171
+  // compute, so we only compute on first access.
+  struct CommandTraits {
+    // Flags that should not apply to all files are stripped from CommandLine.
----------------
Traits is a bit vague, and a bit template-nightmare-y!
maybe ParsedCommand?

================
Comment at: lib/Tooling/InterpolatingCompilationDatabase.cpp:383
       S.Preferred = PreferredLanguage == types::TY_INVALID ||
-                    PreferredLanguage == Commands[S.Index].Type;
+                    PreferredLanguage == Commands[S.Index].type();
       S.Points = Candidate.second;
----------------
I think this is going to force parsing of all candidates that get any points at all, with a flat directory structure this could be quite a lot :-(

================
Comment at: lib/Tooling/InterpolatingCompilationDatabase.cpp:383
       S.Preferred = PreferredLanguage == types::TY_INVALID ||
-                    PreferredLanguage == Commands[S.Index].Type;
+                    PreferredLanguage == Commands[S.Index].type();
       S.Points = Candidate.second;
----------------
sammccall wrote:
> I think this is going to force parsing of all candidates that get any points at all, with a flat directory structure this could be quite a lot :-(
ah, now I see that the memoization also allows us to pretend that this is an eagerly computed value, without considering exactly when it's computed.

I'm not sure I like this if we do consider it performance sensitive - it obfuscates exactly which set of commands we parse, it would be nice if we were upfront about this.

Repository:
  rC Clang

https://reviews.llvm.org/D51314