[cfe-dev] [analyzer][tooling] Architectural questions about Clang and ClangTooling

Sam McCall via cfe-dev cfe-dev at lists.llvm.org
Tue Apr 28 15:43:32 PDT 2020


On Tue, Apr 28, 2020 at 11:53 PM David Blaikie <dblaikie at gmail.com> wrote:

> (+Sam, who works on clang tooling)
>
> On Tue, Apr 28, 2020 at 10:38 AM Artem Dergachev <noqnoqneo at gmail.com>
> wrote:
>
>> On 4/28/20 6:23 PM, David Blaikie wrote:
>> > On Tue, Apr 28, 2020 at 3:09 AM Artem Dergachev via cfe-dev
>> > <cfe-dev at lists.llvm.org> wrote:
>> >
>> >     Hey!
>> >
>> >     1. I'm glad that we're finally trying to avoid dumping PCHs on
>> >     disk!
>> >
>> >     2. As far as I understand, the dependency concern is mostly about
>> >     Clang binary size. I don't know for sure, but that's what I had to
>> >     consider when I was adding libASTMatchers into the Clang binary a
>> >     few years ago.
>> >
>> >     3. I strongly disagree that the JSON compilation database is "just
>> >     right for this purpose". I don't mind having explicit, improved
>> >     support for it, but I would definitely prefer not to hardcode it as
>> >     the only possible option. Compilation databases are very limited,
>> >     and we cannot drop projects or entire build systems simply because
>> >     they can't be represented accurately via a compilation database. So
>> >     I believe this is not the right solution for CTU in particular.
>> >     Instead, an external tool like scan-build should guide CTU analysis
>> >     and coordinate the work of different Clang instances, so as to
>> >     abstract Clang away from the build system.
>> >
>> >
>> > What functionality do you picture the scan-build-like tool having that
>> > couldn't be supported if that tool instead built a compilation
>> > database & the CTU/CSA was powered by the database? (that would
>> > separate concerns: build command discovery from execution, and make
>> > the scan-build-like tool more general-purpose, rather than specific only
>> > to the CSA)
>>
>> Here are a few examples (please let me know if I'm unaware of the latest
>> developments in the area of compilation databases!)
>>
>> - Suppose the project uses precompiled headers. In order to analyze a
>> file that includes a PCH, we need to first rebuild the PCH with the
>> clang that's used for analysis, and only then try to analyze the file.
>> This introduces a notion of dependency between compilation database
>> entries; unless entries are ordered in their original compilation order
>> and we're analyzing with -j1, race conditions will inevitably cause us
>> to occasionally fail to find the PCH. I didn't try to figure out what
>> happens when modules are used, but I suspect it's worse.
>
>
> Google certainly uses clang tooling, with a custom compilation database, on
> a build that uses explicit modules - I believe the way that's done is to
> ignore/strip the modules-related flags so that the clang tooling uses a
> non-modules compilation.
>
Yeah. In fact, we put the build system into a mode where it generates
compile commands for a non-modules build; I'm not sure whether it's always
easy to do this given a modular compile command.
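
For reference, a minimal sketch of how a tool can do that stripping with an
ArgumentsAdjuster. The exact set of flags to drop is an assumption here;
real builds may need more (e.g. -fmodule-map-file=):

  // Drop modules-related flags from each compile command before the tool
  // runs it; which flags matter depends on the build.
  #include "clang/Tooling/ArgumentsAdjusters.h"
  #include "clang/Tooling/Tooling.h"

  using namespace clang::tooling;

  static CommandLineArguments
  stripModuleFlags(const CommandLineArguments &Args, llvm::StringRef /*File*/) {
    CommandLineArguments Out;
    for (const std::string &Arg : Args) {
      llvm::StringRef A(Arg);
      if (A == "-fmodules" || A == "-fimplicit-module-maps" ||
          A.startswith("-fmodule-file=") ||
          A.startswith("-fmodules-cache-path="))
        continue;
      Out.push_back(Arg);
    }
    return Out;
  }
  // ... then: Tool.appendArgumentsAdjuster(stripModuleFlags); ...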

I suspect that to make C++20 modular builds work with tools (and we must), we'll
need an extension or peer to compile_commands.json that build systems can
use to define the modules in a project. That will at least give tools what
they need to maintain a module cache.
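
Purely as a speculative illustration - no such file exists today, and every
field name below is made up - such a peer file might mirror
compile_commands.json, with one entry per module:

  [
    { "module": "mylib.core",
      "file": "/src/mylib_core.cppm",
      "directory": "/path/to/build",
      "command": "clang++ -std=c++20 --precompile mylib_core.cppm -o mylib.core.pcm" }
  ]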


> But I could be wrong there. You could do some analysis to see any
> inputs/outputs - or reuse the existing outputs from the original build.
>
The problem with reusing existing outputs (at least serialized ASTs like
PCH or modules) is that there's no stable format: you need to
version-lock your tool to your compiler (which must be clang!) to ensure
compatibility.
I think this is what Artem was alluding to by "with the clang that's used
for analysis".
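
(That's also why the in-memory route is attractive: rebuild the AST with the
very clang that analyzes it, and no serialized artifact ever crosses a
version boundary. A rough sketch, with error handling elided:)

  #include "clang/Frontend/ASTUnit.h"
  #include "clang/Tooling/CompilationDatabase.h"
  #include "clang/Tooling/Tooling.h"

  using namespace clang::tooling;

  // Build every TU's AST in memory with the running clang; nothing
  // version-locked is read from or written to disk.
  std::vector<std::unique_ptr<clang::ASTUnit>>
  buildAllASTs(llvm::StringRef BuildDir) {
    std::string Err;
    auto CDB = CompilationDatabase::loadFromDirectory(BuildDir, Err);
    // (check CDB and report Err in real code)
    ClangTool Tool(*CDB, CDB->getAllFiles());
    std::vector<std::unique_ptr<clang::ASTUnit>> ASTs;
    Tool.buildASTs(ASTs);
    return ASTs;
  }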


>> But if analysis
>> is conducted alongside compilation and the build system waits for the
>> analysis to finish like it waits for compilation to finish before
>> compiling dependent translation units, race conditions are eliminated.
>> This is how scan-build currently works: it substitutes the compiler with
>> a fake compiler that invokes both the original compiler and clang for
>> analysis. Of course, cross-translation-unit analysis won't be conducted
>> in parallel with compilation; it's multi-pass by design. The problem is
>> the same though: it should compile PCH files first, but there's no notion
>> of "compile this first" in an unstructured compilation database.
>>
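
(The interposition itself is simple enough to sketch; the paths below are
placeholders and argument quoting is glossed over:)

  // A fake "compiler": run the real compiler, then the analyzer with the
  // same arguments, so the build's own dependency order serializes both.
  #include <cstdlib>
  #include <string>

  int main(int argc, char **argv) {
    std::string Real = "/usr/bin/cc";         // placeholder real compiler
    std::string Analyzer = "clang --analyze"; // placeholder analyzer invocation
    std::string Args;
    for (int I = 1; I < argc; ++I)
      Args += std::string(" ") + argv[I];
    int RC = std::system((Real + Args).c_str());
    if (RC == 0)
      std::system((Analyzer + Args).c_str());
    return RC; // the build sees the real compiler's result
  }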
FWIW, our internal database plugin (not compile_commands.json) prepares the
inputs for a target when its compile command is queried. This works as long
as your build system doesn't delete files or give them different content
based on what you're building.
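
Shape-wise it looks roughly like the sketch below; prepareInputsForTarget()
is a hypothetical stand-in for the build-system-specific work (running
generators, staging headers, and so on), not our actual plugin:

  #include "clang/Tooling/CompilationDatabase.h"

  using namespace clang::tooling;

  // A database that materializes a target's inputs on demand, before
  // handing out its compile command.
  class LazyCompilationDatabase : public CompilationDatabase {
  public:
    std::vector<CompileCommand>
    getCompileCommands(llvm::StringRef FilePath) const override {
      prepareInputsForTarget(FilePath); // generate headers etc. first
      return Underlying->getCompileCommands(FilePath);
    }

  private:
    void prepareInputsForTarget(llvm::StringRef FilePath) const; // hypothetical
    std::unique_ptr<CompilationDatabase> Underlying;
  };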

>> - Suppose the project builds the same translation unit multiple times,
>> say with different flags, say for different architectures. When we're
>> trying to look up such a file in the compilation database, how do we
>> figure out which instance to take? If we are to ever solve this problem,
>> we have to introduce a notion of a "shipped binary" (an ultimate linking
>> target) in the compilation database and perform cross-translation-unit
>> analysis of one shipped binary at a time.
>>
>
> I believe in that case the compilation database would include both
> compilations of the file - and presumably for the static analyzer, it would
> want to build all of them (or scan-build would have to have some logic for
> filtering them out/deciding which one is the interesting one - same sort of
> thing would have to be done on the compilation database)
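
To make the ambiguity concrete: getCompileCommands() can legitimately return
several entries for one file, and any filter is just a policy choice. A
sketch under the assumption that we pick by an -arch flag:

  #include "clang/Tooling/CompilationDatabase.h"
  #include <algorithm>
  #include <iterator>

  using namespace clang::tooling;

  // All entries for File, keeping only those compiled for Arch. Which
  // policy is "right" is exactly the open question above.
  std::vector<CompileCommand>
  commandsForArch(const CompilationDatabase &CDB, llvm::StringRef File,
                  llvm::StringRef Arch) {
    std::vector<CompileCommand> Picked;
    for (const CompileCommand &CC : CDB.getCompileCommands(File)) {
      auto It = std::find(CC.CommandLine.begin(), CC.CommandLine.end(), "-arch");
      if (It != CC.CommandLine.end() && std::next(It) != CC.CommandLine.end() &&
          *std::next(It) == Arch)
        Picked.push_back(CC);
    }
    return Picked;
  }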
>
>
>> - There is a variety of hacks that people can introduce in their
>> projects if they add arbitrary scripts to their build system. For
>> instance, they can mutate the contents of an autogenerated header in the
>> middle of the build. We can always say "Well, you shouldn't do that",
>> but people will do it anyway. This makes me believe that no purely
>> declarative compilation database format will ever be able to handle such
>> Turing-complete hacks, and that there's no way to integrate analysis into
>> the build perfectly other than by letting the build system guide the
>> analysis.
>>
>
> Yep, a mutating build where you need to observe the state before/after
> such mutations, etc.; not much you could do about it. (& how would CTU SA
> work in that sort of case? You have to run the whole build multiple times?)
>
> Clang tools are essentially static analysis - so it seems weird that we
> have two different approaches to static analysis discovery/lookup in the
> Clang project, but not wholly unacceptable, potentially different goals,
> etc.
>
>
>> I'm also all for separation of concerns and I don't think any of this is
>> specific to our static analysis.
>>
>> >     On 4/28/20 11:31 AM, Endre Fülöp via cfe-dev wrote:
>> >     >
>> >     > Hi!
>> >     >
>> >     > Question:
>> >     >
>> >     > Why is the dependency on ClangTooling ill-advised inside ClangSA
>> >     > (also meaning the Clang binary) itself?
>> >     >
>> >     > Context:
>> >     >
>> >     > Currently I am working on an alternative way to import external TU
>> >     > ASTs during analysis (https://reviews.llvm.org/D75665).
>> >     >
>> >     > In order to produce ASTs, I use a compilation database to extract
>> >     > the necessary flags, and finally use ClangTool::buildAST.
>> >     >
>> >     > I am aware that I have other options for this as well (like
>> >     > manually coding the compdb handling for my specific case for the
>> >     > first step, and maybe even dumping ASTs as PCHs into an in-memory
>> >     > buffer), but still consuming JSONCompilationDatabase
>> >     >
>> >     > is just too convenient. I would not want to introduce another
>> >     > format when a compilation database is just right for this purpose.
>> >     >
>> >     > Elaboration:
>> >     >
>> >     > While I understand that introducing dependencies has its downsides,
>> >     > not being able to reuse code from Tooling is also not ideal.
>> >     >
>> >     > I would very much like to be enlightened by someone more familiar
>> >     > with the architectural decisions already made as to why this is
>> >     > the case, and optionally how I could proceed with my efforts so
>> >     > that I can come up with the most fitting solution, i.e. not a hack.
>> >     >
>> >     > Thanks,
>> >     >
>> >     > Endre Fülöp
>> >     >
>> >     >