[cfe-dev] LLVM Dev meeting: Slides & Minutes from the Static Analyzer BoF
Manuel Klimek via cfe-dev
cfe-dev at lists.llvm.org
Wed Nov 4 12:54:46 PST 2015
On Wed, Nov 4, 2015 at 11:57 AM Anna Zaks <ganna at apple.com> wrote:
>
> On Nov 4, 2015, at 11:22 AM, Manuel Klimek <klimek at google.com> wrote:
>
>
>
> On Wed, Nov 4, 2015 at 11:04 AM Anna Zaks <ganna at apple.com> wrote:
>
>> On Nov 4, 2015, at 10:25 AM, Manuel Klimek <klimek at google.com> wrote:
>>
>> On Tue, Nov 3, 2015 at 10:19 PM Chris Lattner <clattner at apple.com> wrote:
>>
>>> On Nov 3, 2015, at 9:26 AM, Manuel Klimek <klimek at google.com> wrote:
>>>
>>> I’m sorry I missed this part of the discussion, but IMO the right
>>> answer is to build a “CIL” analog to “SIL”. The problems with the existing
>>> Clang CFG are that:
>>>
>>>>
>>>> a) it is not tested as part of IRGen, so it falls out of date.
>>>> b) it is not a proper IR, which can be serialized/deserialized etc.
>>>> This makes it very difficult to write tests for.
>>>> c) its “operations” or “instructions" are defined as AST nodes, so its
>>>> “CILGen” stage doesn’t allow any lowering of operations.
>>>>
>>>
>>> Those are all arguments for not using the current clang CFG (which I
>>> agree with).
>>>
>>> What are your arguments against implementing a type system on top of
>>> llvm IR (that lives on a similar level as debug info) that is not language
>>> specific per se, but allows frontends to model their language semantics and
>>> have pointers back to their AST? Do you believe such a type system would
>>> inherently be coupled to language semantics, and thus not possible to build
>>> in a generic enough (and still useful) way? Or are there other problems?
>>>
>>>
>>> Two problems: it doesn’t solve the problems I think need to be solved,
>>> and it would end up with a really awkward/inelegant solution if it could be
>>> made to work.
>>>
>>> The problems that need to be solved:
>>>
>>> 1) You need an augmented source level type system to do the
>>> transformations that are interesting at this level. It is the full
>>> complexity of the AST represented by Clang, as well as some minor
>>> extensions for things that get exposed by the process of lowering. It is
>>> extremely unclear to me how you’d handle this. Using debug information
>>> doesn’t work well given that you’ll need multiple types associated with
>>> some operations. Debug info and MDNodes in general would also be an
>>> extremely awkward way to express things.
>>>
>>
>> I agree this is needed, and that it would need a well thought out design.
>> I'd also think some explorative coding would be necessary to identify how
>> we'd do that (or whether it's even possible).
>>
>>
>>> 2) You need a full suite of [lowered] source level operations that have
>>> little to do with the LLVM IR operations like getelementptr. Even
>>> operations that are common (e.g. load and store) need to be expressed in
>>> the source level type system, not the IR type system, so they probably
>>> can’t be used (depending on the approach you use for #1). This can be
>>> expressed in LLVM IR as intrinsics, but all the intrinsics would be
>>> language specific, so you wouldn’t achieve your language agnostic goal.
>>>
>>
>> I'd expect the LLVM IR to be annotated with the types; I agree that we'd
>> need to annotate basically all generated IR with the types, and that that
>> would be a substantial effort.
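>>
>> To make that a bit more concrete, I'm picturing something along these lines
>> (purely illustrative; "cil.type" is an invented metadata kind, and as Chris
>> notes above, MDNodes may well turn out to be too awkward for this in
>> practice):
>>
>>   #include "llvm/ADT/StringRef.h"
>>   #include "llvm/IR/Instruction.h"
>>   #include "llvm/IR/LLVMContext.h"
>>   #include "llvm/IR/Metadata.h"
>>
>>   // Attach a source-level type name to an IR instruction as metadata.
>>   // A real design would need structured type descriptors, not strings.
>>   void annotateWithSourceType(llvm::Instruction *I, llvm::StringRef TypeName) {
>>     llvm::LLVMContext &Ctx = I->getContext();
>>     llvm::Metadata *Ops[] = {llvm::MDString::get(Ctx, TypeName)};
>>     I->setMetadata("cil.type", llvm::MDNode::get(Ctx, Ops));
>>   }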
>>
>>
>>> 3) The only good way I know of to generate good source-level diagnostics
>>> (which should include source ranges, etc) is to point back to an AST node
>>> (Expr*, Decl*, etc) that it came from. This mapping is obviously highly
>>> frontend-specific, and the lifetime issues managing this are also
>>> interesting. I don’t know how this would be expressed in LLVM IR.
>>>
>>
>> I'd expect to have a mechanism in the new higher level type system to
>> point back at frontend specific nodes, if the frontend chooses to do so. I
>> agree that lifetime management would be interesting, but it seems like a
>> straightforward engineering problem rather than an insurmountable one (I'm
>> much more concerned about whether it's possible to define a higher level
>> type system that is language agnostic enough to fit LLVM, yet expressive
>> enough to fit the use cases of the frontends).
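>>
>> Sketching what such a backref mechanism might look like (all names here are
>> invented; this is just to illustrate the shape of the interface):
>>
>>   #include <string>
>>
>>   struct SourceLoc { std::string File; unsigned Line = 0, Col = 0; };
>>
>>   // The language-independent layer only sees an opaque handle plus a way to
>>   // ask the frontend for diagnostic locations; a Clang frontend would keep
>>   // an Expr*/Decl* behind this and translate its SourceRange here.
>>   class FrontendNodeRef {
>>   public:
>>     virtual ~FrontendNodeRef() = default;
>>     virtual SourceLoc getStartLoc() const = 0;
>>     virtual SourceLoc getEndLoc() const = 0;
>>   };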
>>
>>
>> The analysis code that uses those AST nodes would also be language
>> specific. At the SIL level, we sometimes look up the AST nodes and make
>> analysis decisions based on them (in addition to using AST for determining
>> diagnostic locations). The clang static analyzer also uses AST extensively.
>> Utilizing the AST allows us to have language specific heuristics which are
>> very important for dealing with both false positives and false negatives.
>>
>>
>>
>>> 4) In terms of layering of the library stack, Clang should depend on
>>> LLVM, and LLVM IR is intentionally very low in the stack.
>>>
>>
>> Agreed, and it should stay that way. I'd expect 2 pieces to live in LLVM:
>> - a higher level type system that we can annotate IR with, and which will
>> be kept sound by a subset of the passes (especially the early ones)
>> - interfaces in that type system for frontend specific callbacks, so
>> frontends can store backrefs to their AST nodes for accurate diagnostics if
>> they choose to do so
>> Do you think that would already contradict the layering requirements?
>>
>>
>>> 5) Almost all clients of this data structure would be source-language
>>> specific (keep in mind that the type system and operations are all language
>>> specific) so there would be little reuse anyway. You’re right that you’d
>>> be able to reuse things like “class BasicBlock”, but that isn’t where the
>>> complexity is: things like ilist already do the interesting stuff for it,
>>> and is shared.
>>>
>>> When thinking about this, it is important to consider the specific
>>> clients that you’d want to support.
>>>
>>
>> Agreed.
>>
>>
>>> Even things like the -Wunreachable diagnostic in clang are totally
>>> language specific (what kind of crazy language defaults variables to being
>>> uninitialized memory in the first place??). The most interesting
>>> diagnostics in this space that Clang (and the static analyzer) want to
>>> reason about tend to be language specific as well (e.g. the objc
>>> retain/release checker).
>>>
>>
>> On the other hand, we have some evidence for checks that are less
>> language specific (or have very generic components). Thread safety analysis
>> comes to mind. We're basically building the same things for all languages;
>> there are language specific pieces we need the frontends to generate, but
>> the gist of the issue is a lower level type checking system.
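>>
>> As a rough illustration of what I mean by the language independent core
>> (names invented, details elided): the heart of such an analysis is little
>> more than a transfer function over a set of held capabilities, and only the
>> mapping from language constructs to acquire/release/guarded-access events
>> is frontend specific.
>>
>>   #include <set>
>>   #include <string>
>>
>>   // Per-path facts for a generic thread-safety check.
>>   struct CapabilitySet {
>>     std::set<std::string> Held;  // capabilities currently held on this path
>>
>>     void acquire(const std::string &Cap) { Held.insert(Cap); }
>>     void release(const std::string &Cap) { Held.erase(Cap); }
>>     bool accessAllowed(const std::string &RequiredCap) const {
>>       return Held.count(RequiredCap) != 0;
>>     }
>>   };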
>>
>>
>> I would argue that the thread safety analyses you are talking about were
>> specifically designed to target several languages, which makes them generic
>> by design. (For example, threading models of Go and C++ are very different.)
>>
>> Once you have a static analysis framework, it is very important to make
>> writing checks for it easy. There are a lot of checks to be written and
>> many more people are interested in writing checks than in developing the
>> core. Lowering the barrier for entry there is important. Asking checker
>> writers to understand another type system and map their checks onto another
>> language increases the barrier for entry, which is already very high.
>>
>
> I agree that making writing checks as easy as possible is one of the core
> considerations; on the other hand, making it possible for people to do what
> they need is another important consideration.
> I believe there are many considerations for when people create static
> analysis checkers:
> 1. for example, it's easy to find researchers who have based their
> research / applications on LLVM IR, especially for full program / code base
> analysis; so I believe there is a real need for that level of infrastructure
>
>
> Working with a SIL-like representation would be much easier than consuming
> the AST. Once this option is available, I am sure more researchers would
> use it. A lot of them mainly care about producing results quickly, and
> walking the entire clang AST is not quick to implement. Once you have a smaller
> IR, that problem is solved. Having that IR map back onto the AST is an
> add-on that can be used when needed.
>
I don't yet understand how that would be different for LLVM IR based
analysis.
> 2. if you create a SIL-equivalent for C++, you are asking people to learn
> another abstraction in addition to the C++ AST; even for the static
> analyzer you have to learn the static analyzer CFG, which has its own
> idiosyncrasies
>
>
> (The current clang static analyzer checker writers do not need to know
> about the CFG. They are exposed to a limited set of APIs that require
> limited understanding of symbolic execution and clang ASTs.
>
It seems to me that checkers do look at AST nodes, so I'm not sure I agree
that the API severely limits their exposure to the AST.
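For instance, even a minimal path-sensitive checker callback is handed AST
nodes directly (rough, untested sketch; registration and reporting omitted):

  #include "clang/StaticAnalyzer/Core/Checker.h"
  #include "clang/StaticAnalyzer/Core/PathSensitive/CallEvent.h"
  #include "clang/StaticAnalyzer/Core/PathSensitive/CheckerContext.h"

  using namespace clang;
  using namespace ento;

  class ExampleChecker : public Checker<check::PreCall> {
  public:
    void checkPreCall(const CallEvent &Call, CheckerContext &C) const {
      const Decl *Callee = Call.getDecl();        // AST decl of the callee
      const Expr *Origin = Call.getOriginExpr();  // AST expr for the call site
      // A real checker would consult C.getState() and emit a report here.
      (void)Callee; (void)Origin; (void)C;
    }
  };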
> http://clang.llvm.org/doxygen/CheckerDocumentation_8cpp_source.html)
>
> The main point here is that people will have to write checks that work on
> a different language than the one they care about; for example, the type
> system would be different. When helping people write their first
> checkers, we often hear mention of specific C/C++ constructs and types.
> They'd have to generalize their checks to the other type system before they
> can proceed.
>
I agree that for a single language this is probably the right trade-off.
On the other hand, now we have to write a SIL for every frontend, and then
have everybody rewrite their checkers (even if they are in principle
language independent) for each frontend-SIL.
The alternative is that somebody writing a checker has to learn how the
language maps to a language-independent type system, or that we provide
good abstractions on the frontend-level for how to do those transformations.
I'm not claiming I know that this is even possible, but I do think it is
worth considering, given the upsides it has across the LLVM platform.
> 3. while the heuristic based approaches are interesting (we're going after
> them ourselves with the pure AST based clang-tidy checks after all), we
> also want high precision static analysis
>
>
> Augmenting high precision static analysis with AST-based heuristics is
> very valuable. In most cases, we are going after undecidable problems and
> achieving good false positive to false negative ratio is tough.
>
Sure. That can be achieved by references and interfaces back into the
frontend specific parts, or by expressing those heuristics in the type
system.
> 4. there are many ways to lower the barrier to entry by
> creating higher level abstractions that people understand, instead of
> limiting the core of the infrastructure
>
>
> See the answer to #2. Also, I do not think we would be limiting the core
> of the infrastructure with a SIL-like approach.
>
Well, it would be limiting in that it would not help outside the domain of
a single language.
>
> That said, I fully agree that we'll always need different levels of "easy
> to write" vs. "powerful" for static analysis.
>
>
>
>> Another thing we'd like is to use more static analysis for
>> optimization (for example, devirtualization; iirc in your talk you
>> mentioned you use SIL for that). I am not an expert here, so I'll believe
>> you if you say this is not possible :) On the other hand, some of the ideas
>> behind dependent type systems look to me like they could be useful in llvm
>> (dependent type parameters, for example), and they seem non-trivial enough
>> that re-implementing them in a new IR for each targeted language would be
>> very costly; doing this once in LLVM could instead make it as a platform
>> interesting to a new set of applications.
>>
>>
>>> While it is obvious that the typestate engine for a checker like a nil
>>> dereference check can be shared, this is true regardless of the IR.
>>> These sorts of state machines are quite simple, the complexity is in the
>>> analyses they depend on (e.g. alias analysis, which is pretty language
>>> specific) and in the code that deals with each kind of AST node/IR
>>> operation.
>>>
>>
>> Yep, I think the main challenge will be to come up with a type system on
>> top of LLVM IR that is expressive enough so frontends can map their types
>> to it in a way that we can write language independent analysis passes. The
>> proof is in the pudding, of course, but I'm not (yet) convinced it's
>> impossible :D
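>>
>> To illustrate Chris's point that the state machines themselves are simple,
>> the shareable typestate core really is tiny, roughly something like this
>> (sketch only; the hard parts, aliasing and the mapping from IR/AST
>> operations onto these events, stay language specific):
>>
>>   // Shared, language-independent part of a null-dereference style check.
>>   enum class PtrState { Unknown, Null, NonNull };
>>
>>   inline PtrState onAssumeNull(PtrState)    { return PtrState::Null; }
>>   inline PtrState onAssumeNonNull(PtrState) { return PtrState::NonNull; }
>>   inline bool isErrorOnDeref(PtrState S)    { return S == PtrState::Null; }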
>>
>>
>>> I’ve heard many proposed approaches to encode source level information
>>> in IR, but they all have major disadvantages, which is why none of them
>>> have been successful. That said, it could be that you have a specific
>>> approach in mind that I haven’t envisioned. What are you thinking of?
>>>
>>
>> I agree that encoding source level information is probably not enough;
>> I'd expect that we actually need to encode a language independent type
>> system on top of LLVM IR; so far I'm not aware of anybody having tried that
>> - if somebody has, I'd be interested to learn more about the attempts and
>> their shortcomings.
>>
>> Cheers,
>> /Manuel
>>
>>
>>>
>>> -Chris
>>>
>>>
>