[cfe-dev] LLVM Dev meeting: Slides & Minutes from the Static Analyzer BoF

Tue Nov 3 22:19:29 PST 2015

On Nov 3, 2015, at 9:26 AM, Manuel Klimek <klimek at google.com> wrote:
> I’m sorry I missed this part of the discussion, but IMO the right answer is to build a “CIL” analog to “SIL”.  The problems with the existing Clang CFG are that:
> 
> a) it is not tested as part of IRGen, so it falls out of date.
> b) it is not a proper IR, which can be serialized/deserialized etc.  This makes it very difficult to write tests for.
> c) its “operations” or “instructions” are defined as AST nodes, so its “CILGen” stage doesn’t allow any lowering of operations.
> 
> Those are all arguments for not using the current clang CFG (which I agree with).
> 
> What are your arguments against implementing a type system on top of llvm IR (that lives on a similar level as debug info) that is not language specific per se, but allows frontends to model their language semantics and have pointers back to their AST? Do you believe such a type system would inherently be coupled to language semantics, and thus not possible to build in a generic enough (and still useful) way? Or are there other problems?

Two problems: it doesn’t solve the problems I think need to be solved, and it would end up as a really awkward/inelegant solution even if it could be made to work.

The problems that need to be solved:

1) You need an augmented source level type system to do the transformations that are interesting at this level.  It is the full complexity of the AST represented by Clang, as well as some minor extensions for things that get exposed by the process of lowering.  It is extremely unclear to me how you’d handle this. Using debug information doesn’t work well given that you’ll need multiple types associated with some operations.  Debug info and MDNodes in general would also be an extremely awkward way to express things.

2) You need a full suite of [lowered] source level operations that have little to do with the LLVM IR operations like getelementptr.  Even operations that are common (e.g. load and store) need to be expressed in the source level type system, not the IR type system, so they probably can’t be used (depending on the approach you use for #1).  This can be expressed in LLVM IR as intrinsics, but all the intrinsics would be language specific, so you wouldn’t achieve your language agnostic goal.

3) The only good way I know of to generate good source-level diagnostics (which should include source ranges, etc) is to point back to an AST node (Expr*, Decl*, etc) that it came from.  This mapping is obviously highly frontend-specific, and the lifetime issues managing this are also interesting.  I don’t know how this would be expressed in LLVM IR.

4) In terms of layering of the library stack, Clang should depend on LLVM, and LLVM IR is intentionally very low in the stack.

5) Almost all clients of this data structure would be source-language specific (keep in mind that the type system and operations are all language specific) so there would be little reuse anyway.  You’re right that you’d be able to reuse things like “class BasicBlock”, but that isn’t where the complexity is: things like ilist already do the interesting stuff for it, and are shared.
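
To make points 1 to 3 concrete, here is a minimal sketch of what one lowered operation in a hypothetical "CIL" might carry: a source-level type and a back-pointer to the AST node it came from. Every name here (AstNode, SourceType, CilInst, CilBlock, diagnose) is an assumption made up for illustration; none of these are Clang or LLVM classes, and the real design could look quite different.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-ins for frontend data; these are NOT Clang's real classes.
struct AstNode {            // plays the role of an Expr* / Decl*
  std::string spelling;     // source text of the node
  unsigned line, column;    // start of its source range
};

struct SourceType {         // a source-level type, e.g. "const int *"
  std::string name;
};

enum class Opcode { Load, Store, Call };

// One lowered operation: typed in the source language's type system and
// pointing back at the AST node it was lowered from.
struct CilInst {
  Opcode op;
  const SourceType *type;   // source-level type, not an IR type
  const AstNode *origin;    // non-owning; the AST must outlive the instruction
};

struct CilBlock {
  std::vector<CilInst> insts;
};

// Render a diagnostic located at the instruction's originating AST node.
std::string diagnose(const CilInst &inst, const std::string &message) {
  assert(inst.origin && "every instruction needs an AST back-pointer");
  return std::to_string(inst.origin->line) + ":" +
         std::to_string(inst.origin->column) + ": warning: " + message +
         " in '" + inst.origin->spelling + "'";
}
```

The non-owning `origin` pointer is where the lifetime question from point 3 shows up: the sketch simply assumes the AST outlives every instruction that refers to it.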

When thinking about this, it is important to consider the specific clients that you’d want to support.  Even something like the -Wunreachable diagnostic in clang is totally language specific (what kind of crazy language defaults variables to being uninitialized memory in the first place??).  The most interesting diagnostics in this space that Clang (and the static analyzer) want to reason about tend to be language specific as well (e.g. the objc retain/release checker).

While it is obvious that the typestate engine for a checker like a nil dereference check can be shared, this is true regardless of the IR.  These sorts of state machines are quite simple; the complexity is in the analyses they depend on (e.g. alias analysis, which is pretty language specific) and in the code that deals with each kind of AST node/IR operation.
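
As a rough illustration of how small the shareable part is, here is a sketch of a nil dereference typestate over a straight-line sequence of events. The names (NullState, EventKind, Event, findNullDerefs) are hypothetical, and the sketch deliberately ignores aliasing and control flow, which is exactly where the hard, language-specific work would be.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

enum class NullState { Unknown, Null, NonNull };

// Hypothetical, IR-independent events; a real client would derive these from
// each AST node or IR operation, which is where the language-specific work is.
enum class EventKind { AssignNull, AssignNonNull, Deref };

struct Event {
  EventKind kind;
  std::string var;
};

// Walk a straight-line event sequence and return the variables that are
// dereferenced while known to be null. No aliasing or branching is modeled.
std::vector<std::string> findNullDerefs(const std::vector<Event> &events) {
  std::map<std::string, NullState> state;
  std::vector<std::string> reports;
  for (const Event &e : events) {
    switch (e.kind) {
    case EventKind::AssignNull:
      state[e.var] = NullState::Null;
      break;
    case EventKind::AssignNonNull:
      state[e.var] = NullState::NonNull;
      break;
    case EventKind::Deref:
      // Variables not seen before value-initialize to NullState::Unknown.
      if (state[e.var] == NullState::Null)
        reports.push_back(e.var);
      break;
    }
  }
  return reports;
}
```

Nothing in the state machine itself depends on the representation it runs over; only the code that produces the events does.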

I’ve heard many proposed approaches to encode source level information in IR, but they all have major disadvantages, which is why none of them have been successful.  That said, it could be that you have a specific approach in mind that I haven’t envisioned.  What are you thinking of? 

-Chris
