[cfe-dev] Possible clang features

Fri Jul 25 17:27:06 PDT 2008

On Jul 23, 2008, at 6:43 PM, John Engelhart wrote:

> Here's a couple of possible ideas.  I'm sending them directly to you  
> so you can either put them on the TODO list or toss them based on  
> their merits.

Hi John,

Sorry for the late response.  I've CC'ed the cfe-dev list so that  
others can chime in as well.

>
> 1) Stores that cause GC qualifications to be dropped.
>
> This has bitten me so many times.  The gist is to catch things that  
> are roughly like:
>
> NSString *contextString = @"someContext";
> void *context = contextString;
>
> Ditto for returning values that drop a GC qualification, like:
>
> char *someFunction(void) {
> __strong char *buffer = NSAllocateCollectable(4096, NO_SCAN);
> ...
> return(buffer);
> }

Adding checks for various GC-related properties and invariants is  
something that we think static analysis could excel at.  Several  
people have voiced interest in such checks.  Since there are a variety  
of GC-related checks, I believe the best way to start implementing
these is to come up with a list of specific checks to implement and
then go from there.
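
As a rough illustration of what one such specification might pin down,
here is a minimal sketch of the rule behind both of John's examples: a
value moves from a type the collector treats as strong to one it does
not.  The PtrType model and the check_conversion helper are
hypothetical names invented for this sketch; they are not part of
Clang or the static analyzer, and a real check would work on Clang's
type representation instead.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PtrType:
    """Hypothetical model of a pointer type for this sketch only."""
    spelling: str
    strong: bool  # assumption: True if the collector scans/retains through it


def check_conversion(kind: str, src: PtrType, dst: PtrType) -> List[str]:
    """Return a warning when a conversion drops the strong qualification."""
    if src.strong and not dst.strong:
        return [
            f"warning: {kind} from '{src.spelling}' to '{dst.spelling}' "
            "drops GC qualification"
        ]
    return []


# The model treats object pointers as strong, plain C pointers as not.
NSSTRING = PtrType("NSString *", strong=True)
VOID_PTR = PtrType("void *", strong=False)
STRONG_CHAR = PtrType("__strong char *", strong=True)
CHAR_PTR = PtrType("char *", strong=False)

diagnostics = (
    check_conversion("initialization", NSSTRING, VOID_PTR)   # example 1
    + check_conversion("return", STRONG_CHAR, CHAR_PTR)      # example 2
    + check_conversion("initialization", CHAR_PTR, CHAR_PTR) # no warning
)
for message in diagnostics:
    print(message)
```

A written specification would also need to say which conversions are
intentional (for example, an explicit cast) so the check can stay
quiet about them; the sketch does not model that.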

To build such a task list of GC checks, probably the best thing to do
is to start filing Bugzilla reports (feature requests) against the
static analyzer.  That way I or anyone else interested in implementing
a check can go to the list, see a complete specification for the
check, and go and implement it.

> 2) Peephole optimization recommendations.
>
> Since you've got a fully parsed program, it should be possible to  
> match some generic patterns and offer up some optimization advice.   
> I'm thinking of simple things like:
>
> for(NSUInteger x = 0; x < [anArray count]; x++) {
>
> }
>
> The '[anArray count]' can 'obviously' be hoisted out of the loop,
> but the standard compiler can't make that optimization due to the
> dynamic nature of method dispatch.  Your tool, however, can make the
> recommendation that the programmer take a look at it and offer
> something like the following advice:
>
> Possibly rewrite this for() loop as:
>
> NSUInteger anArrayCount = [anArray count];
> for(NSUInteger x = 0; x < anArrayCount; x++) {
>
> }
>
> You could even have different levels of 'optimization  
> aggressiveness'.  Another common trick I use is to replace '[anArray  
> objectAtIndex:X]' with 'CFArrayGetValueAtIndex(anArray, X)'.  Inside  
> loops, this can often be a pretty big performance win.
>
> Heh, actually, some of these things would be ideal inside Shark.  It  
> would have easy access to the hotspots, and it could offer this kind  
> of 'code cruftifying' optimization advice only in hotspot areas.
>
> I think you get the idea though.  There's definitely a couple of  
> 'low hanging fruit' items that should be trivial to implement.   
> clang is the perfect place to put them in where they can be  
> consistently applied and identified.

This is a great example of automatic code refactoring combined with  
static analysis and profiling information.  Performing "unsound  
optimizations" by suggesting changes to the source code is something  
I'd like to see more of.  It's a little risky, but if done right it
could yield some huge performance improvements for some programs,
especially those written in Objective-C (where, as you said, the
dynamic nature of the language makes it difficult for the compiler to
perform the optimizations you mentioned).
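
To show the shape of such a recommendation, here is a toy sketch that
flags a for() loop whose condition sends a message on every iteration
and suggests hoisting it into a local.  It matches source text with a
regular expression only so that it can run on its own; a real version
would match on Clang's AST and would have to consider whether the loop
body can change the receiver.  The hoist_suggestions function and the
sample input (which uses NSArray's 'count' selector) are made up for
this illustration.

```python
import re
from typing import List

# Matches: for (<init>; <var> < [<receiver> <selector>]; ...
LOOP_COND = re.compile(
    r"for\s*\(\s*[^;]*;\s*(?P<var>\w+)\s*<\s*"
    r"\[(?P<recv>\w+)\s+(?P<sel>\w+)\]\s*;"
)


def hoist_suggestions(source: str) -> List[str]:
    """Suggest hoisting a message send out of a for() loop condition."""
    suggestions = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        match = LOOP_COND.search(line)
        if match is None:
            continue
        recv, sel = match.group("recv"), match.group("sel")
        # Build a local name such as anArrayCount from receiver + selector.
        local = f"{recv}{sel[0].upper()}{sel[1:]}"
        suggestions.append(
            f"line {lineno}: consider hoisting '[{recv} {sel}]' into a "
            f"local, e.g. 'NSUInteger {local} = [{recv} {sel}];'"
        )
    return suggestions


example = """\
for(NSUInteger x = 0; x < [anArray count]; x++) {
}
NSUInteger n = [anArray count];
for(NSUInteger x = 0; x < n; x++) {
}
"""
suggestions = hoist_suggestions(example)
for suggestion in suggestions:
    print(suggestion)
```

Only the first loop is reported; the second already uses a local.
Because the advice is unsound in general, it is phrased as a
suggestion for the programmer to review, not as an automatic rewrite.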

Although we don't have an extensive refactoring library in Clang right
now, we do have some pieces to support refactoring, including a
textual code rewriter that can rewrite fragments of the code in place
(preserving comments, macros, etc.).  Adding more high-level interfaces
to support refactoring applications like these would be a great
contribution to Clang, and I designed the static analysis library in
Clang to be used for a variety of applications (not just finding
bugs).

This would actually be a really interesting project to work on if you  
(or others) were interested.

> 3) Parsed code export.
>
> This one is sort of 'pie in the sky' idea based on a need of mine.   
> I mention it because I'm pretty sure you've got the bulk of the  
> machinery in place to do all your other checks.
>
> For one of my projects, RegexKit (the non-Lite version), I ended up
> writing a documentation system to go along with it
> (http://regexkit.sourceforge.net/Documentation/index.html).
> HeaderDoc just didn't cut it for what I needed.  It still
> follows the header doc /*! @TAG */ style to a large degree, but I  
> ended up tweaking a few things here and there as the need came up.   
> I think you can feed RegexKit's headers in to header doc unmodified  
> and still get something reasonable out the other side.  It wouldn't  
> take much to write some kind of 'scrubber' script so it was 100%  
> HeaderDoc compatible.
>
> This wasn't something that was planned out, it just sort of grew  
> organically as needed.  One of the first needs was to be able to  
> generate a Table Of Contents from all the source headerdoc commented  
> files.  Other stuff got bolted on from there, the end result being  
> about what you'd expect from one hack slapped on top of another hack.
>
> The basic idea is to stuff everything in to an SQLite database.   
> There's a (very messy) perl script that 'parses' header files.   
> Parses is used only in the loosest possible terms because what it  
> really is is a bunch of regex pattern matches that match things that  
> are 'close enough'.  The headerdoc comments are easy to find (/\/\*! 
> (.*?)\*\//), and some of the other bits and pieces are fairly easy  
> to find, such as method and function prototypes.  Since it's really  
> just a bunch of regex patterns, it can be confused easily and is  
> sort of fragile in the face of big changes.
>
> What would be nice is for a real grammar to parse the header and  
> spit out the parsed structure in some kind of 'easy to use' format.   
> You could then write a perl script that would scoop up the easy to  
> use output and do whatever it needed to.
>
> For example, a method prototype would get decoded into its basic
> parts (I'm completely making this up as I type this, so don't expect
> anything reasonable).
>
> - (NSArray *)componentsSeparatedByString:(NSString *)separator;
>
> This is a NSString instance method.  To a full blown parser, it's  
> trivial to separate out the different parts.  It returns a type of  
> 'NSArray *', and the parser knows that NSArray is a class.   
> 'separator' is an argument that is a 'NSString *' type, and NSString  
> is a class.
>
> Basically, some kind of output where all that heavy lifting is done  
> for you.  It's also handy to have back references to where something  
> was defined, such as a class or typedef.
>
> In my particular system, this is 'approximately' what happens,  
> modulo the fact that it uses a couple of heuristics instead of the  
> actual syntax structure to derive some of its information.  The  
> 'parse' script extracts the information and stuffs it in to an  
> SQLite database.  HTML generation happens only after all the .h  
> files have been read in.
>
> HTML generation uses the 'parsed' structure to assist in formatting  
> things inside the HTML.  Because everything is inside a database,  
> when it comes time to output a 'pretty' method prototype in the  
> HTML, it can scan the types for types that exist in the database and  
> automagically place a link around that type that points to the  
> documentation for that type.  Other fancy bits are also possible,
> such as printing the arguments in italic style.  Here's an example
> of the HTML:
>
> <div class="block method">
> <div class="section name"><a
> name="NSMutableString_RegexKitLiteAdditions__-replaceOccurrencesOfRegex:withString:options:range:error:">replaceOccurrencesOfRegex:withString:options:range:error:</a></div>
> <div class="section summary">Replaces all occurrences of the regular  
> expression <span class="argument">regex</span> using <span  
> class="argument">options</span> within <span class="argument">range</span>
> with the contents of <span class="argument">replacement</span>
> string after performing capture group substitutions, returning the  
> number of replacements made.</div>
> <div class="signature"><span class="hardNobr">- (NSUInteger)<span  
> class="selector">replaceOccurrencesOfRegex:</span>(NSString *)<span  
> class="argument">regex</span></span> <span class="hardNobr"><span  
> class="selector">options:</span>(<a href="#RKLRegexOptions"  
> class="code">RKLRegexOptions</a>)<span class="argument">options</span></span>
> <span class="hardNobr"><span
> class="selector">withString:</span>(NSString *)<span  
> class="argument">replacement</span></span> <span  
> class="hardNobr"><span class="selector">range:</span>(NSRange)<span  
> class="argument">range</span></span> <span class="hardNobr"><span  
> class="selector">error:</span>(NSError **)<span  
> class="argument">error</span>;</span></div>
>
> It's a bit messy raw, but the basics are there.  CSS is used  
> extensively to control the visual aspect.  The 'hardNobr' class is  
> '.hardNobr { white-space: nowrap; }', which forces the rendered  
> output to not be broken up.  In this case, it's used to keep  
> 'logically similar' elements together during word breaking.  It  
> looks kinda ugly to break on the space in '(NSString *)'.  :)
>
> Anyways, I think you get the idea.  By understanding the underlying  
> syntactical structure, it's much easier to automatically reformat  
> things in a pleasing documentation friendly way.  It also means  
> someone can declare the method any way they want, with whatever  
> whitespace/newline formatting they want, and I can still squeeze  
> things back in to a documentation consistent form.
>
> I found this to be of particular use for 'enum' and 'struct' like  
> definitions.  As an example, for enums, it becomes trivial to use a
> table for the formatting.  The identifier goes in to one table
> column, and the identifier constant goes in to another.  This lays  
> out things in a neatly aligned and visually consistent fashion, not  
> just a big '<pre></pre>' block and hope for the best.
>
> An unexpected benefit to doing my documentation this way was that  
> when 10.5 came out with DocSet integrated documentation, it was just  
> a matter of writing up a perl script that extracted the information  
> from the database and output it in to a form docset understands.  I  
> literally had working, Xcode integrated docset documentation in  
> under a day.
>
> Again, this is 'pie in the sky' type of stuff.  It's one of those  
> things that you think 'Oh, that's trivially easy!' and you've got
> some rough code in about 20 minutes, or it's a big project.  There's
> lots of neat things (like this documentation example) that you can  
> do if you have access to an easy to use 'parsed structure' output  
> where all the heavy lifting of parsing the file has been done for you.
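
To make the 'parsed structure' idea concrete, here is a minimal sketch
that decodes the componentsSeparatedByString: prototype from the
message above into its basic parts and stores them in an SQLite
database.  The decode_prototype and store functions and the two-table
schema are assumptions made up for this sketch; they are not John's
actual script or schema.  A regular expression stands in for a real
grammar only to keep the example runnable, so it has exactly the
fragility described above (it does not handle types that contain
parentheses, for instance).

```python
import re
import sqlite3

# "- (ReturnType)rest-of-prototype;" or "+ (ReturnType)rest-of-prototype;"
PROTOTYPE = re.compile(
    r"^\s*(?P<kind>[-+])\s*\((?P<ret>[^)]*)\)\s*(?P<rest>.*?)\s*;\s*$"
)
# One "keyword:(Type)name" piece of the selector.
KEYWORD_ARG = re.compile(
    r"(?P<keyword>\w+:)\s*\((?P<type>[^)]*)\)\s*(?P<name>\w+)"
)


def decode_prototype(text):
    """Split an Objective-C method prototype into its basic parts."""
    match = PROTOTYPE.match(text)
    if match is None:
        raise ValueError(f"not a method prototype: {text!r}")
    args = [
        (a.group("keyword"), a.group("type").strip(), a.group("name"))
        for a in KEYWORD_ARG.finditer(match.group("rest"))
    ]
    return {
        "kind": "instance" if match.group("kind") == "-" else "class",
        "return_type": match.group("ret").strip(),
        # A method without arguments has a bare selector and no keywords.
        "selector": "".join(k for k, _, _ in args) or match.group("rest"),
        "arguments": [{"type": t, "name": n} for _, t, n in args],
    }


def store(db, class_name, decoded):
    """Insert one decoded prototype into the hypothetical schema."""
    db.execute("CREATE TABLE IF NOT EXISTS methods "
               "(class_name TEXT, kind TEXT, selector TEXT, return_type TEXT)")
    db.execute("CREATE TABLE IF NOT EXISTS arguments "
               "(selector TEXT, position INTEGER, arg_type TEXT, arg_name TEXT)")
    db.execute("INSERT INTO methods VALUES (?, ?, ?, ?)",
               (class_name, decoded["kind"], decoded["selector"],
                decoded["return_type"]))
    for position, arg in enumerate(decoded["arguments"]):
        db.execute("INSERT INTO arguments VALUES (?, ?, ?, ?)",
                   (decoded["selector"], position, arg["type"], arg["name"]))


decoded = decode_prototype(
    "- (NSArray *)componentsSeparatedByString:(NSString *)separator;"
)
db = sqlite3.connect(":memory:")
store(db, "NSString", decoded)
rows = db.execute("SELECT arg_type, arg_name FROM arguments").fetchall()
print(decoded["kind"], decoded["return_type"], decoded["selector"], rows)
```

Once the parts are in a database, later steps such as HTML or docset
generation only need queries, which is the property the message above
is after.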

I believe your point about 'parsed code export' touches on a larger
issue: people want to build a variety of tools that reason about or
manipulate source code.  This has been hard in the C/C++/Objective-C
space because either the frontend technology is intertwined with the
implementation of a larger component (e.g., a compiler) or the ASTs
(or whatever other structured code representation) cannot be
persistently stored and used by other clients.

Clang is designed to obviate both issues.  First, Clang is built as a
set of libraries.  The lexer, preprocessor, parser, type checker, and
ASTs are all implemented as libraries: modules that can be easily
linked into any tool that wants to use them, be it a command-line
compiler driver, an IDE, or some other tool.  This design is
intentional, and it follows the same guiding principles as the rest of
LLVM.  With a library-based design, people can use the pieces they
want to build their own tools.

We are also working on making Clang's internal representation of its  
ASTs persistent.  We have support for pretty-printing and textual  
dumping (some of it is not completely implemented, but it's getting  
there) so that clients can view a dump of some of Clang's internal  
data structures right from the command line.  Other tools that wish to  
use information from Clang but not use its libraries directly can  
potentially use such output.  You can also define that output in any  
way you wish by adding an appropriate ASTConsumer to the clang driver.

We've also built serialization support into Clang to serialize its  
ASTs out to disk.  This isn't 100% there, but it will provide the  
basis for PCH support in Clang, and the static analyzer will also use  
serialized ASTs to perform inter-procedural analysis across files.   
Such persistent ASTs could be inserted into a database, sent across a  
network connection, etc.

Your example of building an automatic documentation generating system
(e.g., HeaderDoc or Doxygen) is a great one: it is exactly the kind of
tool that reuses the Clang libraries for a purpose other than
compilation.  The current ASTConsumer interface used by the Clang
driver would actually be a great place to start if you wanted to build
such a tool, as clients are essentially streamed the AST of a source
file and they can do what they like with it.  As we bring up more
support for persistent ASTs on disk (especially for use with
inter-procedural analysis), a documentation generating tool could use
those serialized ASTs to perform accurate cross-referencing of
functions, methods, etc. across files.  This would allow the
documentation system to generate really accurate information about
type hierarchies, implementations of classes, macro information, you
name it.
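
The streaming model can be pictured with a small toy, sketched here in
Python instead of against the real C++ interface.  Everything in it is
hypothetical: Decl, DeclConsumer, CrossReferenceIndex and run_driver
are names invented for this sketch and are not Clang's ASTConsumer
API, and the file names and line numbers are invented sample data.  A
driver hands each declaration to a consumer, the consumer records
where types are defined across files, and a documentation generator
then uses that index to link type names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decl:
    """Hypothetical stand-in for one top-level declaration."""
    kind: str   # e.g. "class", "typedef", "method"
    name: str
    file: str
    line: int


class DeclConsumer:
    """Hypothetical interface: called once per streamed declaration."""

    def handle_decl(self, decl):
        raise NotImplementedError


class CrossReferenceIndex(DeclConsumer):
    """Remembers where each class or typedef was declared."""

    def __init__(self):
        self.locations = {}

    def handle_decl(self, decl):
        if decl.kind in ("class", "typedef"):
            self.locations[decl.name] = (decl.file, decl.line)

    def link(self, type_spelling):
        """Wrap a known type name in a link; leave unknown types alone."""
        base = type_spelling.rstrip(" *")
        if base not in self.locations:
            return type_spelling
        suffix = type_spelling[len(base):]
        return f'<a href="#{base}" class="code">{base}</a>{suffix}'


def run_driver(files, consumer):
    """Stream every declaration of every file to the consumer."""
    for decls in files.values():
        for decl in decls:
            consumer.handle_decl(decl)


files = {  # invented sample data
    "NSArray.h": [Decl("class", "NSArray", "NSArray.h", 10)],
    "NSString.h": [
        Decl("class", "NSString", "NSString.h", 12),
        Decl("method", "componentsSeparatedByString:", "NSString.h", 40),
    ],
}
index = CrossReferenceIndex()
run_driver(files, index)
linked = index.link("NSArray *")
print(linked)
```

The consumer never parses anything itself; it only reacts to what it
is handed, which is what would let a documentation tool stay small.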

In a similar way, I believe a whole cadre of other tools could be  
built.  We are really excited about Clang because we see it as an  
enabling technology to build great source-level tools for
C/C++/Objective-C.

BTW, I like the idea of a Clang-based automatic documentation system  
so much that I added it to our list of possible projects for people to  
do on the Clang website.

- Ted