[cfe-dev] my experience with clang

Sat Jan 12 20:42:35 PST 2008

On Jan 9, 2008, at 2:42 PM, Nuno Lopes wrote:

> The PHP interpreter has the following function:
> int zend_parse_parameters(int num_args, char *type_spec, ...);
>
> it is usually used like this:
> zend_parse_parameters(ZEND_NUM_ARGS(), "s|l", &str, &str_len,  
> &number);

OK.  This example really helps.

> The problem is that the number and type of arguments depend on the  
> format string. In this case it receives a string (str + length) and  
> a long (optional). No compiler is currently able (AFAIK) to check if  
> the function is called correctly. Also, 'number' might not be  
> initialized, while str and str_len do (if the function doesn't  
> return FAILURE).

So if I understand correctly, zend_parse_parameters has the following  
postcondition:

"return value" != FAILURE   =>   str == INITIALIZED, str_len == INITIALIZED

"return value" == FAILURE   =>   str == UNINITIALIZED, str_len == UNINITIALIZED

What you would like to do is expand the "uninitialized values"  
analysis to take into account the "return value" so that you can flag  
possible bad uses of "str" and "str_len"?
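
To make that postcondition concrete, here is a minimal sketch in C. It does not use the real PHP API: parse_stub, checked_use, and the SUCCESS/FAILURE values are hypothetical stand-ins, and the stub hardwires the "s|l" layout instead of reading a real argument list. It only shows which outputs are written on which outcome, and the caller pattern a checker would want to enforce.

```c
#include <stdarg.h>

/* Assumed return codes for this sketch only. */
enum { SUCCESS = 0, FAILURE = -1 };

/* Hypothetical stand-in for zend_parse_parameters with the "s|l" spec:
 * one required string (pointer + length) and one optional long. */
static int parse_stub(int num_args, const char *type_spec, ...)
{
    va_list ap;

    if (num_args < 1 || num_args > 2)
        return FAILURE;              /* outputs are left untouched */

    va_start(ap, type_spec);
    const char **str = va_arg(ap, const char **);
    int *str_len = va_arg(ap, int *);
    long *number = va_arg(ap, long *);

    *str = "hello";                  /* required outputs: always written */
    *str_len = 5;
    if (num_args == 2)
        *number = 42;                /* optional output: written only if present */
    va_end(ap);
    return SUCCESS;
}

/* Returns str_len + number on success, -1 on failure. */
long checked_use(int num_args)
{
    const char *str;   /* uninitialized until a successful parse */
    int str_len;       /* likewise */
    long number = 0;   /* optional, so it needs a default value */

    if (parse_stub(num_args, "s|l", &str, &str_len, &number) == FAILURE)
        return -1;     /* str and str_len must not be read on this path */

    return (long)str_len + number;   /* safe: the parse succeeded */
}
```

Reading str_len before the FAILURE check, or leaving number without a default, are the two mistakes such a checker would aim to report.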

> I implemented a simple checker with clang to verify the parameter  
> types. I mentioned that I need to port it to the liveness analyzer

I think you mean the "uninitialized values" analyzer, not the  
"liveness analyzer."  They are two completely different concepts.   
Liveness determines if the value in a variable will ever be used after  
a given point.  Uninitialized values determines if the value in a
variable is garbage, regardless of whether or not the value will be
used later.  Further, in an optimizing compiler both analyses are
forms of dataflow analysis, except that "uninitialized values" is a
forward dataflow analysis (information propagates forward in the CFG)
and "liveness" is a backward dataflow analysis (information propagates
backwards in the CFG).  This is an implementation detail, but it
does illustrate that they are two separate concepts.
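
A small C function can show the two properties side by side. This is only an illustrative sketch; the function name demo and its variables are made up for the example, and the comments describe what each analysis would conclude.

```c
/* Illustrates "liveness" versus "uninitialized values" on two variables. */
int demo(int flag)
{
    int a = 1;   /* initialized, but NOT live: overwritten below before any read */
    int b;       /* live later (read in the return), but uninitialized here */

    a = 2;       /* this value of a is live: it reaches the return */

    if (flag)
        b = 3;
    else
        b = 4;   /* without this branch, b would be garbage when flag == 0 */

    return a + b;
}
```

The first store to a is a liveness finding (a dead store), while a missing assignment to b on one branch would be an uninitialized values finding. Neither result implies the other.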

> because I want to check if the parameters after the '|' are used  
> before initialization

Let me see if I understand what you mean.  After a call to
"zend_parse_parameters", you want to track the possible
initialized/uninitialized state of the "str" and "str_len" arguments
(which depends on the "return value" of zend_parse_parameters).  If
"str" or "str_len" (or whatever other variables were passed as
arguments) is used while it could be in the "uninitialized" state, you
want to flag an error.  Is this what you mean?

> and if the ones before are not initialized unnecessarily.

I'm not certain what you mean by "not initialized unnecessarily."

> I doubt that anytime soon compilers will be able to analyze these  
> varargs functions automatically (well, you could try to do use some  
> heuristics, like searching for a switch, but..), so my idea was to  
> expose some kind of API to the programmers to allow them to specify  
> some arbitrary function to validate the arguments.
> GCC supports the following:
> void my_printf(const char *format, ...)   
> __attribute__((format(printf, 1, 2)));
>
> but GCC only supports the printf and scanf functions. My idea was to  
> generalize this, by allowing the user to specify some function  
> (without touching in the compiler's code).
> While the idea seems fairly acceptable, I don't have any syntax  
> proposal.

There was some interesting work on ESC/Java on providing powerful,  
logic-based annotations to functions and classes.  The annotations  
were injected in comments, and a hacked Java parser would read those  
comments (similar to parsing Javadoc comments) and use those  
annotations to describe pre- and postconditions for functions, classes,
and so on.  Some of the preconditions/postconditions one could
associate with a function were extremely expressive; the downside is
that they could require an expensive theorem prover to actually verify
that the conditions would hold.  On the other hand, the syntax of
these annotations was actually not all that gross, although adding  
parser support for comment-based annotations for C/C++ is much more of  
a challenge because these languages are far messier in their syntactic  
structure.  Adding attributes to support such stuff might be  
reasonable as well, as long as the logic-based annotations were  
embedded in a quoted string within the attributes.
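
As a rough sketch of what an attribute carrying a quoted annotation could look like, the C fragment below uses clang's generic annotate attribute behind a macro. The CONTRACT macro, the parse_args function, and the contract string syntax are all hypothetical; nothing here is an existing checker interface, and no tool is assumed to interpret the string.

```c
/* Hypothetical CONTRACT macro: on clang it attaches the text via the generic
 * annotate attribute; on other compilers it expands to nothing. */
#if defined(__clang__)
#  define CONTRACT(text) __attribute__((annotate(text)))
#else
#  define CONTRACT(text)
#endif

/* Made-up contract syntax, shown only to illustrate the quoted-string idea. */
int parse_args(int num_args, const char *type_spec, ...)
    CONTRACT("ret != FAILURE => initialized(arg3), initialized(arg4)");

/* Trivial definition so the fragment compiles and links on its own. */
int parse_args(int num_args, const char *type_spec, ...)
{
    (void)type_spec;
    return num_args > 0 ? 0 : -1;
}
```

Keeping the logic inside a string means the compiler's parser needs no new grammar; only the checking tool has to understand the contract language.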

I'm not proposing, however, that we implement ESC/Java for clang.
Still, a subset of those features might be extremely useful, as it is
better to encode the contract associated with a function's interface
in the actual source code (e.g. header files) instead of hardwiring
such knowledge into a specific tool.  This not only allows the tool to
become more extensible as more code is annotated, but also means that
the knowledge is more portable, and doesn't die out when a specific
tool dies out.

The other thing that I would like to mention is that the particular
property you are describing requires a little more than extending a
flow-sensitive uninitialized values analysis.  Because the
uninitialized/initialized state of "str" and "str_len" depends on the
return value of zend_parse_parameters, it almost inherently becomes a
path-sensitive property if you want to check it with any real
precision.  We will likely extend the uninitialized values analysis to
work in the new path-sensitive dataflow engine that we are building; in
that case adding such information might actually be pretty easy and
should give you the precision that you need to avoid spitting out too
much noise to the user.
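
The difference between the two kinds of analysis can be sketched with a short C example. The names try_get and correlated, and the SUCCESS/FAILURE values, are assumptions made for illustration; the comments describe how each style of analysis would plausibly reason, not the behavior of any particular tool.

```c
/* Assumed return codes for this sketch only. */
enum { SUCCESS = 0, FAILURE = -1 };

/* Hypothetical callee: writes *out only when it succeeds. */
static int try_get(int ok, int *out)
{
    if (!ok)
        return FAILURE;   /* *out is left untouched */
    *out = 7;
    return SUCCESS;
}

int correlated(int ok)
{
    int len;                       /* uninitialized */
    int rc = try_get(ok, &len);
    int result = -1;

    /* A flow-sensitive analysis keeps one fact per program point, so after
     * the call it merges both outcomes and can only say that len is
     * "maybe uninitialized".  It would then warn on the read below. */
    if (rc != FAILURE)
        result = len;              /* path-sensitive view: rc != FAILURE holds on
                                      this path, so len is known to be written */

    return result;
}
```

Tracking the correlation between rc and len along each path is what removes the false warning on the guarded read.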