[cfe-dev] Adding more info to builtin source types.

Thu Jan 14 15:30:37 PST 2010

On Jan 12, 2010, at 3:13 AM, Enea Zaffanella wrote:

> Douglas Gregor wrote:
>> Hello Enea,
>> On Jan 11, 2010, at 1:06 PM, Enea Zaffanella wrote:
>>> Currently, class BuiltinTypeLoc provides a single source location,  
>>> accessible via member getNameLoc(), pointing to the name of the  
>>> type specifier type (e.g., for 'unsigned int' it would return the  
>>> location of 'int'). It is quite common for coders to declare  
>>> builtin integral types using sign or width type specifiers only  
>>> (e.g., 'unsigned', 'long', 'unsigned short', etc.). In these cases  
>>> the getNameLoc() method would return an invalid source location.
>> What specific problem do you need to address? Do you just want  
>> getNameLoc() to have a valid location (which points to the sign or  
>> width if there is no actual type named), or do you want fine- 
>> grained information about where the width and sign keywords were  
>> actually located? The former could be done with a simple change  
>> where we build BuiltinTypeLocs (without increasing memory usage),  
>> while the latter requires more work (as you've done in this patch).
>
> In our specific case, the answer lies somehow in the middle.
> Besides the need for a valid source location, we do also want to know
> whether or not the type, sign and width specifiers were written
> (no matter if directly or indirectly via macro expansions).
> So, once the source location is valid, three additional bits would be
> enough for our specific purposes.
>
>
>>> The attached patch augments BuiltinTypeLoc class by adding two  
>>> source locations, one for the sign specifier and the other for the  
>>> width specifier. We have added methods to get/set the three source  
>>> locations.
>>> We also modified method getNameLoc() so as to return what is  
>>> _likely_ to be the source location of the first type specifier  
>>> token in the source code, assuming that the three specifiers are  
>>> provided in the "canonical" order:
>>> <sign> <width> <type>
>>>
>>> Please let us know if the patch is OK.
>> I have a few comments on this patch.
>> My biggest concern is that we're increasing the amount of storage  
>> needed for very common cases, and I'd like us to both try to  
>> minimize the amount of storage needed and to measure the actual  
>> cost of this change for some non-trivial header, to figure out how  
>> much extra memory its using in practice.
>> To minimize the amount of storage needed, we don't always want  
>> BuiltinTypeLoc to store 3 locations, because only the integral  
>> types actually need to store sign and width information, and that's  
>> only written in the source code occasionally. So, BuiltinTypeLoc  
>> should have a variable-length encoding dependent on the actual  
>> builtin type: types that always only have one source location  
>> (void, bool, float, double, nullptr, id, Class, SEL) should only  
>> store that location. For types that may have more than one location  
>> (e.g., the integral types), the BuiltinTypeLoc could encode which  
>> specifiers exist (in some prefix bytes) and then only have source  
>> locations for those keywords that actually show up in the source.
>
> Ok, we have made some measurements on gcc.c as an example.
> This is a 20MB source that takes up as much as 284MB of memory
> (as read from column RES of top(1)) when compiled using
> clang -cc1 -fsyntax-only.

284MB sounds really high for 20MB of source :(

> The memory size overhead due to the (current) patch is ~530000 bytes.
>
> Based on what we observed, an alternative (minimal) solution using
> 3 bits for just the integer builtin types would reduce this overhead
> to ~133000 bytes.
> [NOTE: we are assuming a 4 bytes field to store these 3 bits, because:
>  a) as far as we know, there are no free bits in a SourceLocation;
>  b) we do not want to disrupt memory alignment by using a char.]

Right, that makes sense.

> We can of course improve our current patch as you suggest (storing the
> source locations, but only when needed): this would result in an
> overhead of ~176000 bytes.

I actually like the idea of using 3 bits to say which of type/ 
signedness/width were specified in the type, then just making sure  
that the SourceLocation points to something that makes sense... the  
type specifier location if it's there, otherwise the signedness or  
width location. That gives us a lot more information in the AST  
without costing much at all.

>
>> +  SourceRange getSourceRange() const {
>> +    SourceLocation loc = getNameLoc();
>> +    return SourceRange(loc, loc);
>> +  }
>> If we're going to keep the source locations for the type/width/ 
>> sign, it'd be nice to the source range to cover the entire type.  
>> Storing the specifiers in order (with some tag saying which source  
>> location refers to which kind of keyword) would make this possible.
>
> This would be desirable, but it is (to our eyes) a bit complicated.
>
> We extract location info from DeclSpec: even though here we have a
> source range, this would probably include other stuff such as the
> storage class specifier or type qualifiers, which is not really
> relevant for builtin types. Anyway, if you think that this is OK,
> adding the source range to the BuiltinTypeLoc will be easy:
> it could be combined with any of the three solutions mentioned
> above (NOTE: having the source range, we could even drop all
> the source locations used above, but we would still need at
> least the three additional bits).
>
> As for the ordering of the specifiers: there seems to be no info
> about it in class DeclSpec.

Ah, okay. Let's not worry about the source range; a single *useful*  
location is good enough.

>> Of course, depending on how you answered the question at the top of  
>> my e-mail, all of this might be moot: if you just want  
>> BuiltinTypeLoc::getNameLoc() to point to some valid location for a  
>> integral type described only by its width or sign, that's easy to  
>> fix without any performance impact.
>>    - Doug
>
> All considered, it is mainly a matter of deciding what to do.
> To our eyes, the overhead seems not to be a real issue.

I'm a bit more paranoid about memory overhead and PCH size.

	- Doug