[cfe-dev] Adding more info to builtin source types.

Tue Jan 12 03:13:37 PST 2010

Douglas Gregor wrote:
> Hello Enea,
> 
> On Jan 11, 2010, at 1:06 PM, Enea Zaffanella wrote:
>> Currently, class BuiltinTypeLoc provides a single source location, 
>> accessible via member getNameLoc(), pointing to the name of the type 
>> specifier type (e.g., for 'unsigned int' it would return the location 
>> of 'int'). It is quite common for coders to declare builtin integral 
>> types using sign or width type specifiers only (e.g., 'unsigned', 
>> 'long', 'unsigned short', etc.). In these cases the getNameLoc() 
>> method would return an invalid source location.
> 
> What specific problem do you need to address? Do you just want 
> getNameLoc() to have a valid location (which points to the sign or width 
> if there is no actual type named), or do you want fine-grained 
> information about where the width and sign keywords were actually 
> located? The former could be done with a simple change where we build 
> BuiltinTypeLocs (without increasing memory usage), while the latter 
> requires more work (as you've done in this patch).

In our specific case, the answer lies somehow in the middle.
Besides the need for a valid source location, we do also want to know
whether or not the type, sign and width specifiers were written
(no matter if directly or indirectly via macro expansions).
So, once the source location is valid, three additional bits would be
enough for our specific purposes.

>> The attached patch augments BuiltinTypeLoc class by adding two source 
>> locations, one for the sign specifier and the other for the width 
>> specifier. We have added methods to get/set the three source locations.
>> We also modified method getNameLoc() so as to return what is _likely_ 
>> to be the source location of the first type specifier token in the 
>> source code, assuming that the three specifiers are provided in the 
>> "canonical" order:
>>  <sign> <width> <type>
>>
>> Please let us know if the patch is OK.
> 
> I have a few comments on this patch.
> 
> My biggest concern is that we're increasing the amount of storage needed 
> for very common cases, and I'd like us to both try to minimize the 
> amount of storage needed and to measure the actual cost of this change 
> for some non-trivial header, to figure out how much extra memory its 
> using in practice.
> 
> To minimize the amount of storage needed, we don't always want 
> BuiltinTypeLoc to store 3 locations, because only the integral types 
> actually need to store sign and width information, and that's only 
> written in the source code occasionally. So, BuiltinTypeLoc should have 
> a variable-length encoding dependent on the actual builtin type: types 
> that always only have one source location (void, bool, float, double, 
> nullptr, id, Class, SEL) should only store that location. For types that 
> may have more than one location (e.g., the integral types), the 
> BuiltinTypeLoc could encode which specifiers exist (in some prefix 
> bytes) and then only have source locations for those keywords that 
> actually show up in the source.

Ok, we have made some measurements on gcc.c as an example.
This is a 20MB source that takes up as much as 284MB of memory
(as read from column RES of top(1)) when compiled using
clang -cc1 -fsyntax-only.

The memory size overhead due to the (current) patch is ~530000 bytes.

Based on what we observed, an alternative (minimal) solution using
3 bits for just the integer builtin types would reduce this overhead
to ~133000 bytes.
[NOTE: we are assuming a 4 bytes field to store these 3 bits, because:
   a) as far as we know, there are no free bits in a SourceLocation;
   b) we do not want to disrupt memory alignment by using a char.]

We can of course improve our current patch as you suggest (storing the
source locations, but only when needed): this would result in an
overhead of ~176000 bytes.

> +  SourceRange getSourceRange() const {
> +    SourceLocation loc = getNameLoc();
> +    return SourceRange(loc, loc);
> +  }
> 
> If we're going to keep the source locations for the type/width/sign, 
> it'd be nice to the source range to cover the entire type. Storing the 
> specifiers in order (with some tag saying which source location refers 
> to which kind of keyword) would make this possible.

This would be desirable, but it is (to our eyes) a bit complicated.

We extract location info from DeclSpec: even though here we have a
source range, this would probably include other stuff such as the
storage class specifier or type qualifiers, which is not really
relevant for builtin types. Anyway, if you think that this is OK,
adding the source range to the BuiltinTypeLoc will be easy:
it could be combined with any of the three solutions mentioned
above (NOTE: having the source range, we could even drop all
the source locations used above, but we would still need at
least the three additional bits).

As for the ordering of the specifiers: there seems to be no info
about it in class DeclSpec.

> Of course, depending on how you answered the question at the top of my 
> e-mail, all of this might be moot: if you just want 
> BuiltinTypeLoc::getNameLoc() to point to some valid location for a 
> integral type described only by its width or sign, that's easy to fix 
> without any performance impact.
> 
>     - Doug
> 
> 

All considered, it is mainly a matter of deciding what to do.
To our eyes, the overhead seems not to be a real issue.

Cheers,
Enea Zaffanella.