[cfe-dev] Code Ranges of Tokens & AST Elements

Wed Aug 25 07:40:02 PDT 2010

On Aug 25, 2010, at 5:24 AM, Philipp Legrum wrote:

> Hello Clang Developers,
> 
> Working on a project that relies on source ranges printed out by the
> clang -ast-print-xml option, I stumbled over the following problem:
> 
> A statement
> 
>  x = y + SomeVariable;
> 
> yields
> 
>  x = y + SomeVariable;
>  ~~~~~~~~~
> 
> as range for the assignment.

Note how, if I force Clang to emit a diagnostic involving this expression, the diagnostics engine gets it right:

t.c:2:5: warning: incompatible pointer types assigning to 'int *' from 'float *'
  x = y + SomeVariable;
    ^ ~~~~~~~~~~~~~~~~

What you're seeing in the XML output is how source ranges are stored internally, and (not surprisingly) the XML writer isn't properly adjusting the end of the source range to make the resulting XML useful. Add it to the list of problems with the XML format, I guess.

> In general, the end locations of ranges seem to be a neglected child
> in Clang (as they seem to be in most compiler frontends). Quite often,
> SourceRange gets converted to SourceLocation and vice versa, losing
> the EndLocation or generating SourceRanges with no span,
> respectively.

Clang's source-range information is actually quite good, but you're misinterpreting the meaning of a range. Clang represents most source ranges by [first, last], where first and last each point to the beginning of their respective tokens. To map from this representation to a character-based representation, the 'last' location needs to be adjusted to point to (or past) the end of that token with, e.g., Lexer::MeasureTokenLength. Clients that need to map down to the character level (diagnostics, interfaces with the outside world) need to know to do this, and all of them do except the XML output.
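
A minimal sketch of that adjustment, as a toy model over plain character offsets rather than Clang's actual API. The names TokenRange, CharRange, toCharRange, and spelling are hypothetical, and measureTokenLength here is only a simplified stand-in for what Lexer::MeasureTokenLength does (it re-lexes identifiers and numbers and treats anything else as a one-character token):

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical stand-in for Lexer::MeasureTokenLength: returns the length
// of the token that begins at `offset`. Only identifiers/numbers are
// handled; any other character is treated as a one-character token.
std::size_t measureTokenLength(const std::string &src, std::size_t offset) {
  assert(offset < src.size());
  auto isWord = [](unsigned char c) { return std::isalnum(c) || c == '_'; };
  if (!isWord(static_cast<unsigned char>(src[offset])))
    return 1;
  std::size_t end = offset;
  while (end < src.size() && isWord(static_cast<unsigned char>(src[end])))
    ++end;
  return end - offset;
}

// Token range [first, last]: both offsets point at the START of a token.
struct TokenRange {
  std::size_t first;
  std::size_t last;
};

// Half-open character range [begin, end).
struct CharRange {
  std::size_t begin;
  std::size_t end;
};

// Move the end of the range past the last token to get a character range.
CharRange toCharRange(const std::string &src, TokenRange r) {
  assert(r.first <= r.last);
  return {r.first, r.last + measureTokenLength(src, r.last)};
}

// Text covered by a character range.
std::string spelling(const std::string &src, CharRange r) {
  return src.substr(r.begin, r.end - r.begin);
}
```

For the statement from the question, the token range {0, 8} (start of x, start of SomeVariable) maps to the character range [0, 20), which spells out x = y + SomeVariable. Reading the same token range as if it were character offsets gives only x = y + S, which is the short span reported above.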

There are rare cases in Clang itself where we need character-level information, such as Fix-Its for printf format specifiers. For those, we use CharSourceRange.
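
To illustrate why a token-based range is not enough there, here is a small sketch under stated assumptions: a string literal is a single token, so addressing the %d inside it requires character offsets. CharRange, findSpecifier, and applyFix are hypothetical names for this example, working on plain offsets into a string; this is not how Clang's CharSourceRange or its Fix-It machinery is implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Half-open character range [begin, end) into a source buffer.
// A toy analogue of a character-level range, not Clang's CharSourceRange.
struct CharRange {
  std::size_t begin;
  std::size_t end;
};

// Hypothetical helper: find the first occurrence of `spec` (e.g. "%d")
// inside the token that starts at `tokenStart` and spans `tokenLen`
// characters. Returns an empty range (begin == end) if it is not found
// within that token.
CharRange findSpecifier(const std::string &src, std::size_t tokenStart,
                        std::size_t tokenLen, const std::string &spec) {
  assert(tokenStart + tokenLen <= src.size());
  std::size_t pos = src.find(spec, tokenStart);
  if (pos == std::string::npos || pos + spec.size() > tokenStart + tokenLen)
    return {tokenStart, tokenStart};
  return {pos, pos + spec.size()};
}

// Replace the characters covered by `r` with `text`, as a Fix-It would.
std::string applyFix(std::string src, CharRange r, const std::string &text) {
  return src.replace(r.begin, r.end - r.begin, text);
}
```

The string literal token as a whole can be named by its start location, but the two characters to be replaced can only be named by a range whose endpoints fall inside that token.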

> 1.) In my opinion the root of all range-evilness of Clang is the constructor
>      SourceRange(SourceLocation loc) : B(loc), E(loc) {} (SourceLocation.h)
>    which should be removed because locations and ranges are
> conceptually different things.

This constructor is fine: it constructs a range that covers one token, as is common for declaration references and literals.

> 2.) What about Token featuring a ::getRange() instead of or in
> addition to a ::getLocation() method.

Well, it would just return the same thing as getLocation() does, but in SourceRange form. I guess one could add a convenience function to return a CharSourceRange, since tokens know their length, but I've never seen a use for it.
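
A rough sketch of the difference, using hypothetical toy types over plain offsets (Loc, TokenRange, CharRange, and this Token are made up for illustration and are not Clang's classes; getCharRange in particular is the convenience function speculated about above, not an existing method):

```cpp
#include <cassert>
#include <cstddef>

using Loc = std::size_t;  // toy source location: an offset into the buffer

// Token range: both ends point at the START of a token. The one-argument
// constructor yields begin == end, i.e. a range covering exactly one token.
struct TokenRange {
  Loc begin;
  Loc end;
  explicit TokenRange(Loc loc) : begin(loc), end(loc) {}
  TokenRange(Loc b, Loc e) : begin(b), end(e) {}
};

// Half-open character range [begin, end).
struct CharRange {
  Loc begin;
  Loc end;
};

// Toy token that knows where it starts and how long it is.
struct Token {
  Loc location;
  std::size_t length;

  Loc getLocation() const { return location; }

  // Same information as getLocation(), just in range form.
  TokenRange getRange() const { return TokenRange(location); }

  // Hypothetical convenience: character-level extent of the token.
  CharRange getCharRange() const { return {location, location + length}; }
};
```

For a token at offset 8 with length 12, getRange() gives begin == end == 8, which is not an empty span but a range naming that one token, while getCharRange() gives [8, 20).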

> While fixing a couple of cases that matter to me most, a ::getRange()
> has proven handy.

I'm guessing that those fixes weren't actually fixes, once you understand the (admittedly non-intuitive but actually quite wonderful) SourceRange model. If there are still fixes we'd love to hear about them!

> 3.) Would the correct usage of ranges instead of locations enlarge the
> memory footprint unacceptably?

I think this question is moot now, but in general: increasing the memory footprint is acceptable when it improves Clang and we can't find an efficient way to do the same thing without increasing the memory footprint.

> I think code ranges need a thorough overhaul which involves touching
> lots of source code. Are there any plans to get that fixed? What
> priority is that?

If you still think there's something that needs to change, I'd like to hear about it.

	- Doug