[cfe-dev] Confusing comment on LexTokenInternal
clattner at apple.com
Mon Jul 6 22:10:09 PDT 2009
On Jul 6, 2009, at 2:10 PM, AlisdairM(public) wrote:
> I'm looking at how to add additional literals once Unicode types are
> supported, which naturally takes me to the LexTokenInternal method
> of the Lexer class. The comment describing the function talks about
> returning true/false, yet the function is void so cannot return anything.
> I suspect the API has evolved since the comment was written, but it
> makes me wonder about the validity of other parts of this comment,
> such as assumed null-termination of the buffer.
Everything else in that function's comment still holds. Buffers are
still required to be nul terminated.
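A minimal sketch of why a performance-sensitive lexer can benefit from a nul-terminated buffer: the terminator acts as a sentinel, so an inner scanning loop needs no separate end-of-buffer test. The helper names below (isIdentifierBody, identifierLength) are hypothetical and are not taken from Clang's Lexer.

```cpp
#include <cstddef>

// Hypothetical helper (not Clang's code): true for [A-Za-z0-9_].
inline bool isIdentifierBody(char c) {
  return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
         (c >= '0' && c <= '9') || c == '_';
}

// Counts the identifier characters at the start of `p`.
// Precondition (assumed, as in the thread): the buffer ends with '\0'.
inline std::size_t identifierLength(const char *p) {
  const char *start = p;
  // There is no `p != end` comparison here. '\0' is not an identifier
  // character, so the terminator stops the loop on its own.
  while (isIdentifierBody(*p))
    ++p;
  return static_cast<std::size_t>(p - start);
}
```

Without the guaranteed terminator, every such loop would need an extra bounds check per character.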
> As this code is clearly marked as performance sensitive, I am being
> quite careful before proposing changes for multiple string/character
> literal types. The simple drop-through for 'L' will no longer work
> as we have 10(!) possible literal prefixes in C++0x:
> Also, the R variants only apply to string literals, not character literals.
Ok, eww :).
> We must preserve enough info to continue parsing as an identifier if
> we do not find the ' or " character. In the odd case we find a '
> following an R I believe we parse the chars up to and including R as
> an identifier (maybe a macro) and start a fresh narrow-character
> literal token with the ' and no intervening whitespace.
> Yuck :(
This is probably after the first translation phase, so you also have
to think about newlines and trigraphs, double yuck :)
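The prefix-then-fallback decision discussed above can be sketched as a small classifier. This is an illustration under stated assumptions, not the Clang implementation: the prefix set (u8, u, U, L, each optionally followed by R) is assumed from the C++11 standard as later published, the rule that u8 does not apply to character literals is likewise an assumption, and the names TokKind, PrefixResult and classifyPrefix are hypothetical. It deliberately ignores trigraphs and escaped newlines, which a real lexer would have to handle.

```cpp
#include <cstddef>

// Hypothetical token kinds for this sketch only.
enum class TokKind { Identifier, CharLiteral, StringLiteral, RawStringLiteral };

struct PrefixResult {
  TokKind kind;
  std::size_t prefixLen; // characters before the opening quote (0 for Identifier)
};

// Classifies the start of a nul-terminated buffer. Every index read is safe
// because we only advance past characters already seen to be non-nul.
inline PrefixResult classifyPrefix(const char *p) {
  std::size_t i = 0;
  bool isU8 = false;
  // Optional encoding prefix: u8, u, U or L (assumed set).
  if (p[0] == 'u' && p[1] == '8') {
    isU8 = true;
    i = 2;
  } else if (p[0] == 'u' || p[0] == 'U' || p[0] == 'L') {
    i = 1;
  }
  // Optional raw marker.
  bool isRaw = false;
  if (p[i] == 'R') {
    isRaw = true;
    ++i;
  }
  if (p[i] == '"')
    return {isRaw ? TokKind::RawStringLiteral : TokKind::StringLiteral, i};
  // R (and, by assumption, u8) never prefixes a character literal.
  if (p[i] == '\'' && !isRaw && !isU8)
    return {TokKind::CharLiteral, i};
  // No quote followed: the caller lexes an ordinary identifier from p[0].
  return {TokKind::Identifier, 0};
}
```

For input such as R'x' this reports Identifier, so the caller would lex R as an identifier and then start a fresh narrow character literal at the quote, matching the odd case described in the quoted text.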
> I also notice that characters beyond the 7-bit ASCII range are
> deemed 'unknown' rather than potential identifiers. Ideally we
> should be checking for a valid UTF-8 sequence and allowing through
> as an identifier in that case (consuming the full glyph). This
> seems a prerequisite to adding UCN support for identifiers, as UCNs
> will (presumably) map to the equivalent UTF-8 sequence. However, I
> am not sure how much further investigation here should look into GCC
> ABI to describe mangling of identifiers with extended characters.
Yes, characters in the 128->255 range should definitely be considered
potential identifiers at some point (when we support unicode better).
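As a rough sketch of the "check for a valid UTF-8 sequence and consume it whole" idea, the function below returns the byte length of a well-formed non-ASCII UTF-8 sequence, or 0 if the bytes are not well formed. The name utf8SequenceLength is hypothetical, the byte ranges follow the Unicode Standard's table of well-formed UTF-8 sequences, and a nul-terminated buffer is assumed. It only validates the encoding; deciding which code points are actually allowed in identifiers is a separate question it does not address.

```cpp
#include <cstddef>

// Returns 2, 3 or 4 for a well-formed multi-byte UTF-8 sequence at `s`,
// or 0 for ASCII, nul, or any ill-formed sequence.
// Assumes `s` is nul-terminated: '\0' is never a continuation byte, so each
// check fails before reading past the terminator.
inline std::size_t utf8SequenceLength(const char *s) {
  const unsigned char *p = reinterpret_cast<const unsigned char *>(s);
  auto isCont = [](unsigned char c) { return (c & 0xC0) == 0x80; };
  unsigned char b0 = p[0];

  if (b0 >= 0xC2 && b0 <= 0xDF) // two bytes; C0 and C1 would be overlong
    return isCont(p[1]) ? 2 : 0;

  if (b0 >= 0xE0 && b0 <= 0xEF) { // three bytes
    if (!isCont(p[1])) return 0;
    if (b0 == 0xE0 && p[1] < 0xA0) return 0; // overlong encoding
    if (b0 == 0xED && p[1] > 0x9F) return 0; // UTF-16 surrogate range
    return isCont(p[2]) ? 3 : 0;
  }

  if (b0 >= 0xF0 && b0 <= 0xF4) { // four bytes
    if (!isCont(p[1])) return 0;
    if (b0 == 0xF0 && p[1] < 0x90) return 0; // overlong encoding
    if (b0 == 0xF4 && p[1] > 0x8F) return 0; // beyond U+10FFFF
    return (isCont(p[2]) && isCont(p[3])) ? 4 : 0;
  }

  return 0;
}
```

A lexer could call this when it sees a byte of 0x80 or above and, on a non-zero result, advance by that many bytes while treating the sequence as a candidate identifier character.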
I would strongly recommend decomposing the problems you're working on
into orthogonal pieces. Please attack unicode before (and
independently of) raw strings, or raw strings independently of
unicode. In LLVM and Clang, we strongly prefer incremental patches
that get us going in the right direction over massive patches that
implement an entire feature.