[cfe-dev] [PATCH] C++0x unicode string and character literals now with test cases
scshunt at csclub.uwaterloo.ca
Sun Jul 31 00:58:11 PDT 2011
On Sun, Jul 31, 2011 at 00:48, Eli Friedman <eli.friedman at gmail.com> wrote:

> > So I've got a couple questions.
> >
> > Is the lexer really the appropriate place to be doing this? Originally
> > CodeGenModule::GetStringForStringLiteral seemed like the thing I should be
> > modifying, but I discovered that the string literal's bytes had already been
> > zero extended by the time it got there. Would it be reasonable for the
> > StringLiteralParser to just produce a UTF-8 encoded internal representation
> > of the string and leave producing the final representation until later? I
> > think the main complication with that is that I'll have to encode UCNs with
> > their UTF-8 representation.
>
> Given the possibility of character escapes which can't be represented
> in UTF-8, I'm not sure we can...

What possibility is this? \UFFFFFFFF is far from valid, and no other
character escape can get anywhere near that high.

In previous discussions around this concept, I believe the general consensus
has been to use UTF-8 as the canonical encoding internally - this becomes
particularly important once we support universal-character-names inside
identifiers. If the input is to be in a different encoding, the driver
should be responsible for the conversion. The internals, including Lexer,
should be allowed to assume that the input is in UTF-8.
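
To make the UCN point concrete, here is a minimal sketch of encoding a
universal-character-name's code point as UTF-8. The helper names
isUnicodeScalarValue and appendUTF8 are hypothetical and are not taken from
the patch or from Clang; the sketch assumes that a value above U+10FFFF or in
the surrogate range is simply rejected, which is why something like
\UFFFFFFFF never needs a UTF-8 form.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper (not Clang's API): true if CodePoint is a Unicode
// scalar value, i.e. at most U+10FFFF and not a surrogate.
bool isUnicodeScalarValue(uint32_t CodePoint) {
  return CodePoint <= 0x10FFFF &&
         !(CodePoint >= 0xD800 && CodePoint <= 0xDFFF);
}

// Hypothetical helper: appends the UTF-8 encoding of CodePoint to Out.
// Returns false and leaves Out untouched if CodePoint cannot be encoded.
bool appendUTF8(uint32_t CodePoint, std::string &Out) {
  if (!isUnicodeScalarValue(CodePoint))
    return false;
  if (CodePoint < 0x80) {
    // One byte: 0xxxxxxx
    Out += static_cast<char>(CodePoint);
  } else if (CodePoint < 0x800) {
    // Two bytes: 110xxxxx 10xxxxxx
    Out += static_cast<char>(0xC0 | (CodePoint >> 6));
    Out += static_cast<char>(0x80 | (CodePoint & 0x3F));
  } else if (CodePoint < 0x10000) {
    // Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
    Out += static_cast<char>(0xE0 | (CodePoint >> 12));
    Out += static_cast<char>(0x80 | ((CodePoint >> 6) & 0x3F));
    Out += static_cast<char>(0x80 | (CodePoint & 0x3F));
  } else {
    // Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    Out += static_cast<char>(0xF0 | (CodePoint >> 18));
    Out += static_cast<char>(0x80 | ((CodePoint >> 12) & 0x3F));
    Out += static_cast<char>(0x80 | ((CodePoint >> 6) & 0x3F));
    Out += static_cast<char>(0x80 | (CodePoint & 0x3F));
  }
  return true;
}
```

With this, a UCN such as \u20AC would be stored as the three bytes E2 82 AC,
and the conversion to the literal's final encoding could happen later.
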
> > If a string literal includes some invalid bytes is the right thing to do
> > to just use the unicode replacement character (U+FFFD) and issue a warning?
> > This would mean that every byte in a string could require four bytes in the
> > internal representation, and it'd probably take a custom routine to do the
> > Unicode encoding.
>
> We probably want to issue an error if the encoding of the file isn't
> valid... it indicates the file is either messed up or isn't using the
> encoding we think it is.

Agreed, since it's the only sensible way to handle a failure of the
above-mentioned assumption that the input is UTF-8.
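
As a rough illustration of the check that would back such an error, the
following sketch scans a buffer and reports where the first ill-formed UTF-8
sequence starts. The function name findInvalidUTF8 is hypothetical and not
part of Clang; the sketch assumes that overlong forms, surrogates, and values
above U+10FFFF all count as invalid, and it leaves the diagnostic itself to
the caller.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not Clang's API): returns the byte offset of the
// first ill-formed UTF-8 sequence in Input, or Input.size() if the whole
// buffer is well-formed.
std::size_t findInvalidUTF8(const std::string &Input) {
  std::size_t I = 0;
  const std::size_t N = Input.size();
  while (I < N) {
    const unsigned char B0 = static_cast<unsigned char>(Input[I]);
    if (B0 < 0x80) { // ASCII, single byte
      ++I;
      continue;
    }
    std::size_t Len;
    uint32_t CodePoint, Min; // Min is the smallest value allowed at this length
    if ((B0 & 0xE0) == 0xC0) {
      Len = 2; CodePoint = B0 & 0x1F; Min = 0x80;
    } else if ((B0 & 0xF0) == 0xE0) {
      Len = 3; CodePoint = B0 & 0x0F; Min = 0x800;
    } else if ((B0 & 0xF8) == 0xF0) {
      Len = 4; CodePoint = B0 & 0x07; Min = 0x10000;
    } else {
      return I; // stray continuation byte or invalid lead byte
    }
    if (N - I < Len)
      return I; // sequence is truncated at the end of the buffer
    for (std::size_t K = 1; K < Len; ++K) {
      const unsigned char B = static_cast<unsigned char>(Input[I + K]);
      if ((B & 0xC0) != 0x80)
        return I; // expected a continuation byte (10xxxxxx)
      CodePoint = (CodePoint << 6) | (B & 0x3F);
    }
    // Reject overlong encodings, surrogates, and out-of-range values.
    if (CodePoint < Min || CodePoint > 0x10FFFF ||
        (CodePoint >= 0xD800 && CodePoint <= 0xDFFF))
      return I;
    I += Len;
  }
  return N;
}
```

A caller would compare the result against the buffer size and, on a mismatch,
report an error at that offset instead of substituting U+FFFD.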