[cfe-dev] Fixits with multibyte chars

Mon Jul 16 13:55:18 PDT 2012

On Mon, Jul 16, 2012 at 12:57 PM, Jordan Rose <jordan_rose at apple.com> wrote:
>
> On Jul 16, 2012, at 11:56 , Eli Friedman <eli.friedman at gmail.com> wrote:
>
> On Mon, Jul 16, 2012 at 11:35 AM, Richard Smith <richard at metafoo.co.uk>
> wrote:
>
> How should we behave if the file contains a byte sequence which is not valid
> UTF-8 (for instance, if arbitrary raw data is placed inside a raw string
> literal)? For the machine-parsable form, I'd feel more comfortable with
> bytes from the start of the line for this reason.
>
>
> Source code which isn't valid UTF-8 is illegal, even in raw string
> literals.  That said, we allow it in some cases anyway, so we need to
> recover consistently.
>
> We could define a consistent scheme for data which isn't UTF-8, but
> you're right, it might be easier to just use "bytes since the last
> newline".
>
> -Eli
>
>
> Three things about that:
>
> (1) The C standard explicitly permits multibyte characters (of arbitrary
> encoding) in 5.2.1.2. C++ [lex.phases]p1.1 implies a similar idea (non-basic
> characters are conceptually mapped into the source character set using
> UCNs). So saying non-UTF-8 code is "illegal" is sort of arbitrary; I'm not
> sure if we've ever documented this restriction, but we certainly don't
> enforce it.

"Illegal", in the sense that clang makes the implementation-defined
decision to unconditionally interpret source code as UTF-8.  And as
far as I know, the only place where clang will silently accept bytes
which don't form valid UTF-8 codepoints is in comments.

> (2) This is for fixits, which can appear within invalid code. I'm working on
> Unicode recovery in a private branch.

Right.... I wasn't arguing we can just ignore them.
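
To illustrate the "bytes since the last newline" scheme mentioned above,
here is a minimal sketch. bytesSinceLastNewline is a hypothetical helper
written for this mail, not an existing clang API. It only looks for '\n'
bytes, so the result is well defined even when the line contains bytes
that are not valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of clang): returns the number of bytes
// between the start of the line containing `offset` and `offset` itself.
// The count is 0-based and never decodes the bytes, so invalid UTF-8 on
// the line cannot affect the answer.
std::size_t bytesSinceLastNewline(const std::string &buf, std::size_t offset) {
  if (offset > buf.size())
    offset = buf.size();          // clamp out-of-range offsets to end of buffer
  if (offset == 0)
    return 0;                     // first byte of the buffer starts a line
  // Search backwards for a newline strictly before `offset`.
  std::size_t nl = buf.rfind('\n', offset - 1);
  if (nl == std::string::npos)
    return offset;                // no earlier newline: still on the first line
  return offset - nl - 1;         // bytes after that newline, up to `offset`
}
```

For a line starting with a two-byte character such as "\xC3\xA9 = 1;",
the '=' is reported at byte column 3, even though it is the third
character on the line (0-based character index 2). That gap is exactly
why the machine-parsable form has to say which unit it counts in.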

-Eli