[cfe-dev] Fixits with multibyte chars

Mon Jul 16 12:57:18 PDT 2012

On Jul 16, 2012, at 11:56, Eli Friedman <eli.friedman at gmail.com> wrote:

> On Mon, Jul 16, 2012 at 11:35 AM, Richard Smith <richard at metafoo.co.uk> wrote:
>> How should we behave if the file contains a byte sequence which is not valid
>> UTF-8 (for instance, if arbitrary raw data is placed inside a raw string
>> literal)? For the machine-parsable form, I'd feel more comfortable with
>> bytes from the start of the line for this reason.
> 
> Source code which isn't valid UTF-8 is illegal, even in raw string
> literals.  That said, we allow it in some cases anyway, so we need to
> recover consistently.
> 
> We could define a consistent scheme for data which isn't UTF-8, but
> you're right, it might be easier to just use "bytes since the last
> newline".
> 
> -Eli

Three things about that:

(1) The C standard explicitly permits multibyte characters (of arbitrary encoding) in 5.2.1.2. C++ [lex.phases]p1.1 implies a similar idea (non-basic characters are conceptually mapped into the source character set using UCNs). So calling non-UTF-8 code "illegal" is somewhat arbitrary; I'm not sure we've ever documented this restriction, and we certainly don't enforce it.

(2) This is for fixits, which can appear within invalid code. I'm working on Unicode recovery in a private branch.

(3) Chris has said we assume UTF-8 for now.

> We want to assume that the input charset is in UTF8 for now.  If we ever add support for other code pages, we'll either do it by rewriting the entire buffer all at once ahead of time (this is really the only option if the input is in UTF16), or by doing something else crazy like pervasively making the lexer know about single-byte codepages.  Since we only support UTF8 for now, I'd just start there.

( http://llvm.org/bugs/show_bug.cgi?id=13178#c7 )

To fix the crasher, I'll change the human-readable output to use printed columns, computed with sys::locale::columnWidth as Ben suggested. I'll leave the parseable diagnostics as-is for now, but add a note to our manual that machine-parseable ranges are indexed in "bytes from the beginning of the line". We can revisit that at any time.
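
To make the difference between the two indices concrete, here is a minimal sketch of mapping a byte offset within a line to a printed column. It is not the Clang implementation: sequenceLength and printedColumn are hypothetical names, the well-formedness check is simplified (it does not reject overlong or surrogate encodings), and every code point is counted as one column, whereas a real width function such as sys::locale::columnWidth would also have to handle wide and zero-width characters. Bytes that do not start a well-formed UTF-8 sequence are assumed to occupy one column each, so the mapping stays defined on invalid input.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not part of Clang or LLVM.
// Returns the length in bytes of the UTF-8 sequence starting at line[i],
// or 1 if the bytes there do not form a complete, well-formed sequence.
// Simplified: overlong and surrogate encodings are not rejected.
static std::size_t sequenceLength(const std::string &line, std::size_t i) {
  unsigned char lead = static_cast<unsigned char>(line[i]);
  std::size_t len;
  if (lead >= 0xC2 && lead <= 0xDF)
    len = 2;
  else if (lead >= 0xE0 && lead <= 0xEF)
    len = 3;
  else if (lead >= 0xF0 && lead <= 0xF4)
    len = 4;
  else
    return 1; // ASCII, stray continuation byte, or invalid lead byte

  if (i + len > line.size())
    return 1; // sequence is cut off by the end of the line

  for (std::size_t k = 1; k < len; ++k) {
    unsigned char c = static_cast<unsigned char>(line[i + k]);
    if ((c & 0xC0) != 0x80)
      return 1; // expected a continuation byte (10xxxxxx)
  }
  return len;
}

// Hypothetical helper: converts a 0-based byte offset from the start of
// the line into a 0-based printed column, assuming one column per code
// point and one column per invalid byte.
std::size_t printedColumn(const std::string &line, std::size_t byteOffset) {
  std::size_t column = 0;
  std::size_t i = 0;
  while (i < byteOffset && i < line.size()) {
    i += sequenceLength(line, i);
    ++column;
  }
  return column;
}
```

For a line holding the bytes of "café x", the 'x' sits at byte offset 6 but printed column 5, which is exactly the mismatch that matters when placing a caret or fixit under multibyte text. The byte offset, by contrast, needs no decoding at all and stays well defined even when the line is not valid UTF-8.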

Jordan