[cfe-dev] Fixits with multibyte chars

Mon Jul 16 11:56:23 PDT 2012

On Mon, Jul 16, 2012 at 11:35 AM, Richard Smith <richard at metafoo.co.uk> wrote:
> On Mon, Jul 16, 2012 at 10:56 AM, Eli Friedman <eli.friedman at gmail.com>
> wrote:
>>
>> On Mon, Jul 16, 2012 at 10:49 AM, Jordan Rose <jordan_rose at apple.com>
>> wrote:
>> >
>> > On Jul 16, 2012, at 10:47 , Benjamin Kramer <benny.kra at gmail.com> wrote:
>> >
>> >>
>> >> On 16.07.2012, at 19:32, Jordan Rose <jordan_rose at apple.com> wrote:
>> >>
>> >>> Hi, everyone. We recently hit an assertion when trying to output a
>> >>> fixit with Unicode characters in it; it reduces down to this:
>> >>>
>> >>> #include <stdio.h>
>> >>>
>> >>> void test() {
>> >>>   printf("∆: %d", 1L);
>> >>> }
>> >>>
>> >>> I could of course just disable fixits when there are Unicode
>> >>> characters involved, but I'd like to fix this the right way. The trouble is
>> >>> -fdiagnostics-parseable-fixits, which is supposed to be machine-readable
>> >>> output: in this case, is a three-byte UTF-8 character three columns or
>> >>> one column? I think one column is the right way to go, but I wanted to get
>> >>> some other opinions before I start working on a patch.
>> >>
>> >> This actually depends on the system. On some systems we'll print the
>> >> Unicode code point in hex; on others the character is printed as-is and
>> >> takes one column. The llvm::sys::locale::columnWidth function gets this
>> >> information in a portable way.
>> >
>> > Okay, that gives us two problems, then…for user-visible fixits we can
>> > use llvm::sys::locale::columnWidth (thanks, Ben), but then
>> > -fdiagnostics-parseable-fixits will have different column numbers? Is that
>> > okay?
>> >
>> > (Currently -fdiagnostics-parseable-fixits counts columns in bytes rather
>> > than characters.)
>>
>> Machine-parsable "columns" are not the same as columns in the
>> terminal.  I'm pretty sure the model we want is one "column" per
>> Unicode code point, regardless of how it is displayed.
>
>
> How should we behave if the file contains a byte sequence which is not valid
> UTF-8 (for instance, if arbitrary raw data is placed inside a raw string
> literal)? For the machine-parsable form, I'd feel more comfortable with
> bytes from the start of the line for this reason.

Source code which isn't valid UTF-8 is illegal, even in raw string
literals.  That said, we allow it in some cases anyway, so we need to
recover consistently.

We could define a consistent scheme for data which isn't UTF-8, but
you're right, it might be easier to just use "bytes since the last
newline".

-Eli
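
To make the two machine-readable column models from this thread concrete, here is a minimal sketch that computes a 1-based column both ways: as bytes since the last newline, and as one column per Unicode code point. It is not Clang code. The names byteColumn, codePointColumn and utf8SequenceLength are hypothetical, and counting each invalid byte as one column is only one possible recovery scheme, assumed here for illustration. The UTF-8 check is also simplified.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not a Clang or LLVM API): returns the length in bytes
// of the UTF-8 sequence starting at line[i], or 0 if the bytes there do not
// form a plausible sequence. Simplified: it rejects stray continuation bytes,
// bad lead bytes and truncated sequences, but does not reject overlong
// 3/4-byte forms, surrogates, or values above U+10FFFF.
static std::size_t utf8SequenceLength(const std::string &line, std::size_t i) {
  unsigned char lead = static_cast<unsigned char>(line[i]);
  std::size_t len = 0;
  if (lead < 0x80)
    len = 1;
  else if (lead >= 0xC2 && lead <= 0xDF)
    len = 2;
  else if (lead >= 0xE0 && lead <= 0xEF)
    len = 3;
  else if (lead >= 0xF0 && lead <= 0xF4)
    len = 4;
  else
    return 0; // continuation byte or invalid lead byte

  if (i + len > line.size())
    return 0; // sequence is cut off by the end of the line

  for (std::size_t k = 1; k < len; ++k) {
    unsigned char c = static_cast<unsigned char>(line[i + k]);
    if ((c & 0xC0) != 0x80)
      return 0; // expected a continuation byte
  }
  return len;
}

// Model 1: "bytes since the last newline", reported as a 1-based column.
// byteOffset is the 0-based byte offset of the location within its line.
std::size_t byteColumn(std::size_t byteOffset) { return byteOffset + 1; }

// Model 2: one column per code point, reported as a 1-based column.
// Assumed recovery rule: every byte that is not part of a valid sequence
// counts as one column on its own.
std::size_t codePointColumn(const std::string &line, std::size_t byteOffset) {
  assert(byteOffset <= line.size());
  std::size_t column = 1;
  std::size_t i = 0;
  while (i < byteOffset) {
    std::size_t len = utf8SequenceLength(line, i);
    i += (len == 0 ? 1 : len);
    ++column;
  }
  return column;
}
```

For the text "∆: %d" (where ∆ is the three bytes E2 88 86), the '%' sits at byte offset 5, so byteColumn gives 6 while codePointColumn gives 4. With valid input a consumer can convert between the two, but for input that is not valid UTF-8 the code point count depends on whatever recovery rule both sides agree on, whereas the byte count needs no such agreement.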