[cfe-commits] r68975 - in /cfe/trunk: lib/CodeGen/CodeGenModule.cpp test/CodeGen/illegal-UTF8.m

Tue Apr 14 07:19:45 PDT 2009

Neil,

Thanks a lot for your input on this (and the reference).

Much appreciated - I know you have vast experience with these issues.

snaroff

On Apr 13, 2009, at 6:26 PM, Neil Booth wrote:

> steve naroff wrote:-
>
>>> On Mon, Apr 13, 2009 at 12:08 PM, Steve Naroff <snaroff at apple.com>
>>> wrote:
>>>> Author: snaroff
>>>> Date: Mon Apr 13 14:08:08 2009
>>>> New Revision: 68975
>>>>
>>>> URL: http://llvm.org/viewvc/llvm-project?rev=68975&view=rev
>>>> Log:
>>>> Fixed crasher in <rdar://problem/6780904> [irgen] Assertion failed:
>>>> (Result == conversionOK && "UTF-8 to UTF-16 conversion failed"),
>>>> function GetAddrOfConstantCFString, file CodeGenModule.cpp, line
>>>> 1063.
>>>
>>> We should not be letting invalid strings through Sema.  Either the
>>> Lexer or Sema needs to deal with this; it needs to either error out
>>> or warn and "fix" the string to use a 0xFFFD.
>>>
>>> I would suggest reverting this fix because it does nothing but hide
>>> the issue.
>>>
>>
>> Are you suggesting the following is illegal?
>>
>> #include <stdio.h>
>>
>> int main(int argc, char *argv[]) {
>>   printf("\xff\xff___WAIT___\n");
>> }
>>
>> GCC accepts this and prints the following:
>>
>> [steve-naroffs-imac-2:~/llvm/tools/clang] snaroff% ./a.out
>> ??___WAIT___
>
> \xff\xff is a two-character host and target sequence.  In target
> form, it is 2 successive characters (narrow or wide), not 1, each
> with value 255, regardless of target charset.  Because each value is
> specified with a hex escape, the conversion cannot fail.
>
> Conversion of source strings to target strings can only be done a
> target character at a time, and it is dependent on their form in
> the source string.  So "@@__WAIT__", where I've used @ to represent
> a source character with value 255, may well (and probably should)
> give different output than "\xff\xff", whose escapes describe
> the value of target characters.  The @@ may represent one, two
> or a partial target character, depending on the source charset.
>
> If I'm saying something generally understood then I apologise for
> the intrusion.  If someone wants to read the (only?) coherent
> explanation of translating source strings to target form (and it is
> quite a bit more subtle than it appears at first) I highly recommend
> reading Clive Feather's document, which describes what the
> semantics are meant to be:
>
>  http://www.open-std.org/JTC1/SC22/WG14/www/docs/n951.txt
>
> Neil.