[cfe-commits] r68975 - in /cfe/trunk: lib/CodeGen/CodeGenModule.cpp test/CodeGen/illegal-UTF8.m

Mon Apr 13 15:26:11 PDT 2009

steve naroff wrote:-

> > On Mon, Apr 13, 2009 at 12:08 PM, Steve Naroff <snaroff at apple.com>  
> > wrote:
> >> Author: snaroff
> >> Date: Mon Apr 13 14:08:08 2009
> >> New Revision: 68975
> >>
> >> URL: http://llvm.org/viewvc/llvm-project?rev=68975&view=rev
> >> Log:
> >> Fixed crasher in <rdar://problem/6780904> [irgen] Assertion failed:  
> >> (Result == conversionOK && "UTF-8 to UTF-16 conversion failed"),  
> >> function GetAddrOfConstantCFString, file CodeGenModule.cpp, line  
> >> 1063.
> >
> > We should not be letting invalid strings through Sema.  Either the
> > Lexer or Sema needs to deal with this; it needs to either error out or
> > warn and "fix" the string to use a 0xFFFD.
> >
> > I would suggest reverting this fix because it does nothing but hide  
> > the issue.
> >
> 
> Are you suggesting the following is illegal?
> 
> int main(int argc, char *argv[]) {
>    printf("\xff\xff___WAIT___\n");
> }
> 
> GCC accepts this and prints the following:
> 
> [steve-naroffs-imac-2:~/llvm/tools/clang] snaroff% ./a.out
> ??___WAIT___

\xff\xff denotes two characters, in both host and target form.  In
target form it is 2 successive characters, not 1, (narrow or wide)
each with value 255, regardless of target charset.  Because each
value is given directly as a hex escape, no charset conversion is
involved and so it cannot fail.
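
To make that concrete, here is a small sketch that inspects the
target form of the string from the example.  The names escaped,
leading_255, dump_values and show_escaped are mine, purely for
illustration, and I'm assuming 8-bit chars:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Each \xff escape names the value of one target character directly. */
const char escaped[] = "\xff\xff___WAIT___\n";

/* Count the leading characters of s whose value is 255
   (assumes CHAR_BIT == 8). */
size_t leading_255(const char *s)
{
    size_t n = 0;
    while ((unsigned char)s[n] == 0xFF)
        n++;
    return n;
}

/* Print the numeric value of every character up to the terminator. */
void dump_values(const char *s)
{
    for (; *s != '\0'; s++)
        printf("%u ", (unsigned)(unsigned char)*s);
    putchar('\n');
}

/* Check the shape of the literal, then show its values. */
void show_escaped(void)
{
    /* 2 escapes + 3 + 4 + 3 ordinary characters + newline = 13. */
    assert(strlen(escaped) == 13);
    assert(leading_255(escaped) == 2);
    dump_values(escaped);
}
```

Calling show_escaped() from main prints the character values; the
first two are 255 whatever the source or target charset is, since
the escapes bypass charset translation altogether.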

Conversion of source strings to target strings can only be done a
target character at a time, and it is dependent on their form in
the source string.  So "@@__WAIT__", where I've used @ to represent
a source character with value 255, may well (and probably should)
give different output than "\xff\xff", whose escapes describe
the value of target characters.  The @@ may represent one, two
or a partial target character, depending on the source charset.
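
As one illustration of that dependence, suppose (my assumption, not
anything clang or GCC is known to do here) the source charset is
ISO-8859-1 and the target charset is UTF-8.  Then a single source
character with value 255 becomes two target characters.  The
function latin1_to_utf8 below is a hypothetical helper, not an
existing API:

```c
#include <stddef.h>

/* Hypothetical translator: encode one ISO-8859-1 source character
   as UTF-8.  Writes 1 or 2 bytes to out and returns the count. */
size_t latin1_to_utf8(unsigned char src, unsigned char out[2])
{
    if (src < 0x80) {
        /* The ASCII range maps to itself. */
        out[0] = src;
        return 1;
    }
    /* U+0080..U+00FF need a two-byte UTF-8 sequence. */
    out[0] = (unsigned char)(0xC0 | (src >> 6));
    out[1] = (unsigned char)(0x80 | (src & 0x3F));
    return 2;
}
```

Under those assumptions a source @ (value 255) comes out as the two
target bytes 0xC3 0xBF, whereas the escape \xff always yields the
single value 255.  With a UTF-8 source charset the same byte is not
a valid character at all, which is the case the translator has to
diagnose or repair.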

If I'm saying something generally understood then I apologise for
the intrusion.  If someone wants to read about translating source
strings to target form (and it is quite a bit more subtle than it
appears at first) I highly recommend Clive Feather's document; it's
the only coherent description I know of what the semantics are
meant to be:

  http://www.open-std.org/JTC1/SC22/WG14/www/docs/n951.txt

Neil.