[cfe-dev] Wide strings and clang::StringLiteral.

Neil Booth neil at daikokuya.co.uk
Fri Dec 5 04:41:45 PST 2008


Chris Lattner wrote:-

> Sounds great to me.  A disclaimer: I don't know anything about this
> stuff.  Neil, I'd very much appreciate validation that this approach
> makes sense :).
> 
> Here are some starting steps:
> 
> 1) Document StringLiteral as being canonicalized to UTF8.  We'll
> require sema to translate the input string to UTF8, and codegen and
> other clients to convert it to the character set they want.
> 2) Add -finput-charset to clang.  Is iconv generally available (e.g.
> on Windows)?  If not, we'll need some configury magic to detect it.
> 3) Teach sema about UTF8 input and iconv.  Sema should handle the  
> default cases (e.g. UTF8 and character sets where no "bad" things  
> occur) as quickly as possible, while falling back to iconv for hard  
> cases (or emitting an error if iconv isn't available).
> 4) Enhance the lexer, if required, to handle lexing strings properly.
> 5) Enhance codegen to translate into the execution char set.
> 6) Start working on character constants.
> 
> Does this seem reasonable Paolo (and Neil)?

It should work, but I expect it will break caret diagnostics.

There's no real need for such flexibility, though: the standard
doesn't permit UTF-16, UTF-32, etc., and I've never heard of anyone
wanting to use them.  So why not just require ASCII supersets, as
the standard does (for ASCII hosts)?  Then your caret diagnostics
keep working too, and special-casing the extra characters is
straightforward, even for SJIS.
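
As a rough illustration of the kind of special-casing SJIS needs (a
sketch only, not clang's lexer; sjis_is_lead_byte and find_closing_quote
are made-up names): in Shift-JIS the second byte of a two-byte character
can be 0x5C, the ASCII backslash, so a scanner looking for the end of a
string literal has to step over trail bytes rather than treat them as
escapes.  The lead-byte ranges used here (0x81-0x9F and 0xE0-0xFC) are
the usual ones including vendor extensions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: nonzero if c starts a two-byte Shift-JIS
   character (0x81-0x9F or 0xE0-0xFC). */
static int sjis_is_lead_byte(unsigned char c)
{
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* Hypothetical helper: s points just past the opening quote of a string
   literal and n is the number of bytes available.  Returns the index of
   the closing '"', or n if there is none. */
size_t find_closing_quote(const unsigned char *s, size_t n)
{
    size_t i = 0;

    assert(s != NULL);
    while (i < n) {
        if (sjis_is_lead_byte(s[i]))
            i += 2;             /* skip lead byte and its trail byte */
        else if (s[i] == '\\')
            i += 2;             /* skip backslash and the escaped byte */
        else if (s[i] == '"')
            return i;           /* unescaped closing quote */
        else
            i++;                /* plain single-byte character */
    }
    return n;
}
```

Since nothing is transcoded, the returned index is a byte offset into
the original source buffer, which is presumably what lets caret
positions keep lining up.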

The standard also requires input to be in the current locale; is
there any need to be more relaxed?  Realistically, all the source
has to be in the same charset, and that charset must be able to
represent the system headers.  You then just need to use mbtowc
in a few places.
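
For what it's worth, here is a minimal sketch of what one of those
mbtowc uses might look like: walking a string in the current locale's
charset one character at a time.  count_locale_chars is a made-up name
for illustration, and the sketch assumes the caller has already
selected the locale (for example with setlocale(LC_ALL, "")).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: count the characters in a NUL-terminated
   multibyte string encoded in the current locale's charset.
   Returns -1 if the string contains an invalid or incomplete sequence. */
long count_locale_chars(const char *s)
{
    size_t left;
    long count = 0;

    assert(s != NULL);
    left = strlen(s);

    mbtowc(NULL, NULL, 0);              /* reset any shift state */
    while (left > 0) {
        wchar_t wc;
        int len = mbtowc(&wc, s, left); /* decode one character */

        if (len < 0)
            return -1;                  /* not valid in this locale */
        if (len == 0)
            break;                      /* reached a NUL byte */
        s += len;
        left -= (size_t)len;
        count++;
    }
    return count;
}
```

The same loop shape works for anything that needs character boundaries
(skipping over a multibyte character inside a literal, say) while
leaving the source bytes themselves untouched.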

Neil.


