[cfe-dev] Wide strings and clang::StringLiteral.

Neil Booth neil at daikokuya.co.uk
Tue Dec 2 05:37:56 PST 2008


Chris Lattner wrote:-

> 
> On Nov 29, 2008, at 1:00 AM, Paolo Bolzoni wrote:
> 
> >
> > I need to convert string literals to another encoding. I was planning
> > to use iconv.h's functions, but I need to know the encoding of the
> > input strings.
> >
> > So the question is: what encoding do the strings returned by
> > clang::StringLiteral::getStrData() have, especially wide ones?
> 
> Hi Paolo,
> 
> I really have no idea.  We're just reading in the raw bytes from the  
> source file, so I guess it depends on whatever the source encoding  
> is.  In practice, this sounds like a really bad idea :).
> 
> Clang doesn't have any notion of an input character set at present,  
> and doesn't handle unicode escapes.  How do other compilers handle  
> input character sets?  Are there command line options to specify it?   
> Should the AST hold the string in a canonical form like UTF8?

Clang should have an idea of the encoding of its input, otherwise
it cannot reason about the characters that appear in a string
literal.  The standard imposes constraints on those characters,
and requires input source to be in the current locale.  Of course
this latter bit could be overridden with a command line switch.

Realistically I don't think there is much alternative to an internal
representation in some form of Unicode, or at least reasoning about
the input in Unicode.  This is essentially enforced by requiring
UCNs to be accepted.
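
To illustrate why UCNs push towards a Unicode internal form, here is a
minimal sketch that rewrites the \uXXXX and \UXXXXXXXX escapes in a
literal's spelling as UTF-8. The names expandUCNs, appendUtf8, hexValue
and isScalarValue are hypothetical and are not part of clang. The sketch
assumes UTF-8 as the internal form and ignores the standard's extra
restrictions on which code points a UCN may name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <string_view>

// Value of one hexadecimal digit, or -1 if c is not a hex digit.
static int hexValue(char c) {
  if (c >= '0' && c <= '9') return c - '0';
  if (c >= 'a' && c <= 'f') return c - 'a' + 10;
  if (c >= 'A' && c <= 'F') return c - 'A' + 10;
  return -1;
}

// True for code points that UTF-8 can encode (no surrogates, <= U+10FFFF).
static bool isScalarValue(std::uint32_t cp) {
  return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

// Append the UTF-8 encoding of cp to out. The caller validates cp first.
static void appendUtf8(std::uint32_t cp, std::string &out) {
  assert(isScalarValue(cp));
  if (cp < 0x80) {
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
}

// Replace each UCN in `spelling` with its UTF-8 encoding.
// Returns std::nullopt if a UCN is truncated, malformed, or out of range.
std::optional<std::string> expandUCNs(std::string_view spelling) {
  std::string out;
  std::size_t i = 0;
  while (i < spelling.size()) {
    char c = spelling[i];
    if (c == '\\' && i + 1 < spelling.size()) {
      char next = spelling[i + 1];
      if (next == 'u' || next == 'U') {
        std::size_t digits = (next == 'u') ? 4 : 8;
        if (i + 2 + digits > spelling.size()) return std::nullopt;
        std::uint32_t cp = 0;
        for (std::size_t k = 0; k < digits; ++k) {
          int v = hexValue(spelling[i + 2 + k]);
          if (v < 0) return std::nullopt;
          cp = (cp << 4) | static_cast<std::uint32_t>(v);
        }
        if (!isScalarValue(cp)) return std::nullopt;
        appendUtf8(cp, out);
        i += 2 + digits;
        continue;
      }
      // Some other escape such as \n or \\: copy both characters untouched.
      out += c;
      out += next;
      i += 2;
      continue;
    }
    out += c;
    ++i;
  }
  return out;
}
```

With this, the spelling caf\u00E9 and a UTF-8 source file containing
the same word with a literal e-acute produce the same bytes, which is
what lets later stages reason about the characters uniformly.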

As for execution charset, GCC's -fexec-charset seems a very reasonable
approach, with some kind of error character for characters not
representable in said charset.
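
The "error character" idea can be sketched as follows, assuming for the
sake of a self-contained example that the execution charset is
ISO-8859-1 and that the literal is already held as Unicode code points.
The names ExecCharsetResult and toLatin1 are hypothetical; a real
implementation would more likely go through iconv or a table per
charset.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Result of converting a literal to the execution charset.
struct ExecCharsetResult {
  std::string bytes;              // the converted literal
  std::size_t substitutions = 0;  // how many characters were replaced
};

// Convert code points to ISO-8859-1, substituting errorChar for anything
// the charset cannot represent. The count lets the caller emit a diagnostic.
ExecCharsetResult toLatin1(const std::u32string &codePoints,
                           char errorChar = '?') {
  // The error character must itself be representable (plain ASCII here).
  assert(static_cast<unsigned char>(errorChar) < 0x80);
  ExecCharsetResult result;
  for (char32_t cp : codePoints) {
    if (cp <= 0xFF) {
      // ISO-8859-1 maps U+0000..U+00FF directly onto byte values.
      result.bytes += static_cast<char>(static_cast<unsigned char>(cp));
    } else {
      result.bytes += errorChar;
      ++result.substitutions;
    }
  }
  return result;
}
```

A caller could check substitutions after the conversion and warn that
the literal contains characters not representable in the chosen
execution charset.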

Note that accepting UCNs in identifiers, as both C99 and C++ require,
also mandates converting identifiers to some kind of canonical Unicode
form internally, before hashing.
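
A minimal sketch of that point: both the UCN spelling and the UTF-8
spelling of an identifier are decoded to the same sequence of code
points, and only that canonical sequence is hashed. IdentifierTable,
canonicalIdentifier and hexDigit are hypothetical names, not clang's
actual classes. The sketch assumes UTF-8 source, does not reject
overlong encodings, and does no Unicode normalization.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>
#include <unordered_map>

// Value of one hexadecimal digit, or -1 if c is not a hex digit.
static int hexDigit(char c) {
  if (c >= '0' && c <= '9') return c - '0';
  if (c >= 'a' && c <= 'f') return c - 'a' + 10;
  if (c >= 'A' && c <= 'F') return c - 'A' + 10;
  return -1;
}

// Decode an identifier spelling (ASCII, UTF-8 bytes, or UCNs) to code points.
// Returns std::nullopt on malformed input.
std::optional<std::u32string> canonicalIdentifier(std::string_view s) {
  std::u32string out;
  std::size_t i = 0;
  while (i < s.size()) {
    unsigned char b = static_cast<unsigned char>(s[i]);
    if (b == '\\') {
      // Universal character name: \uXXXX or \UXXXXXXXX.
      if (i + 1 >= s.size()) return std::nullopt;
      std::size_t digits = s[i + 1] == 'u' ? 4 : s[i + 1] == 'U' ? 8 : 0;
      if (digits == 0 || i + 2 + digits > s.size()) return std::nullopt;
      char32_t cp = 0;
      for (std::size_t k = 0; k < digits; ++k) {
        int v = hexDigit(s[i + 2 + k]);
        if (v < 0) return std::nullopt;
        cp = (cp << 4) | static_cast<char32_t>(v);
      }
      out += cp;
      i += 2 + digits;
    } else if (b < 0x80) {
      out += static_cast<char32_t>(b);
      ++i;
    } else {
      // UTF-8 multibyte sequence: the lead byte gives the length.
      std::size_t len = (b & 0xE0) == 0xC0 ? 2
                      : (b & 0xF0) == 0xE0 ? 3
                      : (b & 0xF8) == 0xF0 ? 4 : 0;
      if (len == 0 || i + len > s.size()) return std::nullopt;
      assert(len >= 2 && len <= 4);
      char32_t cp = b & (0xFF >> (len + 1));  // payload bits of the lead byte
      for (std::size_t k = 1; k < len; ++k) {
        unsigned char t = static_cast<unsigned char>(s[i + k]);
        if ((t & 0xC0) != 0x80) return std::nullopt;  // not a continuation
        cp = (cp << 6) | (t & 0x3F);
      }
      out += cp;
      i += len;
    }
  }
  return out;
}

// Maps each distinct identifier to a small integer, keyed on canonical form.
class IdentifierTable {
  std::unordered_map<std::u32string, unsigned> ids_;

public:
  // Returns the identifier's id, or std::nullopt if the spelling is malformed.
  std::optional<unsigned> intern(std::string_view spelling) {
    std::optional<std::u32string> key = canonicalIdentifier(spelling);
    if (!key) return std::nullopt;
    unsigned next = static_cast<unsigned>(ids_.size());
    return ids_.try_emplace(*key, next).first->second;
  }
};
```

Hashing the raw spelling instead would give the two spellings of the
same identifier different table entries, which is the problem the
canonical form avoids.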

I've got some experience implementing all the above, so can give some
advice if necessary.

Neil.


