[cfe-dev] Wide strings and clang::StringLiteral.

Tue Dec 2 06:28:56 PST 2008

On 2 Dec 2008, at 14:37, Neil Booth wrote:

> Chris Lattner wrote:-
>
>>
>> On Nov 29, 2008, at 1:00 AM, Paolo Bolzoni wrote:
>>
>>>
>>> I need to convert string literals to another encoding. I was
>>> planning to use the functions from iconv.h, but I need to know
>>> the encoding of the input strings.
>>>
>>> So the question is: what encoding do the strings returned by
>>> clang::StringLiteral::getStrData() have, especially wide ones?
>>
>> Hi Paolo,
>>
>> I really have no idea.  We're just reading in the raw bytes from the
>> source file, so I guess it depends on whatever the source encoding
>> is.  In practice, this sounds like a really bad idea :).
>>
>> Clang doesn't have any notion of an input character set at present,
>> and doesn't handle unicode escapes.  How do other compilers handle
>> input character sets?  Are there command line options to specify it?
>> Should the AST hold the string in a canonical form like UTF8?
>
> Clang should have an idea of the encoding of its input, otherwise
> it cannot reason about the characters that appear in a string
> literal.  The standard imposes constraints on those characters,
> and requires input source to be in the current locale.  Of course
> this latter bit could be overridden with a command line switch.
>
> Realistically I don't think there is much alternative to an internal
> representation in some form of Unicode, or at least reasoning about
> the input in Unicode.  This is essentially enforced by requiring
> UCNs to be accepted.
>
> As for execution charset, GCC's -fexec-charset seems a very reasonable
> approach, with some kind of error character for characters not
> representable in said charset.
>
> Note that accepting UCNs in identifiers, as both C99 and C++ require,
> mandates converting to some kind of canonical Unicode form for
> identifiers internally, before hashing, too.

I didn't know that C99 supports UCNs in identifiers.
I don't see much information about it in the C99 spec (except
that a UCN may appear in an identifier). Does this mean that this code
is valid?

---------- test.c -------

int main (int argc, char **argv) {
	int h\u00e9 = 0; // hé
	return he\u0301; // hé - using decomposed form
}
--------------------------

Actually, GCC does not support combining characters (such as COMBINING
ACUTE ACCENT, U+0301) in identifiers:

test.c:4:9: error: universal character \u0301 is not valid in an identifier
test.c: In function ‘main’:
test.c:4: error: ‘hé’ undeclared (first use in this function)
test.c:4: error: (Each undeclared identifier is reported only once
test.c:4: error: for each function it appears in.)

Note that the error message still displays the identifier correctly, though.
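
To make the two spellings in test.c concrete, here is a rough sketch of what "converting UCNs to a canonical form before hashing" could look like, assuming UTF-8 as the internal form. The helper names encode_utf8 and ucn_to_utf8 are made up for illustration and are not Clang or GCC functions. The sketch only handles well-formed \uXXXX escapes and does no Unicode normalization.

```c
#include <stdlib.h>
#include <string.h>

/* Encode one code point from the \uXXXX range (<= 0xFFFF) as UTF-8.
   Returns the number of bytes written to out. */
static size_t encode_utf8(unsigned long cp, char *out) {
    if (cp < 0x80) {
        out[0] = (char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    out[0] = (char)(0xE0 | (cp >> 12));
    out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (char)(0x80 | (cp & 0x3F));
    return 3;
}

/* Hypothetical helper: copy an identifier spelling from src to dst,
   replacing each \uXXXX escape with its UTF-8 encoding.
   Assumes the escapes are well formed (exactly four hex digits).
   dst needs at least strlen(src) + 1 bytes, since the UTF-8 form
   (at most 3 bytes) is never longer than the 6-byte escape. */
static void ucn_to_utf8(const char *src, char *dst) {
    while (*src) {
        if (src[0] == '\\' && src[1] == 'u' && strlen(src) >= 6) {
            char hex[5];
            memcpy(hex, src + 2, 4);
            hex[4] = '\0';
            dst += encode_utf8(strtoul(hex, NULL, 16), dst);
            src += 6;
        } else {
            *dst++ = *src++;
        }
    }
    *dst = '\0';
}
```

With this, the spelling h\u00e9 becomes the bytes 68 C3 A9, the same as a literal "hé" in a UTF-8 source file, so both would land on the same hash table entry. The spelling he\u0301 becomes 68 65 CC 81, a different byte sequence, so a byte-wise hash treats it as a different identifier unless some normalization step is added on top.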

> I've got some experience implementing all the above, so can give some
> advice if necessary.
>
> Neil.
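
As for the "error character" idea mentioned above for the execution charset, a minimal sketch could look like the following. It assumes the string is already held as Unicode code points and that the execution charset is ISO-8859-1. The function to_latin1 and its parameters are hypothetical, and this is not how GCC implements -fexec-charset.

```c
#include <stddef.h>

/* Hypothetical helper: convert n Unicode code points to ISO-8859-1.
   Code points that ISO-8859-1 cannot represent are replaced by subst.
   out must have room for n bytes.
   Returns how many code points were replaced. */
static size_t to_latin1(const unsigned long *cps, size_t n,
                        unsigned char *out, unsigned char subst) {
    size_t replaced = 0;
    for (size_t i = 0; i < n; ++i) {
        if (cps[i] <= 0xFF) {
            /* ISO-8859-1 coincides with the first 256 code points. */
            out[i] = (unsigned char)cps[i];
        } else {
            out[i] = subst;
            ++replaced;
        }
    }
    return replaced;
}
```

For example, the code points for 'h', U+00E9 and U+20AC (EURO SIGN) with '?' as the substitute give the bytes 68 E9 3F and a return value of 1, which a compiler could use to decide whether to emit a diagnostic.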
