[cfe-dev] Wide strings and clang::StringLiteral.
Jean-Daniel Dupas
devlists at shadowlab.org
Tue Dec 2 06:28:56 PST 2008
On 2 Dec 2008, at 14:37, Neil Booth wrote:
> Chris Lattner wrote:-
>
>>
>> On Nov 29, 2008, at 1:00 AM, Paolo Bolzoni wrote:
>>
>>>
>>> I need to convert the string literals to another encoding. I was
>>> planning to use iconv.h's functions, but I need to know the
>>> encoding of the input strings.
>>>
>>> So the question is: what encoding do the strings returned by
>>> clang::StringLiteral::getStrData() have, especially the wide ones?
>>
>> Hi Paolo,
>>
>> I really have no idea. We're just reading in the raw bytes from the
>> source file, so I guess it depends on whatever the source encoding
>> is. In practice, this sounds like a really bad idea :).
>>
>> Clang doesn't have any notion of an input character set at present,
>> and doesn't handle unicode escapes. How do other compilers handle
>> input character sets? Are there command line options to specify it?
>> Should the AST hold the string in a canonical form like UTF8?
>
> Clang should have an idea of the encoding of its input, otherwise
> it cannot reason about the characters that appear in a string
> literal. The standard imposes constraints on those characters,
> and requires input source to be in the current locale. Of course
> this latter bit could be overridden with a command line switch.
>
> Realistically I don't think there is much alternative to an internal
> representation in some form of Unicode, or at least reasoning about
> the input in Unicode. This is essentially enforced by requiring
> UCNs to be accepted.
>
> As for execution charset, GCC's -fexec-charset seems a very reasonable
> approach, with some kind of error character for characters not
> representable in said charset.
>
> Note that accepting UCNs in identifiers, as both C99 and C++ require,
> mandates converting to some kind of canonical Unicode form for
> identifiers internally, before hashing, too.
I didn't know that C99 supports UCNs in identifiers.
I don't see much information about it in the C99 spec (except
that UCNs may appear in an identifier). Does this mean that this code
is valid?
---------- test.c -------
int main (int argc, char **argv) {
int h\u00e9 = 0; // hé
return he\u0301; // hé - using decomposed form
}
--------------------------
Actually, GCC does not accept combining characters (such as COMBINING
ACUTE ACCENT, U+0301) in identifiers:
test.c:4:9: error: universal character \u0301 is not valid in an
identifier
test.c: In function ‘main’:
test.c:4: error: ‘hé’ undeclared (first use in this function)
test.c:4: error: (Each undeclared identifier is reported only once
test.c:4: error: for each function it appears in.)
Note that the identifier is displayed correctly in the error message anyway.
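
To make the concern concrete, here is a minimal sketch (not Clang or GCC code) of what happens if a compiler simply expands \uXXXX escapes to UTF-8 without any normalization: the two spellings from test.c above end up as different byte strings, so they would hash as different identifiers. The helper names utf8_encode and ucn_to_utf8 are hypothetical, and only the four-digit \u form is handled (no \UXXXXXXXX, no check of which code points are allowed in identifiers).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Write the UTF-8 encoding of code point cp to out and return the number
   of bytes written. Only code points up to U+FFFF are handled, which is
   all a four-digit \uXXXX escape can express. */
static size_t utf8_encode(unsigned cp, char *out) {
    assert(cp <= 0xFFFF);
    if (cp < 0x80) {
        out[0] = (char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    out[0] = (char)(0xE0 | (cp >> 12));
    out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (char)(0x80 | (cp & 0x3F));
    return 3;
}

/* Hypothetical helper: copy an identifier spelling to out, replacing each
   \uXXXX escape with its UTF-8 encoding. out needs room for at least
   strlen(spelling) + 1 bytes, since a 6-character escape becomes at most
   3 bytes. Returns 0 on success, -1 on a malformed escape. */
static int ucn_to_utf8(const char *spelling, char *out) {
    while (*spelling) {
        if (spelling[0] == '\\' && spelling[1] == 'u') {
            unsigned cp = 0;
            for (int i = 2; i < 6; i++) {
                char c = spelling[i];
                unsigned digit;
                if (c >= '0' && c <= '9')
                    digit = (unsigned)(c - '0');
                else if (c >= 'a' && c <= 'f')
                    digit = (unsigned)(c - 'a') + 10;
                else if (c >= 'A' && c <= 'F')
                    digit = (unsigned)(c - 'A') + 10;
                else
                    return -1; /* also stops at a premature end of string */
                cp = cp * 16 + digit;
            }
            out += utf8_encode(cp, out);
            spelling += 6;
        } else {
            *out++ = *spelling++;
        }
    }
    *out = '\0';
    return 0;
}
```

Feeding it the spellings `h\u00e9` and `he\u0301` and comparing the results with strcmp() shows that they differ, even though both render as "hé". Treating them as the same identifier would need an explicit normalization step on top of this.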
> I've got some experience implementing all the above, so can give some
> advice if necessary.
>
> Neil.