[cfe-dev] [PATCH] C++0x unicode string and character literals now with test cases

Seth Cantrell seth.cantrell at gmail.com
Sat Jul 30 23:47:57 PDT 2011


I'd like to make sure use cases such as the following work.

> #include <iostream>
> #include <string>
> #include <locale>
> #include <codecvt>
> 
> // we have to make the destructor public for wstring_convert
> template<typename I,typename E,typename S>
> class codecvt : public std::codecvt<I,E,S>
> {
> public:
> 	~codecvt() {}
> };
> 
> int main (void) {
> 	std::wstring_convert<codecvt<char16_t,char,mbstate_t>,char16_t> convert;
> 	std::cout << convert.to_bytes(u"β…―β…―β…ͺ πŸš€ 円归パターン")
> 		<< std::endl;
> }

To that end I took a stab at it today and came up with the following.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: unicode.patch
Type: application/octet-stream
Size: 3132 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20110731/eea0e10e/attachment.obj>


So I've got a couple of questions.

Is the lexer really the appropriate place to be doing this? Originally CodeGenModule::GetStringForStringLiteral seemed like the thing I should be modifying, but I discovered that the string literal's bytes had already been zero-extended by the time they got there. Would it be reasonable for the StringLiteralParser to just produce a UTF-8 encoded internal representation of the string and leave producing the final representation until later? I think the main complication with that is that I'll have to encode UCNs with their UTF-8 representation.
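To illustrate what I mean by encoding a UCN into the internal representation: once the \uXXXX or \UXXXXXXXX escape has been decoded to a code point, the parser would append its UTF-8 bytes directly. Something along these lines (the helper name is made up, not from the patch):

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: append the UTF-8 encoding of a single code point
// (e.g. one decoded from a \uXXXX or \UXXXXXXXX UCN) to a byte buffer.
static void appendCodePointAsUTF8(uint32_t CP, std::string &Buf) {
  if (CP < 0x80) {                       // 1 byte: 0xxxxxxx
    Buf += static_cast<char>(CP);
  } else if (CP < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
    Buf += static_cast<char>(0xC0 | (CP >> 6));
    Buf += static_cast<char>(0x80 | (CP & 0x3F));
  } else if (CP < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    Buf += static_cast<char>(0xE0 | (CP >> 12));
    Buf += static_cast<char>(0x80 | ((CP >> 6) & 0x3F));
    Buf += static_cast<char>(0x80 | (CP & 0x3F));
  } else {                               // 4 bytes, up to U+10FFFF
    assert(CP <= 0x10FFFF && "invalid code point");
    Buf += static_cast<char>(0xF0 | (CP >> 18));
    Buf += static_cast<char>(0x80 | ((CP >> 12) & 0x3F));
    Buf += static_cast<char>(0x80 | ((CP >> 6) & 0x3F));
    Buf += static_cast<char>(0x80 | (CP & 0x3F));
  }
}
```

For example, appendCodePointAsUTF8(0x1F680, S) appends the four bytes F0 9F 9A 80 (the rocket in my test case above).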

If a string literal includes some invalid bytes, is the right thing to do to substitute the Unicode replacement character (U+FFFD) and issue a warning? Since U+FFFD encodes as three bytes in UTF-8, every invalid byte in a string could triple in size in the internal representation, and it'd probably take a custom routine to do the Unicode conversion with that replacement behavior.
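Roughly what I have in mind for that custom routine (a sketch, not the patch itself; a real implementation would also have to reject overlong forms and surrogate code points, and hook into the diagnostics machinery rather than just counting):

```cpp
#include <cstdint>
#include <string>

// Sketch: copy well-formed UTF-8 sequences through unchanged and replace
// each ill-formed byte with U+FFFD (EF BF BD in UTF-8), counting the
// replacements so the caller can issue a warning.
static std::string sanitizeUTF8(const std::string &In, unsigned &Errors) {
  static const char Repl[] = "\xEF\xBF\xBD"; // UTF-8 encoding of U+FFFD
  std::string Out;
  Errors = 0;
  for (size_t I = 0, N = In.size(); I != N;) {
    unsigned char B = In[I];
    // Sequence length implied by the lead byte; 0 marks an invalid lead.
    unsigned Len = B < 0x80 ? 1 : (B & 0xE0) == 0xC0 ? 2
                 : (B & 0xF0) == 0xE0 ? 3 : (B & 0xF8) == 0xF0 ? 4 : 0;
    bool OK = Len != 0 && I + Len <= N;
    for (unsigned J = 1; OK && J < Len; ++J)
      OK = (static_cast<unsigned char>(In[I + J]) & 0xC0) == 0x80;
    if (OK) {
      Out.append(In, I, Len);
      I += Len;
    } else {
      // Note: a real routine must also reject overlongs and surrogates.
      Out += Repl;
      ++Errors;
      ++I; // resynchronize one byte at a time
    }
  }
  return Out;
}
```

This shows the worst case I was worried about: a string of all-invalid bytes triples in size.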

The patch uses an iconv-based function for converting between encodings because that's what I had lying around, but I don't think that's going to work for a real patch. Any recommendations as to what should be used instead?

I assume eventually someone will want source and execution charset configuration, but for now I'm content to assume the source is UTF-8 and that the execution character sets are UTF-8, UTF-16, and UTF-32, with the target's native endianness. Is that good enough for now?
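Concretely, for the UTF-16 execution charset the only wrinkles are surrogate pairs and the target byte order, which I'd handle along these lines (again just a sketch with made-up names):

```cpp
#include <cstdint>
#include <vector>

// Sketch: expand one code point into UTF-16 code units, then serialize
// the units in the target's native byte order.
static void appendUTF16(uint32_t CP, bool TargetLittleEndian,
                        std::vector<uint8_t> &Out) {
  auto Emit = [&](uint16_t U) {
    if (TargetLittleEndian) {
      Out.push_back(static_cast<uint8_t>(U & 0xFF));
      Out.push_back(static_cast<uint8_t>(U >> 8));
    } else {
      Out.push_back(static_cast<uint8_t>(U >> 8));
      Out.push_back(static_cast<uint8_t>(U & 0xFF));
    }
  };
  if (CP < 0x10000) {
    Emit(static_cast<uint16_t>(CP));      // BMP: single code unit
  } else {
    CP -= 0x10000;                        // split into a surrogate pair
    Emit(static_cast<uint16_t>(0xD800 | (CP >> 10)));   // high surrogate
    Emit(static_cast<uint16_t>(0xDC00 | (CP & 0x3FF))); // low surrogate
  }
}
```

UTF-32 is the same idea minus the surrogate logic, so endianness is the only per-target decision.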

Thanks,
Seth


On Jul 27, 2011, at 11:22 AM, Douglas Gregor wrote:

> 
> On Jul 26, 2011, at 11:07 PM, Craig Topper wrote:
> 
>> Here's the patch to update the support page.
> 
> r136216, thanks!
> 
> 	- Doug
> 
>> On Tue, Jul 26, 2011 at 10:41 PM, Douglas Gregor <dgregor at apple.com> wrote:
>>> 
>>> On Jul 26, 2011, at 8:39 AM, Craig Topper wrote:
>>> 
>>>> Here's a new patch addressing Chris' suggestions.
>>> 
>>> Great! I've committed this as r136210, thanks so much!
>>> 
>>>       - Doug
>>> 
>>>> 
>>>> On Mon, Jul 25, 2011 at 11:14 PM, Chris Lattner <clattner at apple.com> wrote:
>>>>> 
>>>>> On Jul 25, 2011, at 10:52 PM, Craig Topper wrote:
>>>>> 
>>>>>> Doh! I accidentally warned on UCNs larger than 0xffff in UTF-32
>>>>>> character literals. This patch fixes that.
>>>>> 
>>>>> Hi Craig,
>>>>> 
>>>>> Doug is definitely the best one to review the semantics of this.  Some minor comments:
>>>>> 
>>>>>  IdentifierInfo &get(StringRef Name, tok::TokenKind TokenCode) {
>>>>>    IdentifierInfo &II = get(Name);
>>>>> +    assert(TokenCode < 512 && "TokenCode too large");
>>>>>    II.TokenID = TokenCode;
>>>>>    return II;
>>>>> 
>>>>> I would suggest instead:
>>>>> 
>>>>>  IdentifierInfo &get(StringRef Name, tok::TokenKind TokenCode) {
>>>>>    IdentifierInfo &II = get(Name);
>>>>>    II.TokenID = TokenCode;
>>>>> +    assert(II.TokenID == TokenCode && "TokenCode too large");
>>>>>    return II;
>>>>> 
>>>>> to avoid tying the 9-bit bitfield size to the magic 512 constant.
>>>>> 
>>>>> 
>>>>> class StringLiteral : public Expr {
>>>>> ...
>>>>>  unsigned ByteLength;
>>>>> -  bool IsWide;
>>>>> +  StringKind Kind;
>>>>>  bool IsPascal;
>>>>>  unsigned NumConcatenated;
>>>>> 
>>>>> sizeof(StringKind) is likely to be 4, wasting space.  I'd suggest making it an 8 bit bitfield or something.
>>>>> 
>>>>> 
>>>>> In SemaDeclAttr etc:
>>>>> 
>>>>> -    if (Str == 0 || Str->isWide()) {
>>>>> +    if (Str == 0 || Str->getKind() != StringLiteral::Ascii) {
>>>>>      S.Diag(Attr.getLoc(), diag::err_attribute_argument_n_not_string)
>>>>>          << "weakref" << 1;
>>>>> 
>>>>> I'd suggest introducing the proper helper methods and using:
>>>>> 
>>>>> -    if (Str == 0 || Str->isWide()) {
>>>>> +    if (Str == 0 || !Str->isAscii()) {
>>>>>      S.Diag(Attr.getLoc(), diag::err_attribute_argument_n_not_string)
>>>>>          << "weakref" << 1;
>>>>> 
>>>>> In Lexer.cpp:
>>>>> 
>>>>> +    // treat U like the start of an identifier.
>>>>> +    goto StartIdentifier;
>>>>> 
>>>>> Instead of doing this, I'd replace the gotos with just "return LexIdentifier(Result, CurPtr);" since "MIOpt.ReadToken" has already been done.
>>>>> 
>>>>> I would suggest fusing IsIdentifierL and IsIdentifierUTFStringPrefix into one function.
>>>>> 
>>>>> Otherwise, LGTM!
>>>>> 
>>>>> -Chris
>>>>> 
>>>>> 
>>>>> 
>>>> <unicode_strings.patch>
>>> 
>>> 
>> 
>> 
>> 
>> -- 
>> ~Craig
>> <unicode_support.patch>
> 
> _______________________________________________
> cfe-dev mailing list
> cfe-dev at cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/cfe-dev
