[cfe-dev] [PATCH] C++0x unicode string and character literals now with test cases
Seth Cantrell
seth.cantrell at gmail.com
Sat Jul 30 23:47:57 PDT 2011
I'd like to make sure use cases such as the following work.
> #include <iostream>
> #include <string>
> #include <locale>
> #include <codecvt>
>
> // we have to make the destructor public for wstring_convert
> template<typename I, typename E, typename S>
> class codecvt : public std::codecvt<I,E,S>
> {
> public:
>     ~codecvt() {}
> };
>
> int main (void) {
>     std::wstring_convert<codecvt<char16_t,char,mbstate_t>,char16_t> convert;
>     std::cout << convert.to_bytes(u"β―β―βͺ π εε½’γγΏγΌγ³")
>               << std::endl;
> }
To that end, I took a stab at it today and came up with the following.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: unicode.patch
Type: application/octet-stream
Size: 3132 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20110731/eea0e10e/attachment.obj>
-------------- next part --------------
So I've got a couple of questions.
Is the lexer really the appropriate place to be doing this? Originally CodeGenModule::GetStringForStringLiteral seemed like the thing I should be modifying, but I discovered that the string literal's bytes had already been zero-extended by the time they got there. Would it be reasonable for the StringLiteralParser to just produce a UTF-8-encoded internal representation of the string and leave producing the final representation until later? I think the main complication with that is that I'll have to encode UCNs with their UTF-8 representation.
If a string literal includes some invalid bytes, is the right thing to do to just substitute the Unicode replacement character (U+FFFD) and issue a warning? This would mean that every byte in a string could require three bytes (the UTF-8 encoding of U+FFFD) in the internal representation, and it'd probably take a custom routine to do the Unicode encoding.
The patch uses a function for converting between encodings based on iconv, because that's what I had lying around, but I don't think that's going to work for a real patch. Any recommendations as to what should be used instead?
I assume eventually someone will want source and execution character set configuration, but for now I'm content to assume the source is UTF-8 and that the execution character sets are UTF-8, UTF-16, and UTF-32, with the target's native endianness. Is that good enough for now?
Thanks,
Seth
On Jul 27, 2011, at 11:22 AM, Douglas Gregor wrote:
>
> On Jul 26, 2011, at 11:07 PM, Craig Topper wrote:
>
>> Here's the patch to update the support page.
>
> r136216, thanks!
>
> - Doug
>
>> On Tue, Jul 26, 2011 at 10:41 PM, Douglas Gregor <dgregor at apple.com> wrote:
>>>
>>> On Jul 26, 2011, at 8:39 AM, Craig Topper wrote:
>>>
>>>> Here's a new patch addressing Chris' suggestions.
>>>
>>> Great! I've committed this as r136210, thanks so much!
>>>
>>> - Doug
>>>
>>>>
>>>> On Mon, Jul 25, 2011 at 11:14 PM, Chris Lattner <clattner at apple.com> wrote:
>>>>>
>>>>> On Jul 25, 2011, at 10:52 PM, Craig Topper wrote:
>>>>>
>>>>>> Doh! I accidentally warned on UCNs larger than 0xffff in UTF-32
>>>>>> character literals. This patch fixes that.
>>>>>
>>>>> Hi Craig,
>>>>>
>>>>> Doug is definitely the best one to review the semantics of this. Some minor comments:
>>>>>
>>>>> IdentifierInfo &get(StringRef Name, tok::TokenKind TokenCode) {
>>>>> IdentifierInfo &II = get(Name);
>>>>> + assert(TokenCode < 512 && "TokenCode too large");
>>>>> II.TokenID = TokenCode;
>>>>> return II;
>>>>>
>>>>> I would suggest instead:
>>>>>
>>>>> IdentifierInfo &get(StringRef Name, tok::TokenKind TokenCode) {
>>>>> IdentifierInfo &II = get(Name);
>>>>> II.TokenID = TokenCode;
>>>>> + assert(II.TokenID == TokenCode && "TokenCode too large");
>>>>> return II;
>>>>>
>>>>> to avoid tying the '9' bit bitfield size to the magic 512 constant.
>>>>>
>>>>>
>>>>> class StringLiteral : public Expr {
>>>>> ...
>>>>> unsigned ByteLength;
>>>>> - bool IsWide;
>>>>> + StringKind Kind;
>>>>> bool IsPascal;
>>>>> unsigned NumConcatenated;
>>>>>
>>>>> sizeof(StringKind) is likely to be 4, wasting space. I'd suggest making it an 8 bit bitfield or something.
>>>>>
>>>>>
>>>>> In SemaDeclAttr etc:
>>>>>
>>>>> - if (Str == 0 || Str->isWide()) {
>>>>> + if (Str == 0 || Str->getKind() != StringLiteral::Ascii) {
>>>>> S.Diag(Attr.getLoc(), diag::err_attribute_argument_n_not_string)
>>>>> << "weakref" << 1;
>>>>>
>>>>> I'd suggest introducing the proper helper methods and using:
>>>>>
>>>>> - if (Str == 0 || Str->isWide()) {
>>>>> + if (Str == 0 || !Str->isAscii()) {
>>>>> S.Diag(Attr.getLoc(), diag::err_attribute_argument_n_not_string)
>>>>> << "weakref" << 1;
>>>>>
>>>>> In Lexer.cpp:
>>>>>
>>>>> + // treat U like the start of an identifier.
>>>>> + goto StartIdentifier;
>>>>>
>>>>> Instead of doing this, I'd replace the gotos with just "return LexIdentifier(Result, CurPtr);" since "MIOpt.ReadToken" has already been done.
>>>>>
>>>>> I would suggest fusing IsIdentifierL and IsIdentifierUTFStringPrefix into one function.
>>>>>
>>>>> Otherwise, LGTM!
>>>>>
>>>>> -Chris
>>>>>
>>>>>
>>>>>
>>>> <unicode_strings.patch>
>>>
>>>
>>
>>
>>
>> --
>> ~Craig
>> <unicode_support.patch>
>
> _______________________________________________
> cfe-dev mailing list
> cfe-dev at cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/cfe-dev