Re: [PATCH] Let __attribute__((format(…))) accept OFStrings

Arthur O'Dwyer arthur.j.odwyer at gmail.com
Tue Nov 26 10:35:55 PST 2013


Agh, this pointless rabbit-hole was what I was trying to avoid with my
suggestion to take it off-list. :(

On Tue, Nov 26, 2013 at 10:31 AM, Jonathan Schleifer <js at webkeks.org> wrote:
> Am 26.11.2013 um 19:23 schrieb Jean-Daniel Dupas <devlists at shadowlab.org>:
>
>> That's a rather strange way to express it. UTF-16 is no more a workaround than UTF-32 or UTF-8. They are all first-class encodings.
>> Cocoa supports all Unicode planes and encodes them using UTF-16 (or even ASCII internally), which is generally far more space-efficient than UTF-32.
>>
>> FWIW, it is even possible to use emoji in constant NSStrings generated at compile time. So saying that Cocoa can only handle UCS-2 is plainly wrong.
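>>
>> For instance, a quick sketch (byte counts via -lengthOfBytesUsingEncoding:; the explicit-endian variants avoid a BOM):
>>
>> NSString *s = @"Hello, 😄";
>> // 7 ASCII characters plus one emoji (U+1F604)
>> NSLog(@"UTF-8:  %lu", (unsigned long)[s lengthOfBytesUsingEncoding: NSUTF8StringEncoding]);              // 11 bytes
>> NSLog(@"UTF-16: %lu", (unsigned long)[s lengthOfBytesUsingEncoding: NSUTF16LittleEndianStringEncoding]); // 18 bytes
>> NSLog(@"UTF-32: %lu", (unsigned long)[s lengthOfBytesUsingEncoding: NSUTF32LittleEndianStringEncoding]); // 32 bytes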
>
> How can a single unichar (which is typedef'd to unsigned short) store more than UCS-2? We are talking about the type for a single character (unichar vs. of_unichar_t) here. Strings internally use UTF-8 in ObjFW, but if you use characterAtIndex:, you get the whole character and not a surrogate. With Cocoa, you get a surrogate, because a single unichar can only hold a UCS-2 value. Try it yourself:
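>
> Roughly, simplified from the respective headers (the exact spellings may differ):
>
> typedef unsigned short unichar;  /* Cocoa: one UTF-16 code unit */
> typedef uint32_t of_unichar_t;   /* ObjFW: one whole code point */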
>
> [@"😄" length] returns 2 in Cocoa, but 1 in ObjFW, because the character is one of_unichar_t.
> [@"😄" characterAtIndex: 0] returns a surrogate in Cocoa. In ObjFW, it returns the single character 😄, because it fits into one of_unichar_t.
>
> Try this:
> NSLog(@"%C", [@"😄" characterAtIndex: 0]);
> It won't output 😄.
>
> OTOH with ObjFW this:
> of_log(@"%C", [@"😄" characterAtIndex: 0]);
> will output 😄.
>
> But in order to make this work, Clang must not assume that ObjFW is Cocoa and reject the format string on that basis.
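>
> For illustration, roughly what such a declaration would look like (hypothetical; the actual ObjFW header may differ):
>
> void of_log(OFConstantString *format, ...)
>     __attribute__((format(__NSString__, 1, 2)));
>
> If Clang checks that format string with Cocoa's semantics, it expects %C to take a unichar (one UTF-16 code unit), not an of_unichar_t.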
>
> And yet, the internal representation is not UTF-32 in ObjFW. So this has nothing to do with internal representation, but with how you export a single Unicode character; it's part of the API. And Cocoa decided to export a single Unicode character as surrogates if necessary, because a unichar is an unsigned short and 😄 doesn't fit.
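>
> (If you want the whole code point out of Cocoa, you have to reassemble the surrogate pair yourself. A sketch:)
>
> unichar hi = [@"😄" characterAtIndex: 0];  /* 0xD83D, high surrogate */
> unichar lo = [@"😄" characterAtIndex: 1];  /* 0xDE04, low surrogate */
> /* standard UTF-16 decoding: yields U+1F604 */
> uint32_t cp = 0x10000 + (((uint32_t)(hi - 0xD800) << 10) | (uint32_t)(lo - 0xDC00));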
>
> So where is what I said wrong? Cocoa can handle UTF-16, sure. But it can't hold a UCS-4 value in a single character.
>
> --
> Jonathan