[cfe-dev] [Review Request] char16_t and char32_t character literals

Howard Hinnant hhinnant at apple.com
Sun May 22 14:34:40 PDT 2011


On May 22, 2011, at 5:22 PM, Sean Hunt wrote:

> On 11-05-22 06:10 AM, Yusaku Shiga wrote:
>> * TODO
>> 
>> (1) No Code Conversion.
>> At this point, only ASCII characters are available in the char16_t and
>> char32_t constants because I have not implemented code conversion logic.
>> I plan to fix the problem in the next patch to support char16_t and
>> char32_t string literals.
> 
> Clang needs full UTF-8 support, both inside and outside string-literals.
> Please bear this in mind when coding support for it, otherwise it will
> be just as much work to put full UTF-8 support in as it already is, and
> you'll have wasted effort.
> 
> The intended design is to convert universal-character-names to UTF-8
> internally, which we do not currently do.

If it helps, libc++ has conversions among all of UTF-8, UTF-16 and UTF-32 (well, UCS-4 actually).  They are in locale.cpp (http://llvm.org/svn/llvm-project/libcxx/trunk/src/locale.cpp).  Look for:

static
codecvt_base::result
utf16_to_utf8(const uint16_t* frm, const uint16_t* frm_end, const uint16_t*& frm_nxt,
              uint8_t* to, uint8_t* to_end, uint8_t*& to_nxt,
              unsigned long Maxcode = 0x10FFFF, codecvt_mode mode = codecvt_mode(0))

static
codecvt_base::result
utf16_to_utf8(const uint32_t* frm, const uint32_t* frm_end, const uint32_t*& frm_nxt,
              uint8_t* to, uint8_t* to_end, uint8_t*& to_nxt,
              unsigned long Maxcode = 0x10FFFF, codecvt_mode mode = codecvt_mode(0))

static
codecvt_base::result
utf8_to_utf16(const uint8_t* frm, const uint8_t* frm_end, const uint8_t*& frm_nxt,
              uint16_t* to, uint16_t* to_end, uint16_t*& to_nxt,
              unsigned long Maxcode = 0x10FFFF, codecvt_mode mode = codecvt_mode(0))

static
codecvt_base::result
utf8_to_utf16(const uint8_t* frm, const uint8_t* frm_end, const uint8_t*& frm_nxt,
              uint32_t* to, uint32_t* to_end, uint32_t*& to_nxt,
              unsigned long Maxcode = 0x10FFFF, codecvt_mode mode = codecvt_mode(0))

static
codecvt_base::result
ucs4_to_utf8(const uint32_t* frm, const uint32_t* frm_end, const uint32_t*& frm_nxt,
             uint8_t* to, uint8_t* to_end, uint8_t*& to_nxt,
             unsigned long Maxcode = 0x10FFFF, codecvt_mode mode = codecvt_mode(0))

static
codecvt_base::result
utf8_to_ucs4(const uint8_t* frm, const uint8_t* frm_end, const uint8_t*& frm_nxt,
             uint32_t* to, uint32_t* to_end, uint32_t*& to_nxt,
             unsigned long Maxcode = 0x10FFFF, codecvt_mode mode = codecvt_mode(0))

etc.
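
For anyone who just wants to see the shape of the UCS-4 to UTF-8 direction, here is a minimal sketch.  It is not the libc++ code: append_utf8 is a hypothetical name, it works on a std::string rather than the frm/to pointer ranges above, and it has no Maxcode or codecvt_mode handling.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper (not a libc++ function): append the UTF-8 encoding of
// one code point to `out`.  Returns false, leaving `out` untouched, for
// surrogate values and for anything above 0x10FFFF.
inline bool append_utf8(std::uint32_t cp, std::string& out)
{
    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // Surrogates are not valid code points on their own.
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return false;
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        return false;
    }
    return true;
}
```

Calling append_utf8(0x20AC, s) on an empty string should leave the three bytes E2 82 AC in s.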

Also I made myself this "cheat sheet" for summarizing the UTF encodings:

http://home.roadrunner.com/~hinnant/utf_summary.html
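
As a small taste of what such a summary covers, here is a sketch of how one code point becomes one or two UTF-16 code units.  This is an illustration only: encode_utf16 is a hypothetical name and is not taken from locale.cpp or from the page above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not a libc++ function): encode one code point as
// UTF-16 into out[0..1].  Returns the number of code units written, or 0
// if cp is a surrogate value or is above 0x10FFFF.
inline int encode_utf16(std::uint32_t cp, std::uint16_t out[2])
{
    if (cp < 0x10000) {
        // A lone surrogate value is not a valid code point.
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;
        out[0] = static_cast<std::uint16_t>(cp);
        return 1;
    }
    if (cp > 0x10FFFF)
        return 0;
    // Supplementary planes: split the 20-bit offset into a surrogate pair.
    cp -= 0x10000;
    out[0] = static_cast<std::uint16_t>(0xD800 | (cp >> 10));    // high surrogate
    out[1] = static_cast<std::uint16_t>(0xDC00 | (cp & 0x3FF));  // low surrogate
    return 2;
}
```

For example, U+1F600 should come out as the pair D83D DE00.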

If this stuff is helpful, great; if not, that's fine too.  I just didn't want it to be hidden.

Howard



