[cfe-dev] [Review Request] char16_t and char32_t character literals
Howard Hinnant
hhinnant at apple.com
Sun May 22 14:34:40 PDT 2011
On May 22, 2011, at 5:22 PM, Sean Hunt wrote:
> On 11-05-22 06:10 AM, Yusaku Shiga wrote:
>> * TODO
>>
>> (1) No Code Conversion.
>> At this point, only ASCII characters are available in char16_t and
>> char32_t constants because
>> I have not implemented code conversion logic. I plan to fix the problem
>> in the next patch, which adds support for
>> char16_t and char32_t string literals.
>
> Clang needs full UTF-8 support, both inside and outside string-literals.
> Please bear this in mind when coding support for it, otherwise it will
> be just as much work to put full UTF-8 support in as it already is, and
> you'll have wasted effort.
>
> The intended design is to convert universal-character-names to UTF-8
> internally, which we do not currently do.
If it helps, libc++ has conversions among UTF-8, UTF-16 and UTF-32 (well, UCS4 actually). They are in locale.cpp (http://llvm.org/svn/llvm-project/libcxx/trunk/src/locale.cpp). Look for:
static
codecvt_base::result
utf16_to_utf8(const uint16_t* frm, const uint16_t* frm_end, const uint16_t*& frm_nxt,
              uint8_t* to, uint8_t* to_end, uint8_t*& to_nxt,
              unsigned long Maxcode = 0x10FFFF, codecvt_mode mode = codecvt_mode(0))

static
codecvt_base::result
utf16_to_utf8(const uint32_t* frm, const uint32_t* frm_end, const uint32_t*& frm_nxt,
              uint8_t* to, uint8_t* to_end, uint8_t*& to_nxt,
              unsigned long Maxcode = 0x10FFFF, codecvt_mode mode = codecvt_mode(0))

static
codecvt_base::result
utf8_to_utf16(const uint8_t* frm, const uint8_t* frm_end, const uint8_t*& frm_nxt,
              uint16_t* to, uint16_t* to_end, uint16_t*& to_nxt,
              unsigned long Maxcode = 0x10FFFF, codecvt_mode mode = codecvt_mode(0))

static
codecvt_base::result
utf8_to_utf16(const uint8_t* frm, const uint8_t* frm_end, const uint8_t*& frm_nxt,
              uint32_t* to, uint32_t* to_end, uint32_t*& to_nxt,
              unsigned long Maxcode = 0x10FFFF, codecvt_mode mode = codecvt_mode(0))

static
codecvt_base::result
ucs4_to_utf8(const uint32_t* frm, const uint32_t* frm_end, const uint32_t*& frm_nxt,
             uint8_t* to, uint8_t* to_end, uint8_t*& to_nxt,
             unsigned long Maxcode = 0x10FFFF, codecvt_mode mode = codecvt_mode(0))

static
codecvt_base::result
utf8_to_ucs4(const uint8_t* frm, const uint8_t* frm_end, const uint8_t*& frm_nxt,
             uint32_t* to, uint32_t* to_end, uint32_t*& to_nxt,
             unsigned long Maxcode = 0x10FFFF, codecvt_mode mode = codecvt_mode(0))
etc.
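To give a feel for what a UCS4-to-UTF-8 step involves (for example, when turning a universal-character-name such as \u20AC into UTF-8 bytes), here is a minimal sketch. It is not the libc++ code: append_utf8 is a hypothetical helper invented for illustration, and unlike the functions above it handles a single code point, has no Maxcode or mode parameters, and reports errors with a plain bool.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper (not part of libc++ or Clang): append one Unicode
// code point to `out` as UTF-8. Returns false, leaving `out` untouched,
// for surrogate values (0xD800-0xDFFF) and for values above 0x10FFFF.
bool append_utf8(std::uint32_t cp, std::string& out)
{
    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return false;  // surrogates are not valid code points
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        return false;  // beyond the Unicode range
    }
    return true;
}
```

The libc++ functions differ in that they work on whole buffers and report partial or failed conversions through codecvt_base::result and the frm_nxt/to_nxt out-parameters.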
Also I made myself this "cheat sheet" for summarizing the UTF encodings:
http://home.roadrunner.com/~hinnant/utf_summary.html
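As one example of the kind of rule a UTF summary has to cover, here is the standard UTF-16 surrogate-pair encoding as a small sketch. This states the usual Unicode rule rather than quoting the page above, and encode_utf16 is a hypothetical name used only for illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not from libc++): encode one code point as UTF-16.
// Writes 1 or 2 code units to `out` and returns how many were written;
// returns 0 for surrogate values and for values above 0x10FFFF.
std::size_t encode_utf16(std::uint32_t cp, std::uint16_t out[2])
{
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;  // lone surrogates are not encodable
    if (cp < 0x10000) {
        // Basic Multilingual Plane: one code unit, value unchanged
        out[0] = static_cast<std::uint16_t>(cp);
        return 1;
    }
    if (cp > 0x10FFFF)
        return 0;  // beyond the Unicode range
    // Supplementary planes: split the 20-bit offset into two 10-bit halves
    std::uint32_t v = cp - 0x10000;
    out[0] = static_cast<std::uint16_t>(0xD800 + (v >> 10));    // high surrogate
    out[1] = static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF));  // low surrogate
    return 2;
}
```

With this sketch, U+20AC stays a single unit 0x20AC, while U+1F600 becomes the pair 0xD83D 0xDE00.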
If this stuff is helpful, great; if not, that's fine too. I just didn't want it to be hidden.
Howard