[cfe-dev] Clang's string type?

Thu Aug 23 19:39:25 PDT 2018

> You can read individual characters from that array using
> StringLiteral::getCodeUnit()

By "characters", do you mean a complete Unicode code point? I ask because
a code unit is just a byte in UTF-8, or a 16-bit short in UTF-16.

Let's say that I have a string in clang somewhere that is 🦁 (the Lion
emoji).

In UTF-32, which is the same as the Unicode scalar value, the Lion emoji
is U+1F981.

In UTF-8 the Lion emoji is encoded as the bytes 0xF0 0x9F 0xA6 0x81.

In UTF-16 the Lion emoji is encoded as the surrogate pair 0xD83E 0xDD81.

Serialized as UTF-16BE that is the bytes 0xD8 0x3E 0xDD 0x81, and as
UTF-16LE the bytes 0x3E 0xD8 0x81 0xDD.
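
To make the arithmetic concrete, here is a minimal sketch in standalone
C++ that derives the UTF-8 bytes and UTF-16 code units from U+1F981.
Nothing in it is Clang API; encodeUtf8 and encodeUtf16 are hypothetical
helper names used only for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Encode one Unicode scalar value as UTF-8 code units (bytes).
std::vector<std::uint8_t> encodeUtf8(std::uint32_t cp) {
  // Scalar values exclude the surrogate range and stop at U+10FFFF.
  assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
  std::vector<std::uint8_t> out;
  auto put = [&out](std::uint32_t v) {
    out.push_back(static_cast<std::uint8_t>(v));
  };
  if (cp < 0x80) {
    put(cp);                          // 1 byte: 0xxxxxxx
  } else if (cp < 0x800) {
    put(0xC0 | (cp >> 6));            // 2 bytes: 110xxxxx 10xxxxxx
    put(0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {
    put(0xE0 | (cp >> 12));           // 3 bytes
    put(0x80 | ((cp >> 6) & 0x3F));
    put(0x80 | (cp & 0x3F));
  } else {
    put(0xF0 | (cp >> 18));           // 4 bytes
    put(0x80 | ((cp >> 12) & 0x3F));
    put(0x80 | ((cp >> 6) & 0x3F));
    put(0x80 | (cp & 0x3F));
  }
  return out;
}

// Encode one Unicode scalar value as UTF-16 code units.
std::vector<std::uint16_t> encodeUtf16(std::uint32_t cp) {
  assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
  if (cp < 0x10000)
    return {static_cast<std::uint16_t>(cp)};
  // Above the BMP: split the 20-bit offset into a high and low surrogate.
  std::uint32_t v = cp - 0x10000;
  return {static_cast<std::uint16_t>(0xD800 | (v >> 10)),
          static_cast<std::uint16_t>(0xDC00 | (v & 0x3FF))};
}
```

Calling encodeUtf8(0x1F981) and encodeUtf16(0x1F981) reproduces the byte
and surrogate values listed for the Lion emoji.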

So what exactly does getCodeUnit() return?

For example, if I called it on a string that contained a surrogate pair,
would I get the Unicode scalar value?
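
If getCodeUnit() returns individual code units (an assumption here, going
by its name), the two halves of a surrogate pair would have to be combined
by the caller. A minimal sketch in standalone C++, where scalarsFromUtf16
is a hypothetical helper and a plain std::vector stands in for the
literal's code units:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Combine UTF-16 code units into Unicode scalar values.
// Unpaired surrogates are passed through unchanged, which is an arbitrary
// choice made for this sketch rather than required behavior.
std::vector<std::uint32_t>
scalarsFromUtf16(const std::vector<std::uint16_t> &units) {
  std::vector<std::uint32_t> out;
  for (std::size_t i = 0; i < units.size(); ++i) {
    std::uint32_t hi = units[i];
    bool isHigh = hi >= 0xD800 && hi <= 0xDBFF;
    if (isHigh && i + 1 < units.size()) {
      std::uint32_t lo = units[i + 1];
      if (lo >= 0xDC00 && lo <= 0xDFFF) {
        // Each surrogate carries 10 bits of the offset above U+10000.
        out.push_back(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00));
        ++i; // consume the low surrogate as well
        continue;
      }
    }
    out.push_back(hi);
  }
  return out;
}
```

Feeding it the pair 0xD83E 0xDD81 yields the single value 0x1F981.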

>
> Eli: "I don't follow; can't you just convert the format string from
> UTF-16/UTF-32 to UTF-8 before checking it?  (Granted, that's not
> particularly efficient, but it's rare enough that it probably doesn't
> matter.)"
>
>> I realized a bit after posting this that converting the format strings
>> from UTF-16/wchar_t to UTF-8 would probably be the best way to achieve
>> this, Eli.
>>
>
> I'm just not sure how I'd handle the type matching. Do you know when that
> happens relative to when the string/character literals are converted?
> Would the conversion get in the way, or get messed up?
>
>
> In the clang AST, a string literal is represented as an array of integers
> of the appropriate width; the lexer converts from UTF-8 to UTF-16 or UTF-32
> at the same time it resolves escapes.  (This is necessary to compute the
> length of the string, which is part of the string literal's type.)
>
> You can check the width of the characters in a string using
> StringLiteral::getCharByteWidth().  It's 1, 2 or 4, depending on whether
> it's UTF-8, UTF-16, or UTF-32.  You can read individual characters from
> that array using StringLiteral::getCodeUnit().  Or you can grab the whole
> array using StringLiteral::getBytes() (note that the StringRef return type
> is a bit misleading, since the data may be 2- or 4-byte code units).
>
> Actually, you might not want to use a real UTF-16 to UTF-8 conversion;
> maybe better to translate all non-ASCII bytes to 0xFF or something. Not
> that it really affects the parsing, but it probably makes translating back
> to a source location along the lines of StringLiteral::getLocationOfByte
> easier.
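
Eli's 0xFF suggestion could look roughly like the following sketch. It is
standalone C++ rather than Clang code: flattenToAscii is a hypothetical
name, and the input vector stands in for the values one would read from
the literal one code unit at a time. Because every code unit becomes
exactly one byte, index i in the result still refers to code unit i of
the literal, which is what should keep mapping back to a source location
simple.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Map each code unit to one byte: ASCII is kept as-is and every
// non-ASCII code unit is replaced by the placeholder byte 0xFF.
std::string flattenToAscii(const std::vector<std::uint32_t> &codeUnits) {
  std::string out;
  out.reserve(codeUnits.size());
  for (std::uint32_t unit : codeUnits)
    out.push_back(unit < 0x80 ? static_cast<char>(unit)
                              : static_cast<char>(0xFF));
  return out;
}
```

A format-string checker that only looks for ASCII characters such as '%'
and the conversion specifiers could then run over the flattened string
without a real transcoding step.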
>
> -Eli
>
> --
> Employee of Qualcomm Innovation Center, Inc.
> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project
>
>