<div dir="ltr"><div class="gmail_default" style="font-size:large">> <span style="font-size:small">You can read individual characters from that array using StringLiteral::getCodeUnit()</span></div><div class="gmail_default" style="font-size:large"><span style="font-size:small"><br></span></div><div class="gmail_default" style="font-size:large"><span style="font-size:small">By "characters" do you mean a complete Unicode code point? I ask because a code unit is just a byte for UTF-8, or a 16-bit value for UTF-16.</span></div><div class="gmail_default" style="font-size:large"><span style="font-size:small"><br></span></div><div class="gmail_default" style="font-size:large"><span style="font-size:small">Let's say that I have a string in clang somewhere that is </span><span style="font-size:small">🦁 (the lion emoji).</span></div><div class="gmail_default" style="font-size:large"><span style="font-size:small"><br></span></div><div class="gmail_default" style="font-size:large"><span style="font-size:small">In UTF-32, i.e. as a Unicode scalar value, the lion emoji is U+1F981.</span></div><div class="gmail_default" style="font-size:large"><span style="font-size:small"><br></span></div><div class="gmail_default"><span style="font-size:small">In UTF-8 the lion emoji would be encoded as: 0xF0 0x9F 0xA6 0x81.</span></div><div class="gmail_default"><span style="font-size:small"><br></span></div><div class="gmail_default"><span style="font-size:small">In UTF-16 the lion emoji would be encoded as the surrogate pair 0xD83E 0xDD81, stored as the bytes 0xD8 0x3E 0xDD 0x81 in UTF-16BE.</span></div><div class="gmail_default"><span style="font-size:small"><br></span></div><div class="gmail_default"><span style="font-size:small">In UTF-16LE the code units are the same, but they are stored as the bytes 0x3E 0xD8 0x81 0xDD.</span></div><div class="gmail_default"><span style="font-size:small"><br></span></div><div class="gmail_default"><span style="font-size:small">So what exactly does getCodeUnit() return?</span></div><div class="gmail_default" style="font-size:large"><span 
style="font-size:small"><br></span></div><div class="gmail_default" style="font-size:large"><span style="font-size:small">For example, if I did that on a string that had a Surrogate Pair, would I get the Unicode Scalar Value?</span></div><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex"><div bgcolor="#FFFFFF"><blockquote type="cite"><div dir="ltr"><div class="gmail_quote">
<div><br>
</div>
<div style="font-size:large">Eli: "<span style="font-size:small">I don't follow; can't you just
convert the format string from UTF-16/UTF-32 to UTF-8
before checking it? (Granted, that's not particularly
efficient, but it's rare enough that it probably doesn't
matter.)</span>"</div>
<div><br>
</div>
<div> </div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex">
<div style="font-size:large;display:inline">I realized a bit after posting
this that converting the format strings from UTF-16/wchar
to UTF-8 would probably be the best way to achieve this,
Eli.</div>
</blockquote>
<div><br>
</div>
<div style="font-size:large">I'm just
not sure how I'd handle the type matching. Do you know when
that happens relative to when the string/character
literals are converted? Would that get in the way, or
get messed up?</div>
</div>
</div>
</blockquote>
<br>
In the clang AST, a string literal is represented as an array of
integers of the appropriate width; the lexer converts from UTF-8 to
UTF-16 or UTF-32 at the same time it resolves escapes. (This is
necessary to compute the length of the string, which is part of the
string literal's type.)<br>
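<br>
To make the point about length concrete, here is a small standalone sketch (plain C++, not clang internals) that reads the array bound out of a literal's type. The helper name codeUnitCount is made up for illustration. The same single code point, U+1F981, yields a different array length in each encoding, which is why the conversion has to happen before the type is known.<br>
<br>
```cpp
#include <cstddef>

// Hypothetical helper: returns the array bound of a string literal's type,
// i.e. the number of code units including the terminating null.
template <typename CharT, std::size_t N>
constexpr std::size_t codeUnitCount(const CharT (&)[N]) {
    return N;
}

// U+1F981 needs four UTF-8 code units, two UTF-16 code units (a surrogate
// pair), or one UTF-32 code unit; each count below adds one for the null.
static_assert(codeUnitCount(u8"\U0001F981") == 5, "UTF-8: 4 code units + null");
static_assert(codeUnitCount(u"\U0001F981") == 3, "UTF-16: 2 code units + null");
static_assert(codeUnitCount(U"\U0001F981") == 2, "UTF-32: 1 code unit + null");
```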
<br>
You can check the width of the characters in a string using
StringLiteral::getCharByteWidth(). It's 1, 2 or 4, depending on
whether it's UTF-8, UTF-16, or UTF-32. You can read individual
characters from that array using StringLiteral::getCodeUnit(). Or
you can grab the whole array using StringLiteral::getBytes() (note
that the StringRef return type is a bit misleading: for wide literals it
holds raw 16- or 32-bit code units rather than text).<br>
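<br>
As a rough illustration of what reading by code unit means, the sketch below models a literal as raw bytes plus a code-unit width. LiteralModel, makeModel and combineSurrogates are hypothetical names for this example, not clang's classes; the only assumption is that each element is read at the literal's width in host byte order. An accessor like this returns one code unit per index, so a caller who wants the scalar value of a surrogate pair has to combine the two halves itself.<br>
<br>
```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a string literal's storage (not clang's class).
struct LiteralModel {
    std::vector<unsigned char> bytes; // raw storage, host byte order
    unsigned charByteWidth = 1;       // 1, 2 or 4

    std::size_t length() const { return bytes.size() / charByteWidth; }

    // Returns the i-th code unit, widened to 32 bits. No decoding happens.
    std::uint32_t codeUnit(std::size_t i) const {
        assert(i < length() && "code unit index out of range");
        const unsigned char *p = bytes.data() + i * charByteWidth;
        switch (charByteWidth) {
        case 1:
            return *p;
        case 2: {
            std::uint16_t v;
            std::memcpy(&v, p, sizeof v);
            return v;
        }
        default: {
            std::uint32_t v;
            std::memcpy(&v, p, sizeof v);
            return v;
        }
        }
    }
};

// Builds a model from already-encoded code units of type CharT.
template <typename CharT>
LiteralModel makeModel(const std::vector<CharT> &units) {
    LiteralModel m;
    m.charByteWidth = sizeof(CharT);
    m.bytes.resize(units.size() * sizeof(CharT));
    if (!units.empty())
        std::memcpy(m.bytes.data(), units.data(), m.bytes.size());
    return m;
}

// Combines a UTF-16 high/low surrogate pair into a Unicode scalar value.
std::uint32_t combineSurrogates(std::uint32_t hi, std::uint32_t lo) {
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
}
```
<br>
With a model built from the two UTF-16 code units 0xD83E and 0xDD81, codeUnit(0) and codeUnit(1) return those two values separately, and combineSurrogates on them gives 0x1F981.<br>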
<br>
Actually, you might not want to use a real UTF-16 to UTF-8
conversion; maybe better to translate all non-ASCII bytes to 0xFF or
something. Not that it really affects the parsing, but it probably
makes translating back to a source location along the lines of
StringLiteral::getLocationOfByte easier.<br>
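<br>
A minimal sketch of that flattening idea, under the assumption that the code units have already been read out into a vector. The function name flattenToAscii is hypothetical. It emits exactly one byte per input code unit, so byte i of the result still corresponds to code unit i of the literal, which is what keeps the mapping back to a source location simple.<br>
<br>
```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: keeps ASCII code units as they are and replaces every
// other code unit with the placeholder byte 0xFF. The output has the same
// number of elements as the input, so indices line up one to one.
std::string flattenToAscii(const std::vector<std::uint32_t> &codeUnits) {
    std::string out;
    out.reserve(codeUnits.size());
    for (std::uint32_t cu : codeUnits)
        out.push_back(cu < 0x80 ? static_cast<char>(cu) : '\xFF');
    return out;
}
```
<br>
For the code units of "%d", a lion emoji as a surrogate pair, and "%s", the result is six bytes long: the '%' specifiers stay at indices 0 and 4, and indices 2 and 3 hold 0xFF.<br>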
<br>
-Eli<br>
<br>
<pre class="gmail-m_250289437916535107moz-signature" cols="72">--
Employee of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project</pre>
</div>
</blockquote></div></div>