<div dir="ltr"><div class="gmail_default" style="font-size:large">> <span style="font-size:small">You can read individual characters from that array using StringLiteral::getCodeUnit()</span></div><div class="gmail_default" style="font-size:large"><span style="font-size:small"><br></span></div><div class="gmail_default" style="font-size:large"><span style="font-size:small">By "characters" do you mean a complete Unicode code point? I ask because a code unit is just a byte in UTF-8, or just a short in UTF-16, so a single code point can span several code units.</span></div><div class="gmail_default" style="font-size:large"><span style="font-size:small"><br></span></div><div class="gmail_default" style="font-size:large"><span style="font-size:small">Let's say that I have a string in clang somewhere that is </span><span style="font-size:small">🦁 (the Lion emoji).</span></div><div class="gmail_default" style="font-size:large"><span style="font-size:small"><br></span></div><div class="gmail_default" style="font-size:large"><span style="font-size:small">In UTF-32 the Lion emoji is a single code unit equal to its Unicode Scalar Value: U+1F981.</span></div><div class="gmail_default" style="font-size:large"><span style="font-size:small"><br></span></div><div class="gmail_default"><span style="font-size:small">In UTF-8 the Lion emoji would be encoded as four code units: 0xF0 0x9F 0xA6 0x81.</span></div><div class="gmail_default"><span style="font-size:small"><br></span></div><div class="gmail_default"><span style="font-size:small">In UTF-16 the Lion emoji would be encoded as the surrogate pair 0xD83E 0xDD81, which UTF-16(BE) stores as the bytes 0xD8 0x3E 0xDD 0x81.</span></div><div class="gmail_default"><span style="font-size:small"><br></span></div><div class="gmail_default"><span style="font-size:small">In UTF-16(LE) the code units are the same, but each one is stored byte-swapped: 0x3E 0xD8 0x81 0xDD.</span></div><div class="gmail_default"><span style="font-size:small"><br></span></div><div class="gmail_default"><span style="font-size:small">So what exactly does getCodeUnit() return?</span></div><div class="gmail_default" style="font-size:large"><span 
style="font-size:small"><br></span></div><div class="gmail_default" style="font-size:large"><span style="font-size:small">For example, if I did that on a string that had a Surrogate Pair, would I get the Unicode Scalar Value?</span></div><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex"><div bgcolor="#FFFFFF"><blockquote type="cite"><div dir="ltr"><div class="gmail_quote">

          <div><br>

          </div>

          <div style="font-size:large">Eli: "<span style="font-size:small">I don't follow; can't you just

              convert the format string from UTF-16/UTF-32 to UTF-8

              before checking it?  (Granted, that's not particularly

              efficient, but it's rare enough that it probably doesn't

              matter.)</span>"</div>

          <div><br>

          </div>

          <div> </div>

          <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex">

            <div style="font-size:large;display:inline"> and I realized a bit after posting

              this that converting the format strings from UTF-16/wchar_t

              to UTF-8 would probably be the best way to achieve this,

              Eli.</div>

          </blockquote>

          <div><br>

          </div>

          <div style="font-size:large">I'm just

            not sure how I'd handle the type matching. Do you know when

            that happens relative to when the string/character

            literals are converted? Would the conversion get in the way, or

            get messed up?</div>

        </div>

      </div>

    </blockquote>

    <br>

    In the clang AST, a string literal is represented as an array of

    integers of the appropriate width; the lexer converts from UTF-8 to

    UTF-16 or UTF-32 at the same time it resolves escapes.  (This is

    necessary to compute the length of the string, which is part of the

    string literal's type.)<br>
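    <br>
    As a rough illustration of why the length depends on the encoding, here is a standalone sketch using ordinary C++ literals rather than clang's internals. The helper name codeUnitCount is made up for this example. The same one-character literal yields a different array length for each prefix:<br>

```cpp
#include <cstddef>

// Hypothetical helper (not part of clang): number of elements in a string
// literal's array type, including the terminating null code unit.
template <typename CharT, std::size_t N>
constexpr std::size_t codeUnitCount(const CharT (&)[N]) {
  return N;
}

// U+1F981 (lion emoji), written as an escape so that the encoding of the
// source file does not matter.
static_assert(codeUnitCount(u8"\U0001F981") == 5, "4 UTF-8 code units + null");
static_assert(codeUnitCount(u"\U0001F981") == 3, "2 UTF-16 code units + null");
static_assert(codeUnitCount(U"\U0001F981") == 2, "1 UTF-32 code unit + null");
```

    The array bound N is part of each literal's type, so it cannot be known until the conversion to the target encoding has been done.<br>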

    <br>

    You can check the width of the characters in a string using

    StringLiteral::getCharByteWidth().  It's 1, 2 or 4, depending on

    whether it's UTF-8, UTF-16, or UTF-32.  You can read individual

    characters from that array using StringLiteral::getCodeUnit().  Or

    you can grab the whole array using StringLiteral::getBytes() (note

    that the return type here is a bit misleading).<br>
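    <br>
    To make the code unit versus code point distinction concrete, here is a minimal standalone sketch. It assumes that each element read from the array is one code unit, as the name getCodeUnit suggests, so a character outside the BMP in a UTF-16 literal arrives as two surrogate halves that the caller has to combine. The function decodeUtf16 is hypothetical and not a clang API; replacing unpaired surrogates with U+FFFD is a choice made for this example:<br>

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not a clang API): turn UTF-16 code units into
// Unicode scalar values. Unpaired surrogates become U+FFFD.
std::vector<std::uint32_t> decodeUtf16(const std::vector<std::uint16_t> &units) {
  std::vector<std::uint32_t> out;
  for (std::size_t i = 0; i < units.size(); ++i) {
    const std::uint32_t u = units[i];
    const bool isHigh = u >= 0xD800 && u <= 0xDBFF;
    const bool isLow = u >= 0xDC00 && u <= 0xDFFF;
    if (isHigh && i + 1 < units.size() &&
        units[i + 1] >= 0xDC00 && units[i + 1] <= 0xDFFF) {
      // Combine the high and low surrogate into one scalar value.
      const std::uint32_t lo = units[i + 1];
      out.push_back(0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00));
      ++i; // the low surrogate has been consumed
    } else if (isHigh || isLow) {
      out.push_back(0xFFFD); // unpaired surrogate
    } else {
      out.push_back(u); // BMP code point: one code unit
    }
  }
  return out;
}
```

    With the code units 0xD83E 0xDD81 as input, the result is the single value 0x1F981; reading index 0 alone would only give the high surrogate 0xD83E.<br>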

    <br>

    Actually, you might not want to use a real UTF-16 to UTF-8

    conversion; maybe better to translate all non-ASCII bytes to 0xFF or

    something. Not that it really affects the parsing, but it probably

    makes translating back to a source location along the lines of

    StringLiteral::getLocationOfByte easier.<br>
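    <br>
    A minimal sketch of that idea, assuming the code units have already been read into a vector. The name flattenForFormatCheck is hypothetical and not a clang API. Each code unit maps to exactly one output byte, so an offset in the flattened string is also the index of the corresponding code unit in the literal:<br>

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper (not a clang API): keep ASCII code units as they are
// and replace every other code unit with the placeholder byte 0xFF.
// Output index i always corresponds to input code unit i.
std::string flattenForFormatCheck(const std::vector<std::uint32_t> &units) {
  std::string out;
  out.reserve(units.size());
  for (std::uint32_t u : units)
    out.push_back(u < 0x80 ? static_cast<char>(u) : static_cast<char>(0xFF));
  return out;
}
```

    Format specifiers such as %d are pure ASCII, so they survive this mapping unchanged, while a real UTF-16 to UTF-8 conversion would shift offsets whenever a code point needs a different number of code units in the two encodings.<br>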

    <br>

    -Eli<br>

    <br>

    <pre class="gmail-m_250289437916535107moz-signature" cols="72">-- 

Employee of Qualcomm Innovation Center, Inc.

Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project</pre>

  </div>

</blockquote></div></div>