<html>

  <head>

    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

  </head>

  <body text="#000000" bgcolor="#FFFFFF">

    <div class="moz-cite-prefix">On 8/23/2018 3:27 PM, Marcus Johnson

      via cfe-dev wrote:<br>

    </div>

    <blockquote type="cite"

cite="mid:CAFWGNQXXL0o+3BxsUuKJ_WD8YbybUBOhEjE9D80aq_OsBSMVdA@mail.gmail.com">

      <meta http-equiv="content-type" content="text/html; charset=utf-8">

      <div dir="ltr">

        <div class="gmail_quote">

          <blockquote class="gmail_quote" style="margin:0px 0px 0px

0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex"><br>

            <div style="font-size:large;display:inline"

              class="gmail_default">Thanks for the link to that thread

              Tim.</div>

          </blockquote>

          <div><br>

          </div>

          <div style="font-size:large" class="gmail_default">Eli: "<span

              style="font-size:small">I don't follow; can't you just

              convert the format string from UTF-16/UTF-32 to UTF-8

              before checking it?  (Granted, that's not particularly

              efficient, but it's rare enough that it probably doesn't

              matter.)</span>"</div>

          <div><br>

          </div>

          <div> </div>

          <blockquote class="gmail_quote" style="margin:0px 0px 0px

0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex">

            <div style="font-size:large;display:inline"

              class="gmail_default"> and I realized a bit after posting

              this that converting the format strings from UTF-16/wchar

              to UTF-8 would probably be the best way to achieve this,

              Eli.</div>

          </blockquote>

          <div><br>

          </div>

          <div style="font-size:large" class="gmail_default">I'm just

            not sure how I'd handle the type matching. Do you know when

            that happens relative to when the string/character

            literals are converted? Would that get in the way, or

            get messed up?</div>

        </div>

      </div>

    </blockquote>

    <br>

    In the clang AST, a string literal is represented as an array of

    integers of the appropriate width; the lexer converts from UTF-8 to

    UTF-16 or UTF-32 at the same time it resolves escapes.  (This is

    necessary to compute the length of the string, which is part of the

    string literal's type.)<br>
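
    <br>

    To make the length point concrete, here is a small standalone sketch
    in plain C++ (no clang headers). The helper name codeUnits is made up
    for illustration. It only shows that the array bound, counted in code
    units of the target encoding, is baked into each literal's type:<br>

```cpp
#include <cstddef>

// Number of code units in a string literal, including the terminating NUL.
// The array bound N is part of the literal's type, so the compiler has to
// know it after escapes are resolved and the text is converted.
// (codeUnits is a hypothetical helper, not part of any library.)
template <typename CharT, std::size_t N>
constexpr std::size_t codeUnits(const CharT (&)[N]) {
  return N;
}

// U+00E9 takes two UTF-8 code units but only one UTF-16 or UTF-32 unit.
static_assert(codeUnits(u8"\u00e9") == 3, "2 UTF-8 units + NUL");
static_assert(codeUnits(u"\u00e9") == 2, "1 UTF-16 unit + NUL");
static_assert(codeUnits(U"\u00e9") == 2, "1 UTF-32 unit + NUL");

// U+1F600 is outside the BMP: a surrogate pair in UTF-16, one unit in UTF-32.
static_assert(codeUnits(u"\U0001F600") == 3, "2 UTF-16 units + NUL");
static_assert(codeUnits(U"\U0001F600") == 2, "1 UTF-32 unit + NUL");
```

    The same escape gives a different array bound in each encoding, which
    is why the conversion can't be deferred past the point where the
    literal's type is formed.<br>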

    <br>

    You can check the width of the characters in a string using

    StringLiteral::getCharByteWidth().  It's 1, 2 or 4, depending on

    whether the literal is UTF-8, UTF-16, or UTF-32.  You can read individual

    code units from that array using StringLiteral::getCodeUnit().  Or

    you can grab the whole array using StringLiteral::getBytes() (note

    that the return type here is a bit misleading: it's a StringRef, but

    for wide literals it holds the raw bytes of the code units, not

    8-bit text).<br>
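
    <br>

    As a rough illustration of what "raw bytes plus a width" means, here
    is a standalone sketch that does not use clang's headers. RawLiteral,
    unitCount, codeUnitAt and fromUTF16 are all hypothetical names; the
    struct only mimics the shape of the data described above and is not
    how StringLiteral itself is implemented:<br>

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical stand-in for a literal's storage: a raw byte buffer plus the
// width of one code unit. This mirrors the idea of getBytes() together with
// getCharByteWidth(); it is not clang's actual class.
struct RawLiteral {
  std::string bytes;      // raw storage; NOT 8-bit text when the width is 2 or 4
  unsigned charByteWidth; // 1, 2 or 4
};

// Number of code units stored in the buffer.
inline std::size_t unitCount(const RawLiteral &lit) {
  return lit.bytes.size() / lit.charByteWidth;
}

// Read the i-th code unit, widened to 32 bits. memcpy avoids unaligned or
// aliasing-unsafe loads from the byte buffer.
inline std::uint32_t codeUnitAt(const RawLiteral &lit, std::size_t i) {
  const char *p = lit.bytes.data() + i * lit.charByteWidth;
  switch (lit.charByteWidth) {
  case 1:
    return static_cast<unsigned char>(*p);
  case 2: {
    std::uint16_t u;
    std::memcpy(&u, p, sizeof u);
    return u;
  }
  case 4: {
    std::uint32_t u;
    std::memcpy(&u, p, sizeof u);
    return u;
  }
  default:
    return 0; // unsupported width
  }
}

// Build a RawLiteral from UTF-16 code units, for demonstration only.
inline RawLiteral fromUTF16(const std::u16string &s) {
  return {std::string(reinterpret_cast<const char *>(s.data()),
                      s.size() * sizeof(char16_t)),
          2u};
}
```

    Treating the byte buffer as ordinary text would see six "characters"
    for the three-unit string u"%d\u00e9", which is the trap the note
    about the return type is warning about.<br>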

    <br>

    Actually, you might not want to use a real UTF-16 to UTF-8

    conversion; it may be better to translate every non-ASCII code unit

    to a single placeholder byte such as 0xFF. Not that it really affects

    the parsing, but it probably makes translating back to a source

    location along the lines of StringLiteral::getLocationOfByte() easier,

    since byte N of the converted string still lines up with code unit N

    of the literal.<br>
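
    <br>

    A minimal sketch of that placeholder idea, again standalone and not
    tied to clang: toAsciiProxy and kNonAsciiPlaceholder are hypothetical
    names, and the input is assumed to be UTF-16 code units already
    pulled out of the literal:<br>

```cpp
#include <cstddef>
#include <string>

// Placeholder for any code unit outside ASCII. printf-style conversion
// specifiers are all ASCII, so a narrow format-string parser should treat
// this byte as ordinary text. (Hypothetical name, chosen for illustration.)
constexpr char kNonAsciiPlaceholder = static_cast<char>(0xFF);

// Map each UTF-16 code unit to exactly one byte: ASCII units are copied,
// everything else (including both halves of a surrogate pair) becomes the
// placeholder. Byte index N in the result equals code unit index N in the
// input, so offsets reported by the parser need no further translation.
inline std::string toAsciiProxy(const std::u16string &units) {
  std::string out;
  out.reserve(units.size());
  for (char16_t u : units)
    out.push_back(u < 0x80 ? static_cast<char>(u) : kNonAsciiPlaceholder);
  return out;
}
```

    With a real UTF-8 conversion the '%' in u"\u00e9%d" would move from
    index 1 to index 2; with the placeholder mapping it stays at index 1,
    the same position it has in the literal's code units.<br>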

    <br>

    -Eli<br>

    <br>

    <pre class="moz-signature" cols="72">-- 

Employee of Qualcomm Innovation Center, Inc.

Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project</pre>

  </body>

</html>