<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<div class="moz-cite-prefix">On 8/23/2018 3:27 PM, Marcus Johnson
via cfe-dev wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CAFWGNQXXL0o+3BxsUuKJ_WD8YbybUBOhEjE9D80aq_OsBSMVdA@mail.gmail.com">
<meta http-equiv="content-type" content="text/html; charset=utf-8">
<div dir="ltr">
<div class="gmail_quote">
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex"><br>
<div style="font-size:large;display:inline"
class="gmail_default">Thanks for the link to that thread
Tim.</div>
</blockquote>
<div><br>
</div>
<div style="font-size:large" class="gmail_default">Eli: "<span
style="font-size:small">I don't follow; can't you just
convert the format string from UTF-16/UTF-32 to UTF-8
before checking it? (Granted, that's not particularly
efficient, but it's rare enough that it probably doesn't
matter.)</span>"</div>
<div><br>
</div>
<div> </div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex">
<div style="font-size:large;display:inline"
class="gmail_default"> and I realized a bit after posting
this that converting the format strings from UTF-16/wchar
to UTF-8 would probably be the best way to achieve this,
Eli.</div>
</blockquote>
<div><br>
</div>
<div style="font-size:large" class="gmail_default">I'm just
not sure how I'd handle the type matching. Do you know when
that happens relative to when the string/character
literals are converted? Would that get in the way, or
get messed up?</div>
</div>
</div>
</blockquote>
<br>
In the clang AST, a string literal is represented as an array of
integers of the appropriate width; the lexer converts from UTF-8 to
UTF-16 or UTF-32 at the same time it resolves escapes. (This is
necessary to compute the length of the string, which is part of the
string literal's type.)<br>
<br>
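As a small illustration of why the length has to be known that early, here is a sketch at the language level (not Clang's internals; codeUnitCount is a made-up helper). The array bound of a literal counts code units after encoding conversion and escape resolution, so the same source text gets a different type under each prefix.<br>
<br>

```cpp
#include <cstddef>

// Hypothetical helper: number of code units in a string literal, excluding
// the terminating null. The bound N is part of the literal's array type, so
// this is a compile-time constant.
template <typename CharT, std::size_t N>
constexpr std::size_t codeUnitCount(const CharT (&)[N]) {
  return N - 1;
}

// U+00E9 takes 2 code units in UTF-8, 1 in UTF-16 and 1 in UTF-32.
static_assert(codeUnitCount(u8"\u00e9") == 2, "UTF-8");
static_assert(codeUnitCount(u"\u00e9") == 1, "UTF-16");
static_assert(codeUnitCount(U"\u00e9") == 1, "UTF-32");

// U+1F600 takes 4 code units in UTF-8, 2 in UTF-16 (a surrogate pair) and
// 1 in UTF-32.
static_assert(codeUnitCount(u8"\U0001F600") == 4, "UTF-8");
static_assert(codeUnitCount(u"\U0001F600") == 2, "UTF-16");
static_assert(codeUnitCount(U"\U0001F600") == 1, "UTF-32");

// Escapes are resolved before the length is computed: "\n" is one code unit.
static_assert(codeUnitCount(u"\n") == 1, "escape");
```

<br>
The static_asserts only compile if the counts hold, so nothing needs to run.<br>
<br>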
You can check the width of the code units in a string using
StringLiteral::getCharByteWidth(). It's 1, 2, or 4, depending on
whether the literal is UTF-8, UTF-16, or UTF-32. You can read
individual code units from that array using
StringLiteral::getCodeUnit(), or grab the whole array using
StringLiteral::getBytes() (note that the return type is a bit
misleading: it's a StringRef over the raw bytes, even when each code
unit is 2 or 4 bytes wide).<br>
<br>
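To make the relationship between the byte width and the raw array concrete, here is a rough sketch of pulling code unit i out of a flat byte buffer. readCodeUnit is a made-up stand-in, not Clang's implementation, and it assumes the buffer holds code units in host byte order.<br>
<br>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in, not a Clang API: slice a flat byte buffer into code
// units of the given width (1, 2 or 4 bytes), assuming host byte order.
std::uint32_t readCodeUnit(const char *bytes, unsigned charByteWidth,
                           std::size_t index) {
  assert(charByteWidth == 1 || charByteWidth == 2 || charByteWidth == 4);
  const char *p = bytes + index * charByteWidth;
  switch (charByteWidth) {
  case 1:
    return static_cast<unsigned char>(*p);
  case 2: {
    std::uint16_t unit;
    std::memcpy(&unit, p, sizeof unit); // memcpy sidesteps alignment issues
    return unit;
  }
  default: {
    std::uint32_t unit;
    std::memcpy(&unit, p, sizeof unit);
    return unit;
  }
  }
}
```

<br>
With a real StringLiteral the inputs would presumably come from getBytes() and getCharByteWidth(), though in practice you would just call getCodeUnit() directly.<br>
<br>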
Actually, you might not want to use a real UTF-16 to UTF-8
conversion; it may be better to translate every non-ASCII code unit
to 0xFF or something, so each code unit stays one byte and offsets
don't shift. Not that it really affects the parsing, but it probably
makes translating back to a source location, along the lines of
StringLiteral::getLocationOfByte, easier.<br>
<br>
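A minimal sketch of that placeholder idea, assuming the format string is already available as UTF-16 code units. flattenForFormatCheck is a hypothetical name, not something in Clang.<br>
<br>

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper, not a Clang API: map every UTF-16 code unit to exactly
// one byte. ASCII code units are copied as-is; anything else (including both
// halves of a surrogate pair) becomes the placeholder byte 0xFF.
std::string flattenForFormatCheck(const std::u16string &units) {
  std::string out;
  out.reserve(units.size());
  for (char16_t unit : units)
    out.push_back(unit < 0x80 ? static_cast<char>(unit) : '\xFF');
  // One output byte per input code unit, so indices carry over unchanged.
  assert(out.size() == units.size());
  return out;
}
```

<br>
Because the output has one byte per code unit, the index of a "%s" found in the flattened string is also its code-unit index in the original literal. A real UTF-8 conversion would turn U+00E9 into two bytes and shift everything after it.<br>
<br>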
-Eli<br>
<br>
<pre class="moz-signature" cols="72">--
Employee of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project</pre>
</body>
</html>