<html>
<head>
<base href="https://bugs.llvm.org/">
</head>
<body><span class="vcard"><a class="email" href="mailto:efriedma@quicinc.com" title="Eli Friedman <efriedma@quicinc.com>"> <span class="fn">Eli Friedman</span></a>
</span> changed
<a class="bz_bug_link
bz_status_REOPENED "
title="REOPENED - [clang-cl] incorrectly encodes ordinary string literals containing universal-character-names in UTF-8"
href="https://bugs.llvm.org/show_bug.cgi?id=41536">bug 41536</a>
<br>
<table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>What</th>
<th>Removed</th>
<th>Added</th>
</tr>
<tr>
<td style="text-align:right;">Status</td>
<td>RESOLVED
</td>
<td>REOPENED
</td>
</tr>
<tr>
<td style="text-align:right;">CC</td>
<td>
</td>
<td>efriedma@quicinc.com
</td>
</tr>
<tr>
<td style="text-align:right;">Resolution</td>
<td>WONTFIX
</td>
<td>---
</td>
</tr></table>
<p>
<div>
<b><a class="bz_bug_link
bz_status_REOPENED "
title="REOPENED - [clang-cl] incorrectly encodes ordinary string literals containing universal-character-names in UTF-8"
href="https://bugs.llvm.org/show_bug.cgi?id=41536#c2">Comment # 2</a>
on <a class="bz_bug_link
bz_status_REOPENED "
title="REOPENED - [clang-cl] incorrectly encodes ordinary string literals containing universal-character-names in UTF-8"
href="https://bugs.llvm.org/show_bug.cgi?id=41536">bug 41536</a>
from <span class="vcard"><a class="email" href="mailto:efriedma@quicinc.com" title="Eli Friedman <efriedma@quicinc.com>"> <span class="fn">Eli Friedman</span></a>
</span></b>
<pre>There's a real issue here, I think. Yes, "\U" escapes specify a Unicode
character, but the standard doesn't specify how Unicode characters are encoded
(outside of u/U/u8 string literals).
Specifically, the issue here is that clang-cl has a different default from cl
for /execution-charset.
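To make the difference concrete, here is a minimal sketch (in Python rather
than C++, purely for illustration). It assumes U+00E9 as the escaped character
and Windows-1252 as an example of a non-UTF-8 execution charset; neither is
taken from the report itself.

```python
# Illustration only: the bytes that a string literal containing "\u00e9"
# could end up with under two different execution charsets.
# Windows-1252 (cp1252) is an assumed example of a legacy code page.
code_point = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

utf8_bytes = code_point.encode("utf-8")     # two bytes: C3 A9
cp1252_bytes = code_point.encode("cp1252")  # one byte: E9

print(utf8_bytes.hex(), cp1252_bytes.hex())  # prints "c3a9 e9"
```

The same escape yields a different byte sequence, and a different literal
length, depending on which execution charset is in effect.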
clang currently does not support anything equivalent to the MSVC
/execution-charset flag. It assumes the source and execution charsets are both
UTF-8 (as if the MSVC "/utf-8" flag were passed). We mostly get away with this
at the moment because most source code is ASCII, and we have a hack to pass
through the raw bytes of string literals even if they aren't valid UTF-8.
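A rough sketch of what such a pass-through amounts to, again in Python. The
helper names is_valid_utf8 and emit_literal are hypothetical, and the sample
bytes assume a source file saved in Windows-1252; this is not how clang is
actually implemented.

```python
# Hypothetical source bytes for a literal spelled c-a-f-(e with acute),
# as stored in a file saved in Windows-1252. 0xE9 alone is not valid UTF-8.
raw_literal = b"caf\xe9"

def is_valid_utf8(data):
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def emit_literal(data):
    """Sketch of a pass-through: the bytes are emitted unchanged,
    whether or not they are valid UTF-8."""
    return data

print(is_valid_utf8(raw_literal))                 # prints "False"
print(emit_literal(raw_literal) == raw_literal)   # prints "True"
```

Under this model the literal keeps whatever bytes the source file happened to
contain, which matches a legacy code page only by accident of how the file was
saved.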
It's not clear we would actually want to change the defaults here, but it seems
like a legitimate request to provide the option to specify /execution-charset
and /source-charset.
Implementing /execution-charset and /source-charset would probably be a
substantial project. There isn't anything fundamentally tricky; for any
ASCII-compatible encoding, it's basically just a matter of translating string
literals and identifiers correctly. (We generally don't need to translate
comments, and non-ASCII characters aren't legal anywhere else.) But LLVM
currently doesn't have any support for translating from Unicode to non-Unicode
charsets, so it's likely to spark a complicated debate over how to perform that
translation.
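One question any such translation has to answer is what happens when a
character has no representation in the target charset. That is an assumption
about where the debate might go, not something stated above. The sketch below
uses Python's built-in codecs; to_execution_charset and its on_error parameter
are hypothetical names.

```python
def to_execution_charset(literal, charset, on_error="strict"):
    """Translate a Unicode string literal into the bytes of an assumed
    execution charset. on_error selects the policy for characters the
    charset cannot represent: "strict" raises, "replace" substitutes "?"."""
    return literal.encode(charset, errors=on_error)

# U+00E9 exists in Windows-1252, so translation is straightforward.
representable = to_execution_charset("\u00e9", "cp1252")

# U+20AC (euro sign) does not exist in Latin-1; the policy decides the result.
replaced = to_execution_charset("\u20ac", "latin-1", on_error="replace")

try:
    to_execution_charset("\u20ac", "latin-1")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

print(representable, replaced, strict_failed)
```

Whether a compiler should reject the literal, substitute a placeholder, or do
something else in the unrepresentable case is a policy choice.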
See also <a class="bz_bug_link
bz_status_NEW "
title="NEW - No support for -finput-charset other than UTF-8"
href="show_bug.cgi?id=39864">bug 39864</a>.</pre>
</div>
</p>
</body>
</html>