<html>

    <head>

      <base href="https://bugs.llvm.org/">

    </head>

    <body><span class="vcard"><a class="email" href="mailto:efriedma@quicinc.com" title="Eli Friedman <efriedma@quicinc.com>"> <span class="fn">Eli Friedman</span></a>

</span> changed

          <a class="bz_bug_link 

          bz_status_REOPENED "

   title="REOPENED - [clang-cl] incorrectly encodes ordinary string literals containing universal-character-names in UTF-8"

   href="https://bugs.llvm.org/show_bug.cgi?id=41536">bug 41536</a>

          <br>

             <table border="1" cellspacing="0" cellpadding="8">

          <tr>

            <th>What</th>

            <th>Removed</th>

            <th>Added</th>

          </tr>

         <tr>

           <td style="text-align:right;">Status</td>

           <td>RESOLVED

           </td>

           <td>REOPENED

           </td>

         </tr>

         <tr>

           <td style="text-align:right;">CC</td>

           <td>

           </td>

           <td>efriedma@quicinc.com

           </td>

         </tr>

         <tr>

           <td style="text-align:right;">Resolution</td>

           <td>WONTFIX

           </td>

           <td>---

           </td>

         </tr></table>

      <p>

        <div>

            <b><a class="bz_bug_link 

          bz_status_REOPENED "

   title="REOPENED - [clang-cl] incorrectly encodes ordinary string literals containing universal-character-names in UTF-8"

   href="https://bugs.llvm.org/show_bug.cgi?id=41536#c2">Comment # 2</a>

              on <a class="bz_bug_link 

          bz_status_REOPENED "

   title="REOPENED - [clang-cl] incorrectly encodes ordinary string literals containing universal-character-names in UTF-8"

   href="https://bugs.llvm.org/show_bug.cgi?id=41536">bug 41536</a>

              from <span class="vcard"><a class="email" href="mailto:efriedma@quicinc.com" title="Eli Friedman <efriedma@quicinc.com>"> <span class="fn">Eli Friedman</span></a>

</span></b>

        <pre>There's a real issue here, I think.  Yes, "\U" escapes specify a Unicode

character, but the standard doesn't specify how Unicode characters are encoded

(outside of u/U/u8 string literals).

Specifically, the issue here is that clang-cl has a different default from cl

for /execution-charset.

clang currently does not support anything equivalent to the MSVC

/execution-charset flag.  It assumes the source and execution charset are both

UTF-8 (as if the MSVC "/utf-8" flag were passed).  We mostly get away with this

at the moment because most source code is ASCII, and we have a hack to pass

through the raw bytes of string literals even if they aren't valid UTF-8.
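
To make the mismatch concrete, here is a rough sketch (in Python rather than
C++, purely for illustration) of how one universal-character-name ends up as
different bytes depending on the execution charset. It assumes a Windows
system whose active code page is Windows-1252; encode_literal is a
hypothetical helper, not anything in clang or cl.

```python
# Illustrative model only: the "execution charset" is represented by a
# Python codec name. encode_literal is a hypothetical helper.
def encode_literal(text, execution_charset):
    """Return the bytes a narrow string literal would hold under a charset."""
    return text.encode(execution_charset)


literal = "caf\u00e9"  # U+00E9, LATIN SMALL LETTER E WITH ACUTE

# What a compiler assuming a UTF-8 execution charset would emit.
utf8_bytes = encode_literal(literal, "utf-8")

# What a compiler using Windows-1252 as the execution charset would emit
# (assumption: the system code page is 1252).
cp1252_bytes = encode_literal(literal, "cp1252")

print(utf8_bytes.hex(" "))    # 63 61 66 c3 a9
print(cp1252_bytes.hex(" "))  # 63 61 66 e9
```

Same source text, different object code, which is the discrepancy reported
in this bug.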

It's not clear we would actually want to change the defaults here, but it seems

like a legitimate request to provide the option to specify /execution-charset

and /source-charset.

Implementing /execution-charset and /source-charset would probably be a

substantial project. Nothing about it is fundamentally tricky; for any

ASCII-compatible encoding, it's basically just a matter of translating string

literals and identifiers correctly.  (We generally don't need to translate

comments, and non-ASCII characters aren't legal anywhere else.)  But LLVM

currently doesn't have any support for translating from Unicode to non-Unicode

charsets, so it's likely to spark a complicated debate over how to perform that

translation.
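
As a sketch of the kind of policy question that debate would have to settle,
consider what to do with a character the target charset cannot represent.
The snippet below (Python, illustration only) shows two possible policies,
rejecting the literal or substituting "?". translate_literal and its
on_unmappable parameter are hypothetical names, and neither policy is
claimed to be what clang or cl does.

```python
# Hypothetical helper: translate an already-decoded string literal into a
# non-Unicode execution charset, with an explicit policy for characters
# that the charset cannot represent.
def translate_literal(text, charset, on_unmappable="error"):
    """Encode `text` in `charset`.

    on_unmappable="error"   raises UnicodeEncodeError
    on_unmappable="replace" substitutes "?" for each unmappable character
    """
    if on_unmappable == "replace":
        return text.encode(charset, errors="replace")
    return text.encode(charset, errors="strict")


# U+00E9 exists in Windows-1252, so it maps to a single byte.
print(translate_literal("\u00e9", "cp1252"))  # b'\xe9'

# U+1F600 does not exist in Windows-1252; the "replace" policy yields "?".
print(translate_literal("\U0001F600", "cp1252", on_unmappable="replace"))  # b'?'

# The "error" policy refuses to translate instead.
try:
    translate_literal("\U0001F600", "cp1252")
except UnicodeEncodeError as exc:
    print("unmappable character:", exc.reason)
```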

See also <a class="bz_bug_link 

          bz_status_NEW "

   title="NEW - No support for -finput-charset other than UTF-8"

   href="show_bug.cgi?id=39864">bug 39864</a>.</pre>

        </div>

      </p>


    </body>

</html>