<html>
    <head>
      <base href="https://bugs.llvm.org/">
    </head>
    <body><table border="1" cellspacing="0" cellpadding="8">
        <tr>
          <th>Bug ID</th>
          <td><a class="bz_bug_link 
          bz_status_NEW "
   title="NEW - [clang-cl] incorrectly encodes ordinary string literals containing universal-character-names in UTF-8"
   href="https://bugs.llvm.org/show_bug.cgi?id=41536">41536</a>
          </td>
        </tr>

        <tr>
          <th>Summary</th>
          <td>[clang-cl] incorrectly encodes ordinary string literals containing universal-character-names in UTF-8
          </td>
        </tr>

        <tr>
          <th>Product</th>
          <td>clang
          </td>
        </tr>

        <tr>
          <th>Version</th>
          <td>8.0
          </td>
        </tr>

        <tr>
          <th>Hardware</th>
          <td>PC
          </td>
        </tr>

        <tr>
          <th>OS</th>
          <td>Windows NT
          </td>
        </tr>

        <tr>
          <th>Status</th>
          <td>NEW
          </td>
        </tr>

        <tr>
          <th>Severity</th>
          <td>enhancement
          </td>
        </tr>

        <tr>
          <th>Priority</th>
          <td>P
          </td>
        </tr>

        <tr>
          <th>Component</th>
          <td>C++
          </td>
        </tr>

        <tr>
          <th>Assignee</th>
          <td>unassignedclangbugs@nondot.org
          </td>
        </tr>

        <tr>
          <th>Reporter</th>
          <td>Casey@Carter.net
          </td>
        </tr>

        <tr>
          <th>CC</th>
          <td>blitzrakete@gmail.com, dgregor@apple.com, erik.pilkington@gmail.com, llvm-bugs@lists.llvm.org, richard-llvm@metafoo.co.uk
          </td>
        </tr></table>
      <p>
        <div>
        <pre>Compiling this program:

  extern const char str[] = "\u0020\u00f4\u00e2";

  int main() {
    static_assert(sizeof(str) == 4, "BOOM");
    static_assert(sizeof(str) != 6, "BOOM");
  }

with "cl /FA /c" (presumably any version; notably the 19.20 release on Compiler
Explorer) succeeds, and produces assembly output containing the line:

  ?str@@3QBDB DB        ' ', 0f4H, 0e2H, 00H                    ; str

Note that the three universal-character-names (UCNs) have been replaced with the
appropriate corresponding WINDOWS-1252 encodings, which happen to use the same
code unit values as Unicode, as required by [lex.ccon]/9.

Compiling the same program with "clang-cl /FA /c" does not succeed: both
static_asserts fire. Commenting them out, the produced assembly contains the
line:

  .asciz        " \303\264\303\242"

which, despite using superior AT&T syntax, is substantially different. The UCNs
are now encoded in UTF-8, which will produce mojibake when output to a console
expecting WINDOWS-1252. \u0080 is another good example: cl rejects it because
U+0080 has no representation in WINDOWS-1252, whereas clang-cl encodes U+0080 in
UTF-8 without complaint.</pre>
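        <pre>The byte counts behind the two static_asserts can be checked in isolation.
The following is a minimal sketch, not part of the original test case: the
names utf8_length, kUtf8 and kCp1252 are hypothetical, and the bytes are
spelled with hex escapes so that the result does not depend on the execution
character set the compiler picks.

```cpp
// Hypothetical helper: number of UTF-8 code units needed for a code point.
// Surrogates and values above U+10FFFF are not validated in this sketch.
constexpr int utf8_length(char32_t cp) {
  return cp > 0xFFFF ? 4 : cp > 0x7FF ? 3 : cp > 0x7F ? 2 : 1;
}

// U+0020 is one code unit in UTF-8; U+00F4 and U+00E2 are two each.
static_assert(utf8_length(0x0020) == 1, "U+0020 is a single byte");
static_assert(utf8_length(0x00F4) == 2, "U+00F4 is two bytes");
static_assert(utf8_length(0x00E2) == 2, "U+00E2 is two bytes");

// Bytes clang-cl emits (UTF-8): 1 + 2 + 2 + terminating NUL = 6.
constexpr char kUtf8[] = "\x20\xc3\xb4\xc3\xa2";
static_assert(sizeof(kUtf8) == 6, "UTF-8 form");

// Bytes cl emits (WINDOWS-1252): 1 + 1 + 1 + terminating NUL = 4.
constexpr char kCp1252[] = "\x20\xf4\xe2";
static_assert(sizeof(kCp1252) == 4, "WINDOWS-1252 form");
```

Because the bytes are written out explicitly, this should compile with either
compiler. It only illustrates where the sizes 4 and 6 come from.</pre>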
        </div>
      </p>


    </body>
</html>