<html>

    <head>

      <base href="https://bugs.llvm.org/">

    </head>

    <body><table border="1" cellspacing="0" cellpadding="8">

        <tr>

          <th>Bug ID</th>

          <td><a class="bz_bug_link 

          bz_status_NEW "

   title="NEW - [clang-cl] incorrectly encodes ordinary string literals containing universal-character-names in UTF-8"

   href="https://bugs.llvm.org/show_bug.cgi?id=41536">41536</a>

          </td>

        </tr>

        <tr>

          <th>Summary</th>

          <td>[clang-cl] incorrectly encodes ordinary string literals containing universal-character-names in UTF-8

          </td>

        </tr>

        <tr>

          <th>Product</th>

          <td>clang

          </td>

        </tr>

        <tr>

          <th>Version</th>

          <td>8.0

          </td>

        </tr>

        <tr>

          <th>Hardware</th>

          <td>PC

          </td>

        </tr>

        <tr>

          <th>OS</th>

          <td>Windows NT

          </td>

        </tr>

        <tr>

          <th>Status</th>

          <td>NEW

          </td>

        </tr>

        <tr>

          <th>Severity</th>

          <td>enhancement

          </td>

        </tr>

        <tr>

          <th>Priority</th>

          <td>P

          </td>

        </tr>

        <tr>

          <th>Component</th>

          <td>C++

          </td>

        </tr>

        <tr>

          <th>Assignee</th>

          <td>unassignedclangbugs@nondot.org

          </td>

        </tr>

        <tr>

          <th>Reporter</th>

          <td>Casey@Carter.net

          </td>

        </tr>

        <tr>

          <th>CC</th>

          <td>blitzrakete@gmail.com, dgregor@apple.com, erik.pilkington@gmail.com, llvm-bugs@lists.llvm.org, richard-llvm@metafoo.co.uk

          </td>

        </tr></table>

      <p>

        <div>

        <pre>Compiling this program:

  extern const char str[] = "\u0020\u00f4\u00e2";

  int main() {

    static_assert(sizeof(str) == 4, "BOOM");

    static_assert(sizeof(str) != 6, "BOOM");

  }

with "cl /FA /c" (presumably any version; notably the 19.20 release on Compiler

Explorer) succeeds, and produces assembly output containing the line:

  ?str@@3QBDB DB        ' ', 0f4H, 0e2H, 00H                    ; str

Note that the three universal-character-names (UCNs) have been replaced with the

appropriate corresponding WINDOWS-1252 encodings, which happen to use the same

code unit values as Unicode, as required by [lex.ccon]/9.

Compiling the same program with "clang-cl /FA /c" does not succeed: both

static_asserts fire. Commenting them out, the produced assembly contains the

line:

  .asciz        " \303\264\303\242"

which, despite using superior AT&T syntax, is substantially different. The UCNs

are now encoded in UTF-8, which will produce mojibake when output to a console

expecting WINDOWS-1252. \u0080 is another good example; cl rejects it since

U+0080 has no representation in WINDOWS-1252 whereas clang-cl encodes U+0080 in

UTF-8 without complaint.</pre>
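        <pre>As a rough illustration of where the sizes 4 and 6 in the static_asserts come
from, the sketch below counts code units per UCN at compile time. The names
utf8_length, ucns, utf8_sizeof and single_byte_sizeof are hypothetical helpers
introduced only for this example; they are not part of cl, clang-cl, or the
original test case. It assumes a single-byte execution character set such as
WINDOWS-1252 on the cl side and UTF-8 on the clang-cl side.

```cpp
// Hypothetical helper: number of UTF-8 code units needed to encode one
// Unicode scalar value.
constexpr int utf8_length(char32_t cp) {
  return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}

// The three UCNs from the string literal in the report.
constexpr char32_t ucns[] = {0x0020, 0x00f4, 0x00e2};

// Array size when the literal is encoded in UTF-8: the encoded code units
// plus the terminating NUL.
constexpr int utf8_sizeof() {
  return utf8_length(ucns[0]) + utf8_length(ucns[1]) + utf8_length(ucns[2]) + 1;
}

// Array size under a single-byte execution character set (assumed here to
// be WINDOWS-1252): one code unit per character plus the terminating NUL.
constexpr int single_byte_sizeof() { return 3 + 1; }

static_assert(single_byte_sizeof() == 4, "size observed with cl");
static_assert(utf8_sizeof() == 6, "size observed with clang-cl");
```

All checks run at compile time, so compiling the snippet as a translation unit
is enough. U+00F4 and U+00E2 each take two code units in UTF-8, which accounts
for the two extra bytes in the clang-cl output.</pre>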

        </div>

      </p>


    </body>

</html>