<html>

    <head>

      <base href="http://llvm.org/bugs/" />

    </head>

    <body><table border="1" cellspacing="0" cellpadding="8">

        <tr>

          <th>Bug ID</th>

          <td><a class="bz_bug_link 

          bz_status_NEW "

   title="NEW --- - Clang++ accepts invalid Unicode character literals"

   href="http://llvm.org/bugs/show_bug.cgi?id=18535">18535</a>

          </td>

        </tr>

        <tr>

          <th>Summary</th>

          <td>Clang++ accepts invalid Unicode character literals

          </td>

        </tr>

        <tr>

          <th>Product</th>

          <td>clang

          </td>

        </tr>

        <tr>

          <th>Version</th>

          <td>trunk

          </td>

        </tr>

        <tr>

          <th>Hardware</th>

          <td>PC

          </td>

        </tr>

        <tr>

          <th>OS</th>

          <td>Linux

          </td>

        </tr>

        <tr>

          <th>Status</th>

          <td>NEW

          </td>

        </tr>

        <tr>

          <th>Severity</th>

          <td>normal

          </td>

        </tr>

        <tr>

          <th>Priority</th>

          <td>P

          </td>

        </tr>

        <tr>

          <th>Component</th>

          <td>C++11

          </td>

        </tr>

        <tr>

          <th>Assignee</th>

          <td>unassignedclangbugs@nondot.org

          </td>

        </tr>

        <tr>

          <th>Reporter</th>

          <td>wjl@icecavern.net

          </td>

        </tr>

        <tr>

          <th>CC</th>

          <td>dgregor@apple.com, llvmbugs@cs.uiuc.edu

          </td>

        </tr>

        <tr>

          <th>Classification</th>

          <td>Unclassified

          </td>

        </tr></table>

      <p>

        <div>

        <pre>Clang++ accepts invalid Unicode character literals, such as U'\u0000'.

I found this bug while looking into this GCC issue:

<a href="http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59873">http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59873</a>

In that bug, I thought gcc was in error because it was munging my U'\u0000'

into the value 1 (which admittedly is BIZARRE), whereas Clang treated it with

the value I expected, which is 0.

However, it was pointed out in that bug that such a Unicode literal is

apparently invalid. This is quoted in the gcc source code in libcpp/charset.c,

apparently from the C99 standard (and I assume -- hopefully correctly -- that

this applies to C++11):

   C99 6.4.3: A universal character name shall not specify a character

   whose short identifier is less than 00A0 other than 0024 ($), 0040 (@),

   or 0060 (`), nor one in the range D800 through DFFF inclusive.
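
As a minimal sketch (not taken from the gcc or clang sources), the quoted C99
wording can be written as a predicate. The name ucn_allowed_by_c99 is
hypothetical, and the sketch models only the C99 text above; whether C++11
imposes the same rule is the assumption already stated in this report.

```cpp
// Hypothetical helper: true if code point `cp` may be written as a universal
// character name under the C99 6.4.3 wording quoted above.
constexpr bool ucn_allowed_by_c99(char32_t cp) {
    // Allowed: 0024 ($), 0040 (@), 0060 (`), or anything not less than 00A0,
    // but never a value in the surrogate range D800 through DFFF.
    return (cp == 0x0024 || cp == 0x0040 || cp == 0x0060 || cp >= 0x00A0)
        && !(cp >= 0xD800 && cp <= 0xDFFF);
}

// U'\u0000' from this report: less than 00A0 and not one of the exceptions.
static_assert(!ucn_allowed_by_c99(0x0000), "0000 is excluded by the quoted rule");
// One of the three explicit exceptions below 00A0.
static_assert(ucn_allowed_by_c99(0x0024), "0024 ($) is an explicit exception");
// A surrogate, which clang and gcc already reject.
static_assert(!ucn_allowed_by_c99(0xD800), "surrogates are excluded");
```

Under this reading, every value less than 00A0 except the three listed would be
rejected, not only the surrogates.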

Currently clang (and gcc) yield a compiler error if you try to use something

like U'\ud800' because it is a surrogate. However, *all* other literals work

(as posted in the gcc bug, I generated a program (17 MiB of source code) which

tests every possible Unicode literal, and they all are accepted and give the

right numeric value on clang, except for surrogates, which are rejected).</pre>

        </div>

      </p>


    </body>

</html>