<table border="1" cellspacing="0" cellpadding="8">
    <tr>
        <th>Issue</th>
        <td>
            <a href="https://github.com/llvm/llvm-project/issues/125461">125461</a>
        </td>
    </tr>

    <tr>
        <th>Summary</th>
        <td>
            clang mishandles a unicode escape beyond maximum codepoint limit.
        </td>
    </tr>

    <tr>
      <th>Labels</th>
      <td>
            clang
      </td>
    </tr>

    <tr>
      <th>Assignees</th>
      <td>
      </td>
    </tr>

    <tr>
      <th>Reporter</th>
      <td>
          mrolle45
      </td>
    </tr>
</table>

<pre>
    Normally a Unicode escape sequence (a universal character name) can appear within an identifier token.  However, if the escape denotes a value greater than the maximum legal code point (0x10FFFF), the result should not be a valid identifier. Instead, clang emits a diagnostic for the illegal code point, then ignores the escape sequence and continues lexing the identifier.
Thus, for example, the string `Y\U00110000Z` is lexed as the identifier `YZ`, and the string `Y\U00110001Z` is lexed as the same identifier `YZ`.

Here's sample code (the first four lines), followed by the output from `clang -E` (the two string literals at the end):
```c
#define Y\U00110000Z() "Y\U00110000Z is an identifier"
#define Y\U00110001Z() "Y\U00110001Z is an identifier"
Y\U00110000Z()
YZ()

"Y\U00110001Z is an identifier"
"Y\U00110001Z is an identifier"

```
You can see that `YZ` was defined as a macro twice and invoked twice, once with the bad Unicode escape and once without. Both invocations expand to the second definition.

I don't know what you think the proper response should be, since, as far as I can tell, the C++ standard leaves this as undefined behavior.  I suppose that `Y` should be an identifier by itself, followed by an error token for `\U00110000`, followed by the identifier `Z`.  You decide.  But certainly don't make `YZ` an identifier.
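
As a rough sketch of the tokenization suggested above (this is not clang's lexer; `is_ident_char`, `hex_value`, and `identifier_length` are made-up names, only the 8-digit `\U` form is handled, and the usual rules about which code points may appear in an identifier, or start one, are ignored):

```c
/* Hypothetical helper: 1 if c is an ASCII letter, digit, or underscore. */
int is_ident_char(char c)
{
    const char *p = "_0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
    while (*p != '\0')
        if (*p++ == c)
            return 1;
    return 0;
}

/* Hypothetical helper: value of a hex digit, or -1 if c is not one. */
int hex_value(char c)
{
    const char *lower = "0123456789abcdef";
    const char *upper = "0123456789ABCDEF";
    int i;
    for (i = 0; i != 16; i++)
        if (c == lower[i] || c == upper[i])
            return i;
    return -1;
}

/* Length in chars of the identifier starting at s. A \UXXXXXXXX escape is
   consumed as part of the identifier only if its value is at most 0x10FFFF;
   otherwise the identifier ends just before the backslash. */
int identifier_length(const char *s)
{
    int n = 0;
    for (;;) {
        if (is_ident_char(s[n])) {
            n++;
        } else if (s[n] == '\\' && s[n + 1] == 'U') {
            unsigned long cp = 0;
            int i;
            for (i = 2; i != 10; i++) {
                int d = hex_value(s[n + i]);
                if (d == -1)
                    return n;      /* malformed escape: identifier stops here */
                cp = cp * 16 + (unsigned long)d;
            }
            if (cp > 0x10FFFFUL)
                return n;          /* out of range: not part of the identifier */
            n += 10;               /* backslash, 'U', and 8 hex digits */
        } else {
            return n;
        }
    }
}
```

With this, `identifier_length("Y\\U00110000Z")` stops after `Y`, leaving the bad escape for the caller to report, and `Z` then starts a new identifier.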

The error comes from trying to convert the UTF-32 character 0x110000 to UTF-8. The conversion (rightly) fails and leaves the partial `Y` in the buffer, and the lexer then keeps on lexing with the character `Z`.
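
To illustrate why the conversion fails, here is a minimal UTF-8 encoder sketch. It is not the routine clang uses; `encode_utf8` is a made-up name, and the point is only that a caller has to check the return value instead of carrying on with whatever was already in its buffer:

```c
/* Hypothetical helper: encode one code point as UTF-8 into out.
   Returns the number of bytes written (1 to 4), or 0 if cp is not a Unicode
   scalar value (a surrogate, or above 0x10FFFF). Nothing is written on failure. */
int encode_utf8(unsigned long cp, unsigned char out[4])
{
    /* (cp >> 11) == 0x1B holds exactly for the surrogates 0xD800..0xDFFF */
    if (cp > 0x10FFFFUL || (cp >> 11) == 0x1BUL)
        return 0;
    if (cp > 0xFFFFUL) {               /* 4-byte form */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    if (cp > 0x7FFUL) {                /* 3-byte form */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp > 0x7FUL) {                 /* 2-byte form */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    out[0] = (unsigned char)cp;        /* ASCII */
    return 1;
}
```

Here `encode_utf8(0x110000, buf)` returns 0, while `encode_utf8(0x10FFFF, buf)` returns 4. A lexer that ignores the 0 and simply moves on to the next character ends up gluing `Y` and `Z` together, as described above.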


</pre>