[PATCH] D104975: Implement P1949

Sat Jun 26 07:02:45 PDT 2021

cor3ntin added a comment.

In D104975#2842425 <https://reviews.llvm.org/D104975#2842425>, @jfb wrote:

> It would be more user friendly to say which character is not allowed in the diagnostic.

Agreed

> Do we need to have a backwards compatible flag to preserve the old behavior, or do we believe that nobody will be affected by the change? We should make this choice explicitly and state it in the patch description.

I have been wondering about that.
I came to the conclusion it would probably not be worth it. Until fairly recently Unicode identifiers in GCC were not really usable and therefore not used afaik.
I haven't seen people use emojis or other interesting symbol in non toy code.

I'll try to make that clearer in the commit message

================
Comment at: clang/lib/Lex/Lexer.cpp:1462
+          static const llvm::sys::UnicodeCharSet XIDStartChars(XIDStartRanges);
+          return C == '_' || XIDStartChars.contains(C);
+  } else if (LangOpts.C11) {
----------------
Quuxplusone wrote:
> This is overly indented (or your editor snuck in some hard tabs).
> Why is `'_'` treated specially, here and on line 1444 above? Shouldn't it just be included in the `XIDStartChars` set?
This is intentional
C++ treats `_` as valid in identifiers, Unicode doesn't. Putting it in the array with the other characters is a sure-fire way to have someone drop it

================
Comment at: clang/lib/Lex/Lexer.cpp:1503-1511
-  // Check C++98 compatibility.
-  if (!Diags.isIgnored(diag::warn_cxx98_compat_unicode_id, Range.getBegin())) {
-    static const llvm::sys::UnicodeCharSet CXX03AllowedIDChars(
-        CXX03AllowedIDCharRanges);
-    if (!CXX03AllowedIDChars.contains(C)) {
-      Diags.Report(Range.getBegin(), diag::warn_cxx98_compat_unicode_id)
-        << Range;
----------------
Quuxplusone wrote:
> Why remove this diagnostic? This looks unintentional.
> 
> But if it is intentional, then besides justifying it in the commit message (and splitting out this removal into a separate PR), you should remove the enumerator from `DiagnosticLexKinds.td`.
The diagnostic should be removed indeed.
Because the PR applies the paper as a DR to all versions of C++, it is no longer useful to diagnostic changes between c++ versions

================
Comment at: clang/test/Lexer/unicode.c:31
+extern int 🌹; // expected-error {{character not allowed in identifier}} expected-warning {{declaration does not declare anything}}
+int v＝［＝］（auto）｛return～x；｝（）;;  // expected-error 12{{character not allowed in identifier}} expected-error {{expected ';'}}
+// expected-error at -1 {{expected unqualified-id}}
----------------
Quuxplusone wrote:
> Why 12 errors? Are these `{` `(` etc. characters not the ASCII versions? If so, this test is needlessly confusing, and should be rewritten with one character per test case, as in lines 43-53 below. (Except line 46, which is also bad. :))
> 
> I don't understand why you're disabling the existing tests. In particular, line 43 tests the error message for GREEK QUESTION MARK, which is important to preserve.
Having 2 sets of completely different diagnostics (for C++ and C) proved a lot more complicated than doing two separate tests.
The `GREEK QUESTION MARK` is no longer important to preserve - it tests for confusing characters - which cannot happen in C++.
I agree that the test above is not super meaningful.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D104975/new/

https://reviews.llvm.org/D104975