[PATCH] D123064: [Clang][C++23] P2071 Named universal character escapes

Mon Apr 11 09:08:37 PDT 2022

aaron.ballman added reviewers: aaron.ballman, tahonermann, clang-language-wg.
aaron.ballman added subscribers: tstellar, lattner.
aaron.ballman added a comment.

Thank you for this functionality! I did a pretty quick pass over it and have some comments.

================
Comment at: clang/include/clang/Basic/DiagnosticLexKinds.td:130-131

 def ext_delimited_escape_sequence : Extension<
-  "delimited escape sequences are a Clang extension">,
+  "%select{delimited|named}0 escape sequences are a Clang extension">,
   InGroup<DiagGroup<"delimited-escape-sequence-extension">>;
----------------
cor3ntin wrote:
> tahonermann wrote:
> > I don't see much value in combining these diagnostics since these are distinct features. The `ext_delimited_escape_sequence` name seems odd for named escape sequences too (even if both features use `{` and `}` as delimiters).
> I'm used to @aaron.ballman encouraging me to keep the number of diagnostics under control, but I'm fine keeping them separate 
While Tom's right that they're distinct features, diagnostics aren't tied directly to a feature in Clang; we try to reuse existing diagnostic text as much as possible. My preference is to keep them combined so we reduce the number of strings we build into the executable (and we make life easier for eventual localization of the diagnostics).

================
Comment at: clang/lib/Lex/Lexer.cpp:3139
       Diag(SlashLoc, diag::warn_ucn_not_valid_in_c89);
-    return 0;
+    return {};
   }
----------------
FWIW, I think using `llvm::None` instead of `{}` is more clear to readers (I doubt we're consistent with this in the code base though).

(Same comment applies other places that we're making an empty `Optional`.)

================
Comment at: clang/lib/Lex/Lexer.cpp:3226
+llvm::Optional<uint32_t> Lexer::tryReadNamedUCN(const char *&StartPtr,
+                                                const char *, Token *Result) {
+  unsigned CharSize;
----------------

================
Comment at: clang/lib/Lex/Lexer.cpp:3237-3239
+  if (C != '{') {
+    if (Diagnose)
+      Diag(StartPtr, diag::warn_ucn_escape_incomplete);
----------------
Is this a case where we might want a fixit because the user did `\N` when they meant to do `\n`?

(Should we look for `\N(` and fixit that to `\N{`?)

================
Comment at: clang/lib/Lex/Lexer.cpp:3290
+    // and we should not make invalid suggestions.
+  }
+
----------------
I think it'd be more clear to set `Res = LooseMatch->CodePoint;` before exiting here so that after this block, you can always assume there's a valid code point (this helps if we later want to add more logic after this point, too).

================
Comment at: clang/lib/Lex/Lexer.cpp:3292-3294
+  if (Diagnose && PP && !LooseMatch) {
+    Diag(BufferPtr, diag::ext_delimited_escape_sequence) << /*named*/ 1;
+  }
----------------

================
Comment at: clang/lib/Lex/Lexer.cpp:3305
+  }
+  return Res ? *Res : LooseMatch->CodePoint;
+}
----------------
Then this can become `return *Res;`

================
Comment at: clang/lib/Lex/Lexer.cpp:3315-3317
+    CodePointOpt = tryReadNumericUCN(StartPtr, SlashLoc, Result);
+
+  else if (Kind == 'N')
----------------

================
Comment at: clang/lib/Lex/Lexer.cpp:3325-3327
+  if (Result) {
+    Result->setFlag(Token::HasUCN);
+  }
----------------
I think this should live inside the other `tryRead*` functions once we know the token is definitely a UCN, in case someone finds a reason to want to call one of those concretely but not call `tryReadUCN()` to do so.

================
Comment at: clang/lib/Lex/LiteralSupport.cpp:238
+             diag::err_delimited_escape_missing_brace)
+            << std::string(1, 'o');

----------------
Can you use `"o"` instead of constructing a `std::string` for it?

================
Comment at: clang/lib/Lex/LiteralSupport.cpp:362
+      assert(Delim != Input.end());
+      auto Res = llvm::sys::unicode::nameToCodepointLooseMatching(
+          StringRef(I, std::distance(I, Delim)));
----------------
Please spell out the type for `Res` as it's not obvious from context what it is.

================
Comment at: clang/lib/Lex/LiteralSupport.cpp:400-406
   const char *UcnBegin = ThisTokBuf;
-
   // Skip the '\u' char's.
   ThisTokBuf += 2;
-
-  bool Delimited = false;
-  bool EndDelimiterFound = false;
   bool HasError = false;

+  Delimited = false;
+  bool EndDelimiterFound = false;
----------------
Might as well clean this up a bit more

================
Comment at: clang/lib/Lex/LiteralSupport.cpp:488
+
+  auto Res = llvm::sys::unicode::nameToCodepointLooseMatching(Name);
+  if (Res) {
----------------
Please spell the type out.

================
Comment at: clang/lib/Lex/LiteralSupport.cpp:500
+  unsigned Distance = 0;
+  auto Matches = llvm::sys::unicode::nearestMatchesForCodepointName(Name, 5);
+  assert(!Matches.empty() && "No unicode characters found");
----------------
Same here.

================
Comment at: clang/lib/Lex/LiteralSupport.cpp:501
+  auto Matches = llvm::sys::unicode::nearestMatchesForCodepointName(Name, 5);
+  assert(!Matches.empty() && "No unicode characters found");
+
----------------
Just double checking that the assertion here is valid and the function can never return an empty set.

================
Comment at: clang/lib/Lex/LiteralSupport.cpp:510
+
+    std::string s;
+    llvm::UTF32 V = Match.Value;
----------------

================
Comment at: clang/lib/Lex/LiteralSupport.cpp:564
+  ThisTokBuf = ClosingBrace + 1;
+  auto Res = llvm::sys::unicode::nameToCodepointStrict(Name);
+  if (!Res) {
----------------
Spell out the type.

================
Comment at: clang/lib/Lex/LiteralSupport.cpp:593-597
+  } else
+    HasError =
+        !ProcessNumericUCNEscape(ThisTokBegin, ThisTokBuf, ThisTokEnd, UcnVal,
+                                 UcnLen, IsDelimitedEscapeSequence, Loc, Diags,
+                                 Features, in_char_string_literal);
----------------
May as well be consistent with the first branch, esp given that this is a multiline substatement.

================
Comment at: clang/test/Lexer/char-escapes-delimited.c:85
+  char d = '\N{}';          // expected-error {{delimited escape sequence cannot be empty}}
+  char e = '\N{';           // expected-error {{incomplete universal character name}}
+
----------------
Here's another fun test:
```
char trigraphs = '\N??<DOLLAR SIGN??>`;
```
and we should test that this magic works in the preprocessor:
```
if `\N{LATIN SMALL LETTER A}` != 'a'
#error "oh no!"
#endif
```

Might be worthwhile to have a preprocessor test to show that you can't form a named UCN through token concatenation.

================
Comment at: clang/test/Lexer/char-escapes-delimited.c:94
+  unsigned i = u'\N{GREEK CAPITAL LETTER DELTA}'; // expected-warning {{extension}}
+  char j = '\NN';                                 // expected-error {{expected '{' after '\N' escape sequence}}
+  unsigned k = u'\N{LOTUS';                       // expected-error {{incomplete universal character name}}
----------------
This has the potential to break user code that was doing `\N`: https://godbolt.org/z/rahbsssPo However, it's consistent with how we handle `\U`, `\x`, and `\u`, so it's defensible.

================
Comment at: llvm/include/llvm/ADT/StringExtras.h:329

+std::string to_hexString(uint64_t Value, bool UpperCase = true);
+
----------------
`utohexstr()` already exists on line 152 -- any reason we can't reuse that?

================
Comment at: llvm/include/llvm/Support/Unicode.h:19
+#include "llvm/ADT/SmallString.h"
+#include <vector>
+
----------------
No need to include vector?

================
Comment at: llvm/include/llvm/Support/Unicode.h:77-94
+llvm::Optional<char32_t> nameToCodepointStrict(StringRef Name);
+
+struct LooseMatchingResult {
+  char32_t CodePoint;
+  llvm::SmallString<64> Name;
+};
+
----------------
Already in the `llvm` namespace so no need for fully qualified names.

================
Comment at: llvm/lib/Support/UnicodeNameToCodepoint.cpp:31-44
+struct node {
+  bool is_root = false;
+  char32_t value = 0xFFFFFFFF;
+  uint32_t children_offset = 0;
+  bool has_sibling = false;
+  uint32_t size = 0;
+  StringRef name = {};
----------------
You should rename things to match the usual coding conventions.

================
Comment at: llvm/lib/Support/UnicodeNameToCodepoint.cpp:37
+  uint32_t size = 0;
+  StringRef name = {};
+  const node *parent = nullptr;
----------------

================
Comment at: llvm/lib/Support/UnicodeNameToCodepoint.cpp:49
+    std::string s;
+    s.reserve(64);
+    auto n = this;
----------------
Any particular reason for 64?

================
Comment at: llvm/lib/Support/UnicodeNameToCodepoint.cpp:50
+    s.reserve(64);
+    auto n = this;
+    while (n) {
----------------
(not like `auto` vs `node` saves you much typing.)

================
Comment at: llvm/lib/Support/UnicodeNameToCodepoint.cpp:72-80
+  const uint32_t origin = offset;
+  node n;
+  n.parent = parent;
+  uint8_t nameInfo = UnicodeNameToCodepointIndex[offset++];
+  if (offset + 6 >= UnicodeNameToCodepointIndexSize)
+    return n;
+
----------------
(We don't typically use top-level const qualification.)

================
Comment at: llvm/lib/Support/UnicodeNameToCodepoint.cpp:187-189
+  if (!DoesStartWith) {
+    return {n, false, 0};
+  }
----------------

================
Comment at: llvm/lib/Support/UnicodeNameToCodepoint.cpp:191-193
+  if (name.size() - Consummed == 0 && n.value != 0xFFFFFFFF) {
+    return {n, true, n.value};
+  }
----------------

================
Comment at: llvm/lib/Support/UnicodeNameToCodepoint.cpp:195
+  if (n.has_children()) {
+    auto o = n.children_offset;
+    for (;;) {
----------------
Please spell out the type (I'll stop commenting on the style nits -- you should do a pass for all that stuff at some point).

================
Comment at: llvm/lib/Support/UnicodeNameToCodepoint.cpp:257
+constexpr const char32_t SBase = 0xAC00;
+// constexpr const char32_t LBase = 0x1100;
+// constexpr const char32_t VBase = 0x1161;
----------------
Plan to remove the commented out code?

================
Comment at: llvm/lib/Support/UnicodeNameToCodepoint.cpp:364
+    unsigned long long v = 0;
+    // Be consistent about mandating uper casing
+    if (Strict &&
----------------

================
Comment at: llvm/lib/Support/UnicodeNameToCodepoint.cpp:379-384
+static llvm::Optional<char32_t> nameToCodepoint(StringRef name, bool Strict,
+                                                BufferType &Buffer) {
+
+  if (name.empty())
+    return {};
+  llvm::Optional<char32_t> Res = nameToHangulCodePoint(name, Strict, Buffer);
----------------

================
Comment at: llvm/lib/Support/UnicodeNameToCodepoint.cpp:409-411
+llvm::Optional<char32_t> nameToCodepointStrict(StringRef name) {
+
+  BufferType Buffer;
----------------

================
Comment at: llvm/lib/Support/UnicodeNameToCodepointGenerated.cpp:4
+// This file was generated using ./bin/UnicodeNameMappingGenerator. Do not edit
+// manually.//
+//===----------------------------------------------------------------------===//
----------------
Looks like the generator needs to add some whitespace or drop the trailing //.

================
Comment at: llvm/utils/UnicodeData/UnicodeNameMappingGenerator.cpp:17-18
+// List of generated names
+// Should be kept in sync with Unicode
+// "Name Derivation Rule Prefix String".
+static bool generated(char32_t c) {
----------------
Do we have something more direct to point users towards?

================
Comment at: llvm/utils/UnicodeData/UnicodeNameMappingGenerator.cpp:339-342
+  printf("NameAliases.txt can be found at "
+         "https://unicode.org/Public/14.0.0/ucd/NameAliases.txt\n"
+         "UnicodeData.txt can be found at "
+         "https://unicode.org/Public/14.0.0/ucd/UnicodeData.txt\n\n");
----------------
This raises a bit of a question about licensing -- we're making a derivative work from Unicode, so I think https://www.unicode.org/license.txt kicks in and we need to update our own license.txt files accordingly.

@tstellar @lattner -- can one of you weigh in on this? (tl;dr: we use this new tool to download data from Unicode.org and use it to generate llvm/lib/Support/UnicodeNameToCodepointGenerated.cpp which is committed into the source tree. The feature being implemented here is a C++23 feature and we cannot implement it without this data.)

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D123064/new/

https://reviews.llvm.org/D123064