[PATCH] D123064: [Clang][C++23] P2071 Named universal character escapes

Wed Jun 15 05:39:53 PDT 2022

aaron.ballman added a comment.

Some likely final coding style nits. The only thing of substance I'm waiting on is an answer to whether we need to update a license file in order to comply with the Unicode license requirements. @tstellar, any chance you can help there?

================
Comment at: clang/lib/Lex/Lexer.cpp:3237-3239
+  if (C != '{') {
+    if (Diagnose)
+      Diag(StartPtr, diag::warn_ucn_escape_incomplete);
----------------
cor3ntin wrote:
> aaron.ballman wrote:
> > Is this a case where we might want a fixit because the user did `\N` when they meant to do `\n`?
> > 
> > (Should we look for `\N(` and fixit that to `\N{`?)
> I think if we wanted to diagnose \n we should also diagnose \U, which we don't do, Maybe a follow up patch, what do you think?
> I can't imagine trying to be smart about `\N(` would be exercised  by many users
Follow-up (if at all) is fine by me!

================
Comment at: llvm/lib/Support/UnicodeNameToCodepoint.cpp:41
+  constexpr bool isValid() const {
+    return Name.size() != 0 || Value == 0xFFFFFFFF;
+  }
----------------

================
Comment at: llvm/lib/Support/UnicodeNameToCodepoint.cpp:45
+
+  std::string FullName() const {
+    std::string s;
----------------

================
Comment at: llvm/lib/Support/UnicodeNameToCodepoint.cpp:120
+
+static bool StartsWith(StringRef Name, StringRef Needle, bool Strict,
+                       std::size_t &Consummed, char &PreviousCharInName,
----------------

================
Comment at: llvm/lib/Support/UnicodeNameToCodepoint.cpp:131
+  }
+  if (Needle.size() == 0)
+    return true;
----------------

================
Comment at: llvm/lib/Support/UnicodeNameToCodepoint.cpp:183
+            BufferType &Buffer, const Node *Parent = nullptr) {
+  auto N = readNode(Offset, Parent);
+  std::size_t Consummed = 0;
----------------

================
Comment at: llvm/lib/Support/UnicodeNameToCodepoint.cpp:222
+// clang-format off
+constexpr const char *const hangul_syllables[][3] = {
+    { "G",  "A",   ""   },
----------------

================
Comment at: llvm/lib/Support/UnicodeNameToCodepoint.cpp:303
+  Name = Name.substr(findSyllable(Name, Strict, NameStart, T, 2));
+  if (L != -1 && V != -1 && T != -1 && Name.size() == 0) {
+    if (!Strict) {
----------------

================
Comment at: llvm/utils/UnicodeData/UnicodeNameMappingGenerator.cpp:1-3
+//===- utils/UnicodeData/UnicodeNameMappingGenerator.cpp - Unicode name data
+// generator -===//
+//-*- C++ -*-===//
----------------
Something looks amiss here -- no idea how we usually handle > 80 col file names, but maybe removing the `utils/UnicodeData/` prefix will suffice?

================
Comment at: llvm/utils/UnicodeData/UnicodeNameMappingGenerator.cpp:31
+
+static const llvm::StringRef LETTERS =
+    " _-ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
----------------

================
Comment at: llvm/utils/UnicodeData/UnicodeNameMappingGenerator.cpp:56
+
+      auto Name =
+          Line.substr(FirstSemiPos + 1, SecondSemiPos - FirstSemiPos - 1);
----------------

================
Comment at: llvm/utils/UnicodeData/UnicodeNameMappingGenerator.cpp:59
+
+      if (Name.size() && Name[0] == '<') {
+        // Ignore ranges of characters, as their name is either absent or
----------------

================
Comment at: llvm/utils/UnicodeData/UnicodeNameMappingGenerator.cpp:67
+      if (IsAliasFile) {
+        auto Kind = Line.substr(SecondSemiPos + 1);
+        if (Kind != "control" && Kind != "correction" && Kind != "alternate")
----------------

================
Comment at: llvm/utils/UnicodeData/UnicodeNameMappingGenerator.cpp:206
+
+      assert(N->Name.size() != 0);
+      Offsets[N] = Offset;
----------------

================
Comment at: llvm/utils/UnicodeData/UnicodeNameMappingGenerator.cpp:349
+
+  auto Out = fopen(argv[3], "w");
+  if (!Out) {
----------------

================
Comment at: llvm/utils/UnicodeData/UnicodeNameMappingGenerator.cpp:360-361
+  for (const std::pair<const char32_t, std::string> &Entry : Entries) {
+    const auto &Codepoint = Entry.first;
+    const auto &Name = Entry.second;
+    // Ignore names which are not valid.
----------------

================
Comment at: llvm/utils/UnicodeData/UnicodeNameMappingGenerator.cpp:379-380
+  std::pair<std::string, std::vector<uint8_t>> Data = T.serialize();
+  const auto &Dict = Data.first;
+  const auto &Tree = Data.second;
+
----------------

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D123064/new/

https://reviews.llvm.org/D123064