[PATCH] D150843: [clang][Diagnostics] Refactor printableTextForNextCharacter

Thu May 18 14:30:33 PDT 2023

tahonermann requested changes to this revision.
tahonermann added a comment.
This revision now requires changes to proceed.

Splitting this into two patches (one to do the renames, another to perform the other changes) would simplify review.

================
Comment at: clang/lib/Frontend/TextDiagnostic.cpp:118-119

-  unsigned char const *begin, *end;
-  begin = reinterpret_cast<unsigned char const *>(&*(SourceLine.begin() + *i));
-  end = begin + (SourceLine.size() - *i);
-
-  if (llvm::isLegalUTF8Sequence(begin, end)) {
-    llvm::UTF32 c;
-    llvm::UTF32 *cptr = &c;
-    unsigned char const *original_begin = begin;
-    unsigned char const *cp_end =
-        begin + llvm::getNumBytesForUTF8(SourceLine[*i]);
-
-    llvm::ConversionResult res = llvm::ConvertUTF8toUTF32(
-        &begin, cp_end, &cptr, cptr + 1, llvm::strictConversion);
-    (void)res;
-    assert(llvm::conversionOK == res);
-    assert(0 < begin-original_begin
-           && "we must be further along in the string now");
-    *i += begin-original_begin;
-
-    if (!llvm::sys::locale::isPrint(c)) {
-      // If next character is valid UTF-8, but not printable
-      SmallString<16> expandedCP("<U+>");
-      while (c) {
-        expandedCP.insert(expandedCP.begin()+3, llvm::hexdigit(c%16));
-        c/=16;
-      }
-      while (expandedCP.size() < 8)
-        expandedCP.insert(expandedCP.begin()+3, llvm::hexdigit(0));
-      return std::make_pair(expandedCP, false);
-    }
+  const unsigned char *Begin =
+      reinterpret_cast<unsigned char const *>(&*(SourceLine.begin())) + (*I);
+  unsigned CharSize = llvm::getNumBytesForUTF8(*Begin);
----------------

================
Comment at: clang/lib/Frontend/TextDiagnostic.cpp:120
-  begin = reinterpret_cast<unsigned char const *>(&*(SourceLine.begin() + *i));
-  end = begin + (SourceLine.size() - *i);
-
----------------
tbaeder wrote:
> I don't know what this computation of `end` means, but from the debug output I added, it meant that a call to this function always converted the entire line. I think.
It looks to me like an overly complicated way of asking for `SourceLine.bytes_end()`.

================
Comment at: clang/lib/Frontend/TextDiagnostic.cpp:121
+  unsigned CharSize = llvm::getNumBytesForUTF8(*Begin);
+  const unsigned char *End = Begin + CharSize;

----------------
This could assign `End` to a location past the end of the source line when the source line ends with an ill-formed code unit sequence (e.g., a truncated 4-byte sequence). Constructing such a pointer is likely to get one in trouble.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D150843/new/

https://reviews.llvm.org/D150843