[PATCH] D76291: [Support] Fix formatted_raw_ostream for UTF-8

Kristof Beyls via Phabricator via cfe-commits cfe-commits at lists.llvm.org
Mon Jun 22 04:48:34 PDT 2020


kristof.beyls added a comment.
Herald added a project: LLVM.

This looks fine to me; I just have a number of nit picks.
The only part where I don't understand the code logic is around the comment starting with "If this is the final byte of a multi-byte sequence".



================
Comment at: llvm/include/llvm/Support/FormattedStream.h:23-26
 /// formatted_raw_ostream - A raw_ostream that wraps another one and keeps track
 /// of line and column position, allowing padding out to specific column
 /// boundaries and querying the number of lines written to the stream.
 ///
----------------
I think it would be useful to add a description to this comment that:
(a) This class assumes that a UTF-8 encoding is used on the Stream; and
(b) "This doesn't attempt to handle everything unicode can do (combining characters, right-to-left markers, ...), but hopefully covers most things likely to be common in messages and source code we might want to print."


================
Comment at: llvm/lib/Support/FormattedStream.cpp:30
 
-  // Keep track of the current column and line by scanning the string for
-  // special characters
-  for (const char *End = Ptr + Size; Ptr != End; ++Ptr) {
-    ++Column;
-    switch (*Ptr) {
+  auto ProcessCodePoint = [&Line, &Column](StringRef CP) {
+    int Width = sys::locale::columnWidth(CP);
----------------
<bikeshedding mode>
Given that ProcessCodePoint assumes that the Unicode code point represented in the UTF-8 encoding, maybe it be slightly better to name the lambda as ProcessUTF8CodePoint rather than ProcessCodePoint?
</bikeshedding mode>


================
Comment at: llvm/lib/Support/FormattedStream.cpp:31
+  auto ProcessCodePoint = [&Line, &Column](StringRef CP) {
+    int Width = sys::locale::columnWidth(CP);
+    // columnWidth returns -1 for non-printing characters.
----------------
I'm wondering if using sys::unicode::columnWidthUTF8 instead of sys::locale::columnWidth would be more future-proof and more clearly describe the intent that this function only processes UTF-8 and not strings encoded in other encodings?


================
Comment at: llvm/lib/Support/FormattedStream.cpp:33
+    // columnWidth returns -1 for non-printing characters.
+    if (Width != -1)
+      Column += Width;
----------------
The documentation for sys::unicode::columnWidthUTF8 documents it returns ErrorNonPrintableCharacter (-1) if the text contains non-printable characters.
Maybe it's more self-documenting to compare against ErrorNonPrintableCharacter rather than -1 in the above if condition?


================
Comment at: llvm/lib/Support/FormattedStream.cpp:36-37
+
+    // If this is the final byte of a multi-byte sequence, it can't be any of
+    // the special whitespace characters below.
+    if (CP.size() > 1)
----------------
Reading through the code linearly from the top to the bottom,  I'm a bit surprised by this comment.
I would expect CP to contain exactly the bytes that when interpreted as a UTF-8 encoded Unicode character, form exactly one Unicode character.
Therefore, I'm not sure how to interpret "If this is the final byte of a multi-byte sequence.".
I'm expecting "this" to refer to "CP" in this context. But that cannot be "just" the final byte of a multi-byte sequence, unless my assumption that CP contains exactly the bytes forming a single UTF-8 encoded Unicode character is wrong.
Could CP contain a partial UTF-8 encoded character? If so, maybe it'd be better to change the name ProcessCodePoint to something that suggests that could be possible?


================
Comment at: llvm/unittests/Support/formatted_raw_ostream_test.cpp:134
+
+  // This character encodes as three bytes, so will cause the buffer to be
+  // flushed after the first byte (4 byte buffer, 3 bytes already written). We
----------------
I guess "This" refers to \u2468? If so, it'd be easier to read this comment if it was written like: "// \u2468 encodes as three bytes, ..."



Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D76291/new/

https://reviews.llvm.org/D76291





More information about the cfe-commits mailing list