[PATCH] D76291: [Support] Fix formatted_raw_ostream for UTF-8
Kristof Beyls via Phabricator via cfe-commits
cfe-commits at lists.llvm.org
Mon Jun 22 04:48:34 PDT 2020
kristof.beyls added a comment.
Herald added a project: LLVM.
This looks fine to me; I just have a number of nit picks.
The only part where I don't understand the code logic is around the comment starting with "If this is the final byte of a multi-byte sequence".
================
Comment at: llvm/include/llvm/Support/FormattedStream.h:23-26
/// formatted_raw_ostream - A raw_ostream that wraps another one and keeps track
/// of line and column position, allowing padding out to specific column
/// boundaries and querying the number of lines written to the stream.
///
----------------
I think it would be useful to add a description to this comment that:
(a) This class assumes that a UTF-8 encoding is used on the Stream; and
(b) "This doesn't attempt to handle everything unicode can do (combining characters, right-to-left markers, ...), but hopefully covers most things likely to be common in messages and source code we might want to print."
================
Comment at: llvm/lib/Support/FormattedStream.cpp:30
- // Keep track of the current column and line by scanning the string for
- // special characters
- for (const char *End = Ptr + Size; Ptr != End; ++Ptr) {
- ++Column;
- switch (*Ptr) {
+ auto ProcessCodePoint = [&Line, &Column](StringRef CP) {
+ int Width = sys::locale::columnWidth(CP);
----------------
<bikeshedding mode>
Given that ProcessCodePoint assumes that the Unicode code point represented in the UTF-8 encoding, maybe it be slightly better to name the lambda as ProcessUTF8CodePoint rather than ProcessCodePoint?
</bikeshedding mode>
================
Comment at: llvm/lib/Support/FormattedStream.cpp:31
+ auto ProcessCodePoint = [&Line, &Column](StringRef CP) {
+ int Width = sys::locale::columnWidth(CP);
+ // columnWidth returns -1 for non-printing characters.
----------------
I'm wondering if using sys::unicode::columnWidthUTF8 instead of sys::locale::columnWidth would be more future-proof and more clearly describe the intent that this function only processes UTF-8 and not strings encoded in other encodings?
================
Comment at: llvm/lib/Support/FormattedStream.cpp:33
+ // columnWidth returns -1 for non-printing characters.
+ if (Width != -1)
+ Column += Width;
----------------
The documentation for sys::unicode::columnWidthUTF8 documents it returns ErrorNonPrintableCharacter (-1) if the text contains non-printable characters.
Maybe it's more self-documenting to compare against ErrorNonPrintableCharacter rather than -1 in the above if condition?
================
Comment at: llvm/lib/Support/FormattedStream.cpp:36-37
+
+ // If this is the final byte of a multi-byte sequence, it can't be any of
+ // the special whitespace characters below.
+ if (CP.size() > 1)
----------------
Reading through the code linearly from the top to the bottom, I'm a bit surprised by this comment.
I would expect CP to contain exactly the bytes that when interpreted as a UTF-8 encoded Unicode character, form exactly one Unicode character.
Therefore, I'm not sure how to interpret "If this is the final byte of a multi-byte sequence.".
I'm expecting "this" to refer to "CP" in this context. But that cannot be "just" the final byte of a multi-byte sequence, unless my assumption that CP contains exactly the bytes forming a single UTF-8 encoded Unicode character is wrong.
Could CP contain a partial UTF-8 encoded character? If so, maybe it'd be better to change the name ProcessCodePoint to something that suggests that could be possible?
================
Comment at: llvm/unittests/Support/formatted_raw_ostream_test.cpp:134
+
+ // This character encodes as three bytes, so will cause the buffer to be
+ // flushed after the first byte (4 byte buffer, 3 bytes already written). We
----------------
I guess "This" refers to \u2468? If so, it'd be easier to read this comment if it was written like: "// \u2468 encodes as three bytes, ..."
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D76291/new/
https://reviews.llvm.org/D76291
More information about the cfe-commits
mailing list