[PATCH] Implemented llvm::sys::locale::columnWidth and isPrint for the case of generic UTF8-capable terminal.

Jordan Rose jordan_rose at apple.com
Fri Aug 2 09:02:14 PDT 2013


  I'm fine with this, though I haven't actually tested it on a Mac to see if we should bother replacing the system function calls. You and Dmitri should come to some kind of agreement before this goes in, of course.


================
Comment at: lib/Support/LocaleGeneric.inc:40
@@ +39,3 @@
+//   * surrogates (category = Cs).
+bool isPrint(int UCS) {
+  // Sorted list of non-overlapping intervals of code points that are not
----------------
Why 'int'? Why not uint32_t?
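
For illustration, here is a minimal sketch of the interval-table classification the patch implements, with the uint32_t signature suggested above (the table is abbreviated and the names are illustrative, not the patch's):

  #include <algorithm>
  #include <cstdint>
  #include <iterator>

  namespace {

  // A closed range [Lower, Upper] of code points.
  struct Interval {
    uint32_t Lower, Upper;
  };

  // Abbreviated stand-in for the sorted, non-overlapping table of
  // non-printable ranges (the real table also covers Cf, default
  // ignorable code points, etc.).
  const Interval NonPrintables[] = {
    {0x0000, 0x001F}, // C0 control characters
    {0x007F, 0x009F}, // DEL and the C1 control characters
    {0xD800, 0xDFFF}, // surrogates (category Cs)
  };

  bool isPrint(uint32_t UCS) {
    // Find the first interval whose upper bound is >= UCS; the code point
    // is non-printable iff it also lies at or above that interval's lower
    // bound.
    const Interval *I = std::lower_bound(
        std::begin(NonPrintables), std::end(NonPrintables), UCS,
        [](const Interval &IV, uint32_t C) { return IV.Upper < C; });
    return I == std::end(NonPrintables) || UCS < I->Lower;
  }

  } // end anonymous namespace

A sorted table keeps the lookup logarithmic no matter how many Unicode ranges end up in it.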

================
Comment at: lib/Support/LocaleGeneric.inc:29-39
@@ +28,13 @@
+
+// Returns false if a character is not likely to be displayed correctly on the
+// terminal. The exact implementation would depend on the specific terminal, so
+// we define semantics suitable for the generic case of a terminal capable of
+// outputting Unicode characters.
+// All characters in the Unicode code point range are considered printable,
+// except for:
+//   * C0 and C1 control character ranges;
+//   * default ignorable code points as per 5.21 of
+//     http://www.unicode.org/versions/Unicode6.2.0/UnicodeStandard-6.2.pdf
+//   * format characters (category = Cf);
+//   * surrogates (category = Cs).
+bool isPrint(int UCS) {
----------------
Doxygen comment, even for internal functions?
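
For reference, a Doxygen version of the comment above might look like this (a sketch of the style only, not proposed wording):

  /// Determines whether a character is likely to be displayed correctly on
  /// the terminal.
  ///
  /// All characters in the Unicode code point range are considered printable,
  /// except for control characters, default ignorable code points, format
  /// characters (Cf), and surrogates (Cs).
  ///
  /// \param UCS the Unicode code point to classify.
  /// \returns true if \p UCS is likely to be printable.
  bool isPrint(int UCS);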

================
Comment at: lib/Support/LocaleGeneric.inc:179
@@ +178,3 @@
+
+static int columnWidthUTF8(StringRef Text) {
+  unsigned ColumnWidth = 0;
----------------
I feel weird overloading the return value to be both the actual value and an error code, but I can't really think of anything better. We don't have a good Either type to make up for this.
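
To make the concern concrete, here is a hypothetical sketch of the function's shape, assuming the validation helpers from llvm/Support/ConvertUTF.h; decodeUTF8 and charColumnWidth are stand-ins for the patch's internals, and the exact error conditions are assumptions, not the patch's actual behavior:

  #include "llvm/ADT/StringRef.h"
  #include "llvm/Support/ConvertUTF.h"

  // Hypothetical helpers: decode one already-validated UTF-8 sequence into
  // a code point, and map a code point to its display width (negative for
  // non-printable characters).
  UTF32 decodeUTF8(const UTF8 *Ptr, unsigned Len);
  int charColumnWidth(UTF32 CodePoint);

  static int columnWidthUTF8(llvm::StringRef Text) {
    unsigned ColumnWidth = 0;
    const UTF8 *Ptr = reinterpret_cast<const UTF8 *>(Text.begin());
    const UTF8 *End = reinterpret_cast<const UTF8 *>(Text.end());
    while (Ptr != End) {
      unsigned Len = getNumBytesForUTF8(*Ptr);
      if (Ptr + Len > End || !isLegalUTF8Sequence(Ptr, Ptr + Len))
        return Text.size(); // Invalid UTF-8: fall back to the byte count
                            // (see the unit tests below).
      int CharWidth = charColumnWidth(decodeUTF8(Ptr, Len));
      if (CharWidth < 0)
        return -1; // Width and error share the return channel; only the
                   // sign tells the caller which one it got.
      ColumnWidth += CharWidth;
      Ptr += Len;
    }
    return ColumnWidth;
  }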

================
Comment at: unittests/Support/LocaleTest.cpp:38-54
@@ +37,19 @@
+
+  // Invalid UTF-8 strings, columnWidth should be byte count.
+  EXPECT_EQ(1, columnWidth("\344"));
+  EXPECT_EQ(2, columnWidth("\344\270"));
+  EXPECT_EQ(3, columnWidth("\344\270\033"));
+  EXPECT_EQ(3, columnWidth("\344\270\300"));
+  EXPECT_EQ(3, columnWidth("\377\366\355"));
+
+  EXPECT_EQ(5, columnWidth("qwer\344"));
+  EXPECT_EQ(6, columnWidth("qwer\344\270"));
+  EXPECT_EQ(7, columnWidth("qwer\344\270\033"));
+  EXPECT_EQ(7, columnWidth("qwer\344\270\300"));
+  EXPECT_EQ(7, columnWidth("qwer\377\366\355"));
+
+  // UTF-8-like sequences of more than 4 bytes would encode code points
+  // beyond the Unicode codespace (above U+10FFFF) and are ill-formed.
+  EXPECT_EQ(5, columnWidth("\370\200\200\200\200"));     // U+200000
+  EXPECT_EQ(6, columnWidth("\374\200\200\200\200\200")); // U+4000000
+}
----------------
Dmitri Gribenko wrote:
> Dmitri Gribenko wrote:
> > Alexander Kornienko wrote:
> > > Dmitri Gribenko wrote:
> > > > Dmitri Gribenko wrote:
> > > > > Alexander Kornienko wrote:
> > > > > > Dmitri Gribenko wrote:
> > > > > > > As far as I understand, this handling of incorrect UTF-8 is not correct.  As far as I remember, according to the standard, ill-formed UTF-8 sequences and invalid code points should either not be processed at all (= return an error), or the invalid subsequences should be replaced with the replacement character U+FFFD.
> > > > > > > 
> > > > > > > The interesting part is that it looks like some terminals don't follow this rule.  As far as I remember, gnome-terminal will use the replacement character, and "\370\200\200\200\200" will be rendered in 1 column.  Same for iTerm2 on OS X.  But the built-in Terminal.app displays "?????".
> > > > > > Incorrect UTF-8 sequences can occur in a number of cases, including corrupt input data and input strings that are in a different encoding. Here I assume that the most frequent such case is an 8-bit encoding being used, so it makes sense to fall back to counting bytes.
> > > > > > 
> > > > > > And in general, there's no 100% correct way to handle this, due to the huge variety of possible implementations: different terminals (and text editors, if we consider the use of these functions in clang-format) will use different logic to handle incorrect UTF-8 (and even correct but more esoteric Unicode features, like bidirectional output or complex character combining).
> > > > > While I would agree that there is no way to be consistent with different terminal apps that implement this differently, in my interpretation the Unicode standard specifies that the only two ways to handle an erroneous sequence are to signal an error or to use U+FFFD.  And at least gnome-terminal and iTerm2 are consistent with this.
> > > > > 
> > > > > Conformance requirement C7:
> > > > > If a noncharacter that does not have a specific internal use is unexpectedly encountered in processing, an implementation may signal an error or replace the noncharacter with U+FFFD replacement character. If the implementation chooses to replace, delete or ignore a noncharacter, such an action constitutes a modification in the interpretation of the text.
> > > > > 
> > > > > C10:
> > > > > When a process interprets a code unit sequence which purports to be in a Unicode character encoding form, it shall treat ill-formed code unit sequences as an error condition and shall not interpret such sequences as characters.
> > > > > 
> > > > The U+FFFD insertion algorithm is described on page 96.  According to it, Terminal.app is actually correct to display ????? (and it handles the other sequences I tried correctly and consistently with the recommended algorithm).  iTerm2 is wrong, and I'll file a bug with the developers.
> > > > 
> > > So, we can consider current behavior conforming to the standard? ;)
> > > 
> > > Actually, it's not about handling incorrect UTF-8 sequences (columnWidth doesn't strictly require the text to be in UTF-8, even though Clang doesn't support anything else); it's about the fact that we don't know the encoding for sure, and for the case of 8-bit encodings this behavior is preferable.
> > > So, we can consider current behavior conforming to the standard? ;)
> > 
> > No.  Consider "\x61\xf1\x80\x80".  Terminal.app prints "a?".
> > 
> > > it's about the fact that we don't know the encoding for sure
> > 
> > Terminals in modern Linux and OS X run in UTF-8.  Even if the source is in an 8-bit encoding, Linux and OS X will display it as UTF-8, and we want carets to line up with the displayed source.
> > 
> Actually, here's what we can do to make columnWidth() for invalid UTF-8 consistent with the actual column count of the printed line of source code: do the replacement ourselves before we print a line of source code, so that we only ever print valid UTF-8 sequences. Then, as long as columnWidth() and the replacement algorithm agree, the caret will be positioned correctly.
I don't think we should worry too much about non-UTF-8 cases. We could solve this by doing the substitution beforehand, or returning -1 and claiming we can't print this, or by straight-up ignoring the problem -- which is no worse than what we do today. The valid case is much more interesting to get right.
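
A sketch of that substitution, assuming the validation helpers from llvm/Support/ConvertUTF.h. The per-byte resynchronization below is a deliberate simplification; the standard's recommended practice replaces each maximal ill-formed subpart with a single U+FFFD, as in the "a?" example above:

  #include "llvm/ADT/SmallVector.h"
  #include "llvm/ADT/StringRef.h"
  #include "llvm/Support/ConvertUTF.h"

  // Rewrite Input so it contains only well-formed UTF-8, replacing each
  // offending byte with U+FFFD. As long as columnWidth() performs the same
  // substitution, the caret and the printed text stay in sync.
  static void sanitizeUTF8(llvm::StringRef Input,
                           llvm::SmallVectorImpl<char> &Out) {
    const UTF8 *Ptr = reinterpret_cast<const UTF8 *>(Input.begin());
    const UTF8 *End = reinterpret_cast<const UTF8 *>(Input.end());
    while (Ptr != End) {
      unsigned Len = getNumBytesForUTF8(*Ptr);
      if (Ptr + Len <= End && isLegalUTF8Sequence(Ptr, Ptr + Len)) {
        Out.append(reinterpret_cast<const char *>(Ptr),
                   reinterpret_cast<const char *>(Ptr) + Len);
        Ptr += Len;
      } else {
        // U+FFFD REPLACEMENT CHARACTER, encoded in UTF-8 as EF BF BD.
        Out.push_back('\xEF');
        Out.push_back('\xBF');
        Out.push_back('\xBD');
        ++Ptr; // Resynchronize at the next byte.
      }
    }
  }

Doing the substitution once, just before emitting the source line, keeps the policy in a single place that both the printer and columnWidth() can share.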


http://llvm-reviews.chandlerc.com/D1253


