[PATCH] Implemented llvm::sys::locale::columnWidth and isPrint for the case of generic UTF8-capable terminal.

Sat Aug 3 22:26:26 PDT 2013


================
Comment at: unittests/Support/LocaleTest.cpp:38-54
@@ +37,19 @@
+
+  // Invalid UTF-8 strings, columnWidth should be byte count.
+  EXPECT_EQ(1, columnWidth("\344"));
+  EXPECT_EQ(2, columnWidth("\344\270"));
+  EXPECT_EQ(3, columnWidth("\344\270\033"));
+  EXPECT_EQ(3, columnWidth("\344\270\300"));
+  EXPECT_EQ(3, columnWidth("\377\366\355"));
+
+  EXPECT_EQ(5, columnWidth("qwer\344"));
+  EXPECT_EQ(6, columnWidth("qwer\344\270"));
+  EXPECT_EQ(7, columnWidth("qwer\344\270\033"));
+  EXPECT_EQ(7, columnWidth("qwer\344\270\300"));
+  EXPECT_EQ(7, columnWidth("qwer\377\366\355"));
+
+  // UTF-8 sequences longer than 4 bytes correspond to unallocated Unicode
+  // characters.
+  EXPECT_EQ(5, columnWidth("\370\200\200\200\200"));     // U+200000
+  EXPECT_EQ(6, columnWidth("\374\200\200\200\200\200")); // U+4000000
+}
----------------
Jordan Rose wrote:
> Alexander Kornienko wrote:
> > Jordan Rose wrote:
> > > Dmitri Gribenko wrote:
> > > > Dmitri Gribenko wrote:
> > > > > Alexander Kornienko wrote:
> > > > > > Dmitri Gribenko wrote:
> > > > > > > Dmitri Gribenko wrote:
> > > > > > > > Alexander Kornienko wrote:
> > > > > > > > > Dmitri Gribenko wrote:
> > > > > > > > > > As far as I understand, this handling of incorrect UTF-8 is not correct.  As far as I remember, according tot the standard, incorrect UTF and code points should either not be processed at all (=return an error), or invalid subsequences should be replaced with replacement character U+FFFD.
> > > > > > > > > > 
> > > > > > > > > > The interesting part is that it looks like some terminals don't follow this rule.  As far as I remember, gnome-terminal will use the replacement character, and "\370\200\200\200\200" will be rendered in 1 column.  Same for iTerm2 on OS X.  But the built-in Terminal.app displays "?????".
> > > > > > > > > Incorrect UTF-8 sequences can happen in different cases including corrupt input data and when the input string is in a different encoding. Here I assume that the most frequent case is when an 8-bit encoding is used, so it makes sense to fall-back to counting bytes here.
> > > > > > > > > 
> > > > > > > > > And in general, there's no 100% correct way to handle this due to a huge variety of possible implementations: different terminals (and text-editors, if we talk about usage of these functions in clang-format) will use different logic to handle incorrect UTF-8 (and even correct, but a bit more esoteric Unicode features like bidirectional output or complex character combining).
> > > > > > > > While I would agree that there is no way to be consistent with different terminal apps that implement this differently, in my interpretation, Unicode standard specifies that the only two ways to handle an erroneous sequence is to signal an error or use the U+FFFD.  And at least gnome-terminal and iTerm2 are consistent with this.
> > > > > > > > 
> > > > > > > > Conformance requirement C7:
> > > > > > > > If a noncharacter that does not have a specific internal use is unexpectedly encountered in processing, an implementation may signal an error or replace the noncharacter with U+FFFD replacement character. If the implementation chooses to replace, delete or ignore a noncharacter, such an action constitutes a modification in the interpretation of the text.
> > > > > > > > 
> > > > > > > > C10:
> > > > > > > > When a process interprets a code unit sequence which purports to be in a Unicode character encoding form, it shall treat ill-formed code unit sequences as an error condition and shall not interpret such sequences as characters.
> > > > > > > > 
> > > > > > > The U+FFFD insertion algoritm is described on page 96.  According to it, Terminal.app is actually correct to display ????? (and it handles other sequences that I tried correctly and consistently with the recommended algorithm).  iTerm2 is wrong and I'll file a bug to the developers.
> > > > > > > 
> > > > > > So, we can consider current behavior conforming to the standard? ;)
> > > > > > 
> > > > > > Actually, it's not about handling incorrect UTF-8 sequences (columnWidth doesn't strictly require the text to be in UTF-8, even though Clang doesn't support anything else), it's about the fact that we don't know encoding for sure, and for the case of 8-bit encodings this behavior is preferred.
> > > > > > So, we can consider current behavior conforming to the standard? ;)
> > > > > 
> > > > > No.  Consider "\x61\xf1\x80\x80".  Terminal.app prints "a?".
> > > > > 
> > > > > > it's about the fact that we don't know encoding for sure
> > > > > 
> > > > > Terminals in modern Linux and OS X run in UTF-8.  Even if the source is in an 8-bit encoding, Linux and OS X will display it as UTF-8, and we want carets to line up with the displayed source.
> > > > > 
> > > > Actually, here's what we can do to get columnWidth() for invalid UTF-8 cases consistent with the actual column count of the printed line of source code.  We should just do the replacement ourselves before we print a line of source code to ensure that we only ever print valid UTF-8 sequences.  Then, as long as columnWidth() and replacement algorithms agree, the caret will be positioned correctly.
> > > I don't think we should worry too much about non-UTF-8 cases. We could solve this by doing the substitution beforehand, or returning -1 and claiming we can't print this, or by straight-up ignoring the problem -- which is no worse than what we do today. The valid case is much more interesting to get right.
> > The first two users of this functions (TextDiagnostic.cpp and clang-format) will need different error-handling strategies here (clang-format doesn't have the luxury to replace "bad" UTF-8 characters with U+FFFD ;) ), so I think we should just return an error to the caller in case of UTF-8 encoding problem. This will require handling this error in TextDiagnostic.cpp, which I can add along with additional tests for clang diagnostics (which I have in a local change).
> > 
> > What do you think?
> Seems reasonable to me...we have no sensible way to calculate the column width of non-UTF8 strings. Callers will have to decide what to do in that case.
> 
> Dmitri, what do you think?
Looks good to me with this change.  Alexander, thanks for working on this!


http://llvm-reviews.chandlerc.com/D1253