[PATCH] Implemented llvm::sys::locale::columnWidth and isPrint for the case of generic UTF8-capable terminal.

Dmitri Gribenko gribozavr at gmail.com
Fri Aug 2 00:37:08 PDT 2013



================
Comment at: unittests/Support/LocaleTest.cpp:38-54
@@ +37,19 @@
+
+  // Invalid UTF-8 strings, columnWidth should be byte count.
+  EXPECT_EQ(1, columnWidth("\344"));
+  EXPECT_EQ(2, columnWidth("\344\270"));
+  EXPECT_EQ(3, columnWidth("\344\270\033"));
+  EXPECT_EQ(3, columnWidth("\344\270\300"));
+  EXPECT_EQ(3, columnWidth("\377\366\355"));
+
+  EXPECT_EQ(5, columnWidth("qwer\344"));
+  EXPECT_EQ(6, columnWidth("qwer\344\270"));
+  EXPECT_EQ(7, columnWidth("qwer\344\270\033"));
+  EXPECT_EQ(7, columnWidth("qwer\344\270\300"));
+  EXPECT_EQ(7, columnWidth("qwer\377\366\355"));
+
+  // UTF-8 sequences longer than 4 bytes correspond to unallocated Unicode
+  // characters.
+  EXPECT_EQ(5, columnWidth("\370\200\200\200\200"));     // U+200000
+  EXPECT_EQ(6, columnWidth("\374\200\200\200\200\200")); // U+4000000
+}
----------------
Dmitri Gribenko wrote:
> Alexander Kornienko wrote:
> > Dmitri Gribenko wrote:
> > > Dmitri Gribenko wrote:
> > > > Alexander Kornienko wrote:
> > > > > Dmitri Gribenko wrote:
> > > > > > As far as I understand, this handling of incorrect UTF-8 is not correct.  As far as I remember, according tot the standard, incorrect UTF and code points should either not be processed at all (=return an error), or invalid subsequences should be replaced with replacement character U+FFFD.
> > > > > > 
> > > > > > The interesting part is that it looks like some terminals don't follow this rule.  As far as I remember, gnome-terminal will use the replacement character, and "\370\200\200\200\200" will be rendered in 1 column.  Same for iTerm2 on OS X.  But the built-in Terminal.app displays "?????".
> > > > > Incorrect UTF-8 sequences can happen in different cases including corrupt input data and when the input string is in a different encoding. Here I assume that the most frequent case is when an 8-bit encoding is used, so it makes sense to fall-back to counting bytes here.
> > > > > 
> > > > > And in general, there's no 100% correct way to handle this due to a huge variety of possible implementations: different terminals (and text-editors, if we talk about usage of these functions in clang-format) will use different logic to handle incorrect UTF-8 (and even correct, but a bit more esoteric Unicode features like bidirectional output or complex character combining).
> > > > While I would agree that there is no way to be consistent with different terminal apps that implement this differently, in my interpretation, Unicode standard specifies that the only two ways to handle an erroneous sequence is to signal an error or use the U+FFFD.  And at least gnome-terminal and iTerm2 are consistent with this.
> > > > 
> > > > Conformance requirement C7:
> > > > If a noncharacter that does not have a specific internal use is unexpectedly encountered in processing, an implementation may signal an error or replace the noncharacter with U+FFFD replacement character. If the implementation chooses to replace, delete or ignore a noncharacter, such an action constitutes a modification in the interpretation of the text.
> > > > 
> > > > C10:
> > > > When a process interprets a code unit sequence which purports to be in a Unicode character encoding form, it shall treat ill-formed code unit sequences as an error condition and shall not interpret such sequences as characters.
> > > > 
> > > The U+FFFD insertion algoritm is described on page 96.  According to it, Terminal.app is actually correct to display ????? (and it handles other sequences that I tried correctly and consistently with the recommended algorithm).  iTerm2 is wrong and I'll file a bug to the developers.
> > > 
> > So, we can consider current behavior conforming to the standard? ;)
> > 
> > Actually, it's not about handling incorrect UTF-8 sequences (columnWidth doesn't strictly require the text to be in UTF-8, even though Clang doesn't support anything else), it's about the fact that we don't know encoding for sure, and for the case of 8-bit encodings this behavior is preferred.
> > So, we can consider current behavior conforming to the standard? ;)
> 
> No.  Consider "\x61\xf1\x80\x80".  Terminal.app prints "a?".
> 
> > it's about the fact that we don't know encoding for sure
> 
> Terminals in modern Linux and OS X run in UTF-8.  Even if the source is in an 8-bit encoding, Linux and OS X will display it as UTF-8, and we want carets to line up with the displayed source.
> 
Actually, here's what we can do to get columnWidth() for invalid UTF-8 cases consistent with the actual column count of the printed line of source code.  We should just do the replacement ourselves before we print a line of source code to ensure that we only ever print valid UTF-8 sequences.  Then, as long as columnWidth() and replacement algorithms agree, the caret will be positioned correctly.


http://llvm-reviews.chandlerc.com/D1253



More information about the llvm-commits mailing list