[PATCH] Implemented llvm::sys::locale::columnWidth and isPrint for the case of generic UTF8-capable terminal.

Thu Aug 1 12:02:50 PDT 2013

================
Comment at: unittests/Support/LocaleTest.cpp:38-54
@@ +37,19 @@
+
+  // Invalid UTF-8 strings, columnWidth should be byte count.
+  EXPECT_EQ(1, columnWidth("\344"));
+  EXPECT_EQ(2, columnWidth("\344\270"));
+  EXPECT_EQ(3, columnWidth("\344\270\033"));
+  EXPECT_EQ(3, columnWidth("\344\270\300"));
+  EXPECT_EQ(3, columnWidth("\377\366\355"));
+
+  EXPECT_EQ(5, columnWidth("qwer\344"));
+  EXPECT_EQ(6, columnWidth("qwer\344\270"));
+  EXPECT_EQ(7, columnWidth("qwer\344\270\033"));
+  EXPECT_EQ(7, columnWidth("qwer\344\270\300"));
+  EXPECT_EQ(7, columnWidth("qwer\377\366\355"));
+
+  // UTF-8 sequences longer than 4 bytes correspond to unallocated Unicode
+  // characters.
+  EXPECT_EQ(5, columnWidth("\370\200\200\200\200"));     // U+200000
+  EXPECT_EQ(6, columnWidth("\374\200\200\200\200\200")); // U+4000000
+}
----------------
Dmitri Gribenko wrote:
> As far as I understand, this handling of incorrect UTF-8 is not correct.  As far as I remember, according to the standard, incorrect UTF-8 sequences and invalid code points should either not be processed at all (i.e. an error is returned), or the invalid subsequences should be replaced with the replacement character U+FFFD.
> 
> The interesting part is that it looks like some terminals don't follow this rule.  As far as I remember, gnome-terminal will use the replacement character, and "\370\200\200\200\200" will be rendered in 1 column.  Same for iTerm2 on OS X.  But the built-in Terminal.app displays "?????".
Incorrect UTF-8 sequences can occur for several reasons, including corrupt input data and input that is actually in a different encoding. Here I assume the most frequent case is that an 8-bit encoding is in use, so it makes sense to fall back to counting bytes.

And in general, there's no 100% correct way to handle this, because of the huge variety of possible implementations: different terminals (and text editors, if we talk about using these functions in clang-format) use different logic to handle incorrect UTF-8 (and even correct but more esoteric Unicode features such as bidirectional output or complex character combining).
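
To make the fallback concrete, here is a rough sketch of the behavior the tests above exercise. It is not the code from the patch: the names sketchColumnWidth and validSequenceLength are hypothetical, and for brevity the sketch counts one column per valid code point, ignoring wide, combining, and non-printable characters. It only shows that any malformed sequence makes the result the byte count of the whole string.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not part of the patch: returns the length in bytes of
// the well-formed UTF-8 sequence starting at S[I], or 0 if it is malformed.
static std::size_t validSequenceLength(const std::string &S, std::size_t I) {
  unsigned char Lead = static_cast<unsigned char>(S[I]);
  std::size_t Len;
  unsigned CodePoint;
  if (Lead < 0x80)
    return 1; // ASCII
  if ((Lead & 0xE0) == 0xC0) {
    Len = 2;
    CodePoint = Lead & 0x1F;
  } else if ((Lead & 0xF0) == 0xE0) {
    Len = 3;
    CodePoint = Lead & 0x0F;
  } else if ((Lead & 0xF8) == 0xF0) {
    Len = 4;
    CodePoint = Lead & 0x07;
  } else {
    return 0; // stray continuation byte, or a 5/6-byte lead byte
  }
  if (I + Len > S.size())
    return 0; // truncated sequence
  for (std::size_t K = 1; K < Len; ++K) {
    unsigned char C = static_cast<unsigned char>(S[I + K]);
    if ((C & 0xC0) != 0x80)
      return 0; // not a continuation byte
    CodePoint = (CodePoint << 6) | (C & 0x3F);
  }
  static const unsigned MinValue[] = {0, 0, 0x80, 0x800, 0x10000};
  if (CodePoint < MinValue[Len])
    return 0; // overlong encoding
  if (CodePoint >= 0xD800 && CodePoint <= 0xDFFF)
    return 0; // surrogate
  if (CodePoint > 0x10FFFF)
    return 0; // beyond the Unicode range
  return Len;
}

// Hypothetical simplification of columnWidth: one column per valid code
// point; if anything is malformed, fall back to the byte count.
int sketchColumnWidth(const std::string &Text) {
  int Columns = 0;
  for (std::size_t I = 0; I < Text.size();) {
    std::size_t Len = validSequenceLength(Text, I);
    if (Len == 0)
      return static_cast<int>(Text.size());
    I += Len;
    ++Columns;
  }
  return Columns;
}
```

With this sketch, a string in an 8-bit encoding such as "qwer\344" is measured as 5 columns, which is what a terminal using that encoding would display.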

================
Comment at: lib/Support/LocaleGeneric.inc:41-42
@@ +40,4 @@
+bool isPrint(int UCS) {
+  // Sorted list of non-overlapping intervals of code points that are not
+  // supposed to be printable.
+  static const UnicodeCharRange NonPrintableRanges[] = {
----------------
Dmitri Gribenko wrote:
> What is a good way to update these tables when a new version of Unicode standard comes out?
> 
It would boil down to reading the comment above the function and comparing the ranges for the categories listed in its bullet points with the ones in the code. For this specific function, I wouldn't expect new control character ranges, surrogate ranges, or default ignorable characters to be added.
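
For reference, a lookup over a sorted list of non-overlapping intervals can be sketched roughly as below. This is an illustration, not the code in the patch: CodePointRange, SketchNonPrintable, sketchIsPrint, and sketchTableIsSorted are hypothetical names, and the table is a small illustrative subset (C0/C1 controls and surrogates) rather than the real list. A sortedness check like the last function could help catch mistakes when the table is updated by hand.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iterator>

// Hypothetical stand-in for the range type used in the patch.
struct CodePointRange {
  std::uint32_t Lower; // first code point in the range (inclusive)
  std::uint32_t Upper; // last code point in the range (inclusive)
};

// Illustrative subset only; the real table covers more categories.
static const CodePointRange SketchNonPrintable[] = {
    {0x0000, 0x001F}, // C0 control characters
    {0x007F, 0x009F}, // DEL and C1 control characters
    {0xD800, 0xDFFF}, // surrogates
};

// Binary search: find the first range whose upper bound is not below the
// code point, then check whether the code point falls inside that range.
bool sketchIsPrint(std::uint32_t CodePoint) {
  const CodePointRange *Begin = std::begin(SketchNonPrintable);
  const CodePointRange *End = std::end(SketchNonPrintable);
  const CodePointRange *It = std::lower_bound(
      Begin, End, CodePoint,
      [](const CodePointRange &R, std::uint32_t Value) {
        return R.Upper < Value;
      });
  return It == End || CodePoint < It->Lower;
}

// Sanity check for manual updates: every range must be well-formed and
// must start strictly after the previous range ends.
bool sketchTableIsSorted() {
  const std::size_t N =
      sizeof(SketchNonPrintable) / sizeof(SketchNonPrintable[0]);
  for (std::size_t I = 0; I < N; ++I) {
    if (SketchNonPrintable[I].Lower > SketchNonPrintable[I].Upper)
      return false;
    if (I > 0 && SketchNonPrintable[I - 1].Upper >= SketchNonPrintable[I].Lower)
      return false;
  }
  return true;
}
```

The binary search is only correct if the table stays sorted and non-overlapping, so a check along these lines could run in a unit test or an assertion.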

http://llvm-reviews.chandlerc.com/D1253