[PATCH] Implemented llvm::sys::locale::columnWidth and isPrint	for the case of generic UTF8-capable terminal.
    Dmitri Gribenko 
    gribozavr at gmail.com
       
    Thu Aug  1 11:30:00 PDT 2013
    
    
  
================
Comment at: unittests/Support/LocaleTest.cpp:38-54
@@ +37,19 @@
+
+  // Invalid UTF-8 strings, columnWidth should be byte count.
+  EXPECT_EQ(1, columnWidth("\344"));
+  EXPECT_EQ(2, columnWidth("\344\270"));
+  EXPECT_EQ(3, columnWidth("\344\270\033"));
+  EXPECT_EQ(3, columnWidth("\344\270\300"));
+  EXPECT_EQ(3, columnWidth("\377\366\355"));
+
+  EXPECT_EQ(5, columnWidth("qwer\344"));
+  EXPECT_EQ(6, columnWidth("qwer\344\270"));
+  EXPECT_EQ(7, columnWidth("qwer\344\270\033"));
+  EXPECT_EQ(7, columnWidth("qwer\344\270\300"));
+  EXPECT_EQ(7, columnWidth("qwer\377\366\355"));
+
+  // UTF-8 sequences longer than 4 bytes correspond to unallocated Unicode
+  // characters.
+  EXPECT_EQ(5, columnWidth("\370\200\200\200\200"));     // U+200000
+  EXPECT_EQ(6, columnWidth("\374\200\200\200\200\200")); // U+4000000
+}
----------------
As far as I understand, this handling of incorrect UTF-8 is not correct.  As far as I remember, according tot the standard, incorrect UTF and code points should either not be processed at all (=return an error), or invalid subsequences should be replaced with replacement character U+FFFD.
The interesting part is that it looks like some terminals don't follow this rule.  As far as I remember, gnome-terminal will use the replacement character, and "\370\200\200\200\200" will be rendered in 1 column.  Same for iTerm2 on OS X.  But the built-in Terminal.app displays "?????".
================
Comment at: lib/Support/LocaleGeneric.inc:41-42
@@ +40,4 @@
+bool isPrint(int UCS) {
+  // Sorted list of non-overlapping intervals of code points that are not
+  // supposed to be printable.
+  static const UnicodeCharRange NonPrintableRanges[] = {
----------------
What is a good way to update these tables when a new version of Unicode standard comes out?
http://llvm-reviews.chandlerc.com/D1253
    
    
More information about the llvm-commits
mailing list