[PATCH] Implemented llvm::sys::locale::columnWidth and isPrint for the case of generic UTF8-capable terminal.
Dmitri Gribenko
gribozavr at gmail.com
Thu Aug 1 11:30:00 PDT 2013
================
Comment at: unittests/Support/LocaleTest.cpp:38-54
@@ +37,19 @@
+
+ // Invalid UTF-8 strings, columnWidth should be byte count.
+ EXPECT_EQ(1, columnWidth("\344"));
+ EXPECT_EQ(2, columnWidth("\344\270"));
+ EXPECT_EQ(3, columnWidth("\344\270\033"));
+ EXPECT_EQ(3, columnWidth("\344\270\300"));
+ EXPECT_EQ(3, columnWidth("\377\366\355"));
+
+ EXPECT_EQ(5, columnWidth("qwer\344"));
+ EXPECT_EQ(6, columnWidth("qwer\344\270"));
+ EXPECT_EQ(7, columnWidth("qwer\344\270\033"));
+ EXPECT_EQ(7, columnWidth("qwer\344\270\300"));
+ EXPECT_EQ(7, columnWidth("qwer\377\366\355"));
+
+ // UTF-8 sequences longer than 4 bytes correspond to unallocated Unicode
+ // characters.
+ EXPECT_EQ(5, columnWidth("\370\200\200\200\200")); // U+200000
+ EXPECT_EQ(6, columnWidth("\374\200\200\200\200\200")); // U+4000000
+}
----------------
As far as I understand, this handling of invalid UTF-8 is not correct. As far as I remember, according to the standard, ill-formed UTF-8 and invalid code points should either not be processed at all (i.e., an error is returned), or the invalid subsequences should be replaced with the replacement character U+FFFD.
The interesting part is that it looks like some terminals don't follow this rule. As far as I remember, gnome-terminal will use the replacement character, and "\370\200\200\200\200" will be rendered in 1 column. Same for iTerm2 on OS X. But the built-in Terminal.app displays "?????".
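
To make the replacement-character option concrete, here is a minimal sketch of a column count in which every ill-formed unit occupies one column, as a single U+FFFD would. It is not what the patch does, and it is not a claim about how any particular terminal is implemented. The names announcedLength and columnWidthWithReplacement are hypothetical. The grouping rule (a lead byte plus the continuation bytes that follow it, up to the length the lead byte announces under the legacy 1 to 6 byte scheme) is an assumption chosen so that "\370\200\200\200\200" comes out as one column; other replacement policies would give different counts. Wide, combining, and non-printable characters are deliberately ignored, so every well-formed character also counts as one column.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not part of the patch: number of bytes a lead byte
// announces under the legacy 1..6 byte scheme, or 0 if the byte cannot
// start a sequence.
static int announcedLength(unsigned char B) {
  if (B < 0x80) return 1; // ASCII
  if (B < 0xC0) return 0; // stray continuation byte
  if (B < 0xE0) return 2;
  if (B < 0xF0) return 3;
  if (B < 0xF8) return 4;
  if (B < 0xFC) return 5; // legacy 5-byte form, never valid today
  if (B < 0xFE) return 6; // legacy 6-byte form, never valid today
  return 0;               // 0xFE and 0xFF never appear in UTF-8
}

// Assumed policy: one column per unit, where a unit is a lead byte plus the
// continuation bytes that immediately follow it (at most announced - 1).
// A well-formed character and an ill-formed unit (shown as U+FFFD) both
// count as one column in this simplified sketch.
int columnWidthWithReplacement(const std::string &Text) {
  int Columns = 0;
  std::size_t I = 0;
  while (I < Text.size()) {
    int Len = announcedLength(static_cast<unsigned char>(Text[I]));
    ++I; // consume the lead byte, or a byte that cannot start a sequence
    for (int K = 1; K < Len && I < Text.size() &&
                    (static_cast<unsigned char>(Text[I]) & 0xC0) == 0x80;
         ++K)
      ++I; // consume a continuation byte belonging to this unit
    ++Columns;
  }
  return Columns;
}
```

Under this assumed policy the 5-byte and 6-byte inputs from the test each count as one column, while "\377\366\355" still counts as three because none of its bytes is followed by a continuation byte.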
================
Comment at: lib/Support/LocaleGeneric.inc:41-42
@@ +40,4 @@
+bool isPrint(int UCS) {
+ // Sorted list of non-overlapping intervals of code points that are not
+ // supposed to be printable.
+ static const UnicodeCharRange NonPrintableRanges[] = {
----------------
What is a good way to update these tables when a new version of the Unicode standard comes out?
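
Whatever the update process ends up being, the "sorted, non-overlapping" invariant stated in the comment is easy to check mechanically after a table has been regenerated. Below is a rough sketch under stated assumptions: CharRange, SampleRanges, isSortedAndDisjoint and inRanges are hypothetical names, the struct layout is assumed rather than taken from UnicodeCharRange in the patch, and the two sample ranges are only the C0 controls and DEL through the C1 controls, not the real table.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the patch's range type; bounds are inclusive.
struct CharRange {
  uint32_t Lower;
  uint32_t Upper;
};

// Illustrative data only: C0 controls, then DEL through the C1 controls.
static const CharRange SampleRanges[] = {
    {0x0000, 0x001F},
    {0x007F, 0x009F},
};
static const std::size_t NumSampleRanges =
    sizeof(SampleRanges) / sizeof(SampleRanges[0]);

// Checks the invariant that a binary search over the table relies on:
// each range is non-empty and starts strictly after the previous one ends.
bool isSortedAndDisjoint(const CharRange *R, std::size_t N) {
  for (std::size_t I = 0; I < N; ++I) {
    if (R[I].Lower > R[I].Upper)
      return false;
    if (I > 0 && R[I - 1].Upper >= R[I].Lower)
      return false;
  }
  return true;
}

// Binary search: find the first range whose upper bound is >= C, then check
// that C is not below that range's lower bound.
bool inRanges(const CharRange *R, std::size_t N, uint32_t C) {
  const CharRange *End = R + N;
  const CharRange *It = std::lower_bound(
      R, End, C,
      [](const CharRange &X, uint32_t V) { return X.Upper < V; });
  return It != End && It->Lower <= C;
}
```

A check like isSortedAndDisjoint could run in a unit test, so that a table regenerated for a new Unicode version fails loudly if the generator emits ranges out of order.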
http://llvm-reviews.chandlerc.com/D1253