[PATCH] Implemented llvm::sys::locale::columnWidth and isPrint for Linux (addresses PR14910).

Fri Jul 12 09:47:16 PDT 2013

  Looks reasonable. Not going to comment on what isPrint ought to do; your interpretation seems reasonable.


================
Comment at: unittests/Support/LocaleTest.cpp:29-30
@@ +28,4 @@
+
+  EXPECT_EQ(1, columnWidth("\340\270\201")); // 0E01 THAI CHARACTER KO KAI
+  EXPECT_EQ(2, columnWidth("\344\270\200")); // CJK UNIFIED IDEOGRAPH-4E00
+}
----------------
Would be nice to have a few tests with strings containing characters of mixed column width.

================
Comment at: lib/Support/LocaleGeneric.inc:61-85
@@ +60,27 @@
+
+struct Interval {
+  int First;
+  int Last;
+};
+
+// Searches a value in the sorted interval table. Returns true if an interval
+// containing the value is found.
+static bool binarySearch(int Value, ArrayRef<Interval> Table) {
+  int Min = 0;
+  int Max = Table.size() - 1;
+
+  if (Value < Table[Min].First || Value > Table[Max].Last)
+    return false;
+  while (Max >= Min) {
+    int Mid = (Min + Max) / 2;
+    if (Value > Table[Mid].Last)
+      Min = Mid + 1;
+    else if (Value < Table[Mid].First)
+      Max = Mid - 1;
+    else
+      return true;
+  }
+
+  return false;
+}
+
----------------
A lot of this is basically the same as what's in clang/Lex/UnicodeCharSets.h. The character sets have to stay in Clang because we didn't want Lex to have any cross-library references, but maybe we can sink the binary search, the interval and interval set types, and the validation logic down to LLVM?

================
Comment at: lib/Support/LocaleGeneric.inc:53-57
@@ -9,1 +52,7 @@
+
+int columnWidth(StringRef Text) {
+  if (isLegalUTF8(Text))
+    return columnWidthUTF8(Text);
+
+  return Text.size();
 }
----------------
Rather than make two passes over the text, it might be better to just make columnWidthUTF8 bail out on non-UTF-8 text, and then fall back to Text.size() here. I think UTF-8 is the common case.

================
Comment at: lib/Support/LocaleGeneric.inc:33
@@ +32,3 @@
+    assert(Increment != 0);
+    UTF32 buf[2];
+    const UTF8 *Start = reinterpret_cast<const UTF8 *>(Text.data() + i);
----------------
Why 2?

(I was going to say "why an array at all?" and then I realized that according to the standard there's no "one-past-the-end" address for regular variables.)

================
Comment at: lib/Support/wcwidth.c:69-87
@@ +68,21 @@
+
+/* auxiliary function for binary search in interval table */
+static int bisearch(wchar_t ucs, const struct interval *table, int max) {
+  int min = 0;
+  int mid;
+
+  if (ucs < table[0].first || ucs > table[max].last)
+    return 0;
+  while (max >= min) {
+    mid = (min + max) / 2;
+    if (ucs > table[mid].last)
+      min = mid + 1;
+    else if (ucs < table[mid].first)
+      max = mid - 1;
+    else
+      return 1;
+  }
+
+  return 0;
+}
+
----------------
Yeah, let's not have three versions of this lying around.

================
Comment at: lib/Support/LocaleGeneric.inc:39
@@ +38,3 @@
+    assert(Result == conversionOK);
+    int Width = mk_wcwidth(buf[0]);
+    if (Width < 0)
----------------
This takes a wchar_t but you're passing it a UTF32. Please assert somewhere that sizeof(wchar_t) == sizeof(UTF32), so that we don't accidentally try to use this code on Windows or another platform that decides wchar_t should be 16 bits.

Alternately, substitute all the wchar_t in Kuhn's code for UTF32?


http://llvm-reviews.chandlerc.com/D1121