[PATCH] Implemented llvm::sys::locale::columnWidth and isPrint for Linux (addresses PR14910).

Fri Jul 12 18:00:38 PDT 2013


================
Comment at: lib/Support/LocaleGeneric.inc:61-85
@@ +60,27 @@
+
+struct Interval {
+  int First;
+  int Last;
+};
+
+// Searches a value in the sorted interval table. Returns true if an interval
+// containing the value is found.
+static bool binarySearch(int Value, ArrayRef<Interval> Table) {
+  int Min = 0;
+  int Max = Table.size() - 1;
+
+  if (Value < Table[Min].First || Value > Table[Max].Last)
+    return false;
+  while (Max >= Min) {
+    int Mid = (Min + Max) / 2;
+    if (Value > Table[Mid].Last)
+      Min = Mid + 1;
+    else if (Value < Table[Mid].First)
+      Max = Mid - 1;
+    else
+      return true;
+  }
+
+  return false;
+}
+
----------------
Jordan Rose wrote:
> A lot of this is basically the same as what's in clang/Lex/UnicodeCharSets.h. The character sets have to stay in Clang because we didn't want Lex to have any cross-library references, but maybe we can sink the binary search, the interval and interval set types, and the validation logic down to LLVM?
Moved these pieces to include/llvm/Support/UnicodeCharRanges.h. I've also inlined getValidatedCharSets(). Was there any specific reason to make it a separate function? I'm also curious if marking the functions static is really needed here.

================
Comment at: lib/Support/LocaleGeneric.inc:39
@@ +38,3 @@
+    assert(Result == conversionOK);
+    int Width = mk_wcwidth(buf[0]);
+    if (Width < 0)
----------------
Jordan Rose wrote:
> This takes a wchar_t but you're passing it a UTF32. Please assert somewhere that sizeof(wchar_t) == sizeof(UTF32), so that we don't accidentally try to use this code on Windows or another platform that decides wchar_t should be 16 bits.
> 
> Alternately, substitute all the wchar_t in Kuhn's code for UTF32?
Yup, MSVC with 16-bit wchar_t would not like this code. Changed wchar_t to uint32_t.

================
Comment at: lib/Support/LocaleGeneric.inc:33
@@ +32,3 @@
+    assert(Increment != 0);
+    UTF32 buf[2];
+    const UTF8 *Start = reinterpret_cast<const UTF8 *>(Text.data() + i);
----------------
Jordan Rose wrote:
> Why 2?
> 
> (I was going to say "why an array at all?" and then I realized that according to the standard there's no "one-past-the-end" address for regular variables.)
Err, probably, primitive fear of pointers to nowhere ;) Changed to 1.

================
Comment at: lib/Support/LocaleGeneric.inc:53-57
@@ -9,1 +52,7 @@
+
+int columnWidth(StringRef Text) {
+  if (isLegalUTF8(Text))
+    return columnWidthUTF8(Text);
+
+  return Text.size();
 }
----------------
Jordan Rose wrote:
> Rather than make two passes over the text, it might be better to just make columnWidthUTF8 bail out on non-UTF-8 text, and then fall back to Text.size() here. I think UTF-8 is the common case.
Yup. Changed it and added tests with invalid UTF-8 strings.

================
Comment at: lib/Support/wcwidth.c:69-87
@@ +68,21 @@
+
+/* auxiliary function for binary search in interval table */
+static int bisearch(wchar_t ucs, const struct interval *table, int max) {
+  int min = 0;
+  int mid;
+
+  if (ucs < table[0].first || ucs > table[max].last)
+    return 0;
+  while (max >= min) {
+    mid = (min + max) / 2;
+    if (ucs > table[mid].last)
+      min = mid + 1;
+    else if (ucs < table[mid].first)
+      max = mid - 1;
+    else
+      return 1;
+  }
+
+  return 0;
+}
+
----------------
Jordan Rose wrote:
> Yeah, let's not have three versions of this lying around.
I had an idea to keep this file unmodified, but after all we don't seem to need a large part of it. Changed it to use our structure and binary search and added a comment with the description of modifications.


http://llvm-reviews.chandlerc.com/D1121