[PATCH] Implemented llvm::sys::locale::columnWidth and isPrint for Linux (addresses PR14910).
Jordan Rose
jordan_rose at apple.com
Fri Jul 12 09:47:16 PDT 2013
Looks reasonable. Not going to comment on what isPrint ought to do; your interpretation seems reasonable.
================
Comment at: unittests/Support/LocaleTest.cpp:29-30
@@ +28,4 @@
+
+ EXPECT_EQ(1, columnWidth("\340\270\201")); // 0E01 THAI CHARACTER KO KAI
+ EXPECT_EQ(2, columnWidth("\344\270\200")); // CJK UNIFIED IDEOGRAPH-4E00
+}
----------------
Would be nice to have a few tests with strings containing characters of mixed column width.
================
Comment at: lib/Support/LocaleGeneric.inc:61-85
@@ +60,27 @@
+
+struct Interval {
+ int First;
+ int Last;
+};
+
+// Searches a value in the sorted interval table. Returns true if an interval
+// containing the value is found.
+static bool binarySearch(int Value, ArrayRef<Interval> Table) {
+ int Min = 0;
+ int Max = Table.size() - 1;
+
+ if (Value < Table[Min].First || Value > Table[Max].Last)
+ return false;
+ while (Max >= Min) {
+ int Mid = (Min + Max) / 2;
+ if (Value > Table[Mid].Last)
+ Min = Mid + 1;
+ else if (Value < Table[Mid].First)
+ Max = Mid - 1;
+ else
+ return true;
+ }
+
+ return false;
+}
+
----------------
A lot of this is basically the same as what's in clang/Lex/UnicodeCharSets.h. The character sets have to stay in Clang because we didn't want Lex to have any cross-library references, but maybe we can sink the binary search, the interval and interval set types, and the validation logic down to LLVM?
================
Comment at: lib/Support/LocaleGeneric.inc:53-57
@@ -9,1 +52,7 @@
+
+int columnWidth(StringRef Text) {
+ if (isLegalUTF8(Text))
+ return columnWidthUTF8(Text);
+
+ return Text.size();
}
----------------
Rather than make two passes over the text, it might be better to just make columnWidthUTF8 bail out on non-UTF-8 text, and then fall back to Text.size() here. I think UTF-8 is the common case.
================
Comment at: lib/Support/LocaleGeneric.inc:33
@@ +32,3 @@
+ assert(Increment != 0);
+ UTF32 buf[2];
+ const UTF8 *Start = reinterpret_cast<const UTF8 *>(Text.data() + i);
----------------
Why 2?
(I was going to say "why an array at all?" and then I realized that according to the standard there's no "one-past-the-end" address for regular variables.)
================
Comment at: lib/Support/wcwidth.c:69-87
@@ +68,21 @@
+
+/* auxiliary function for binary search in interval table */
+static int bisearch(wchar_t ucs, const struct interval *table, int max) {
+ int min = 0;
+ int mid;
+
+ if (ucs < table[0].first || ucs > table[max].last)
+ return 0;
+ while (max >= min) {
+ mid = (min + max) / 2;
+ if (ucs > table[mid].last)
+ min = mid + 1;
+ else if (ucs < table[mid].first)
+ max = mid - 1;
+ else
+ return 1;
+ }
+
+ return 0;
+}
+
----------------
Yeah, let's not have three versions of this lying around.
================
Comment at: lib/Support/LocaleGeneric.inc:39
@@ +38,3 @@
+ assert(Result == conversionOK);
+ int Width = mk_wcwidth(buf[0]);
+ if (Width < 0)
----------------
This takes a wchar_t but you're passing it a UTF32. Please assert somewhere that sizeof(wchar_t) == sizeof(UTF32), so that we don't accidentally try to use this code on Windows or another platform that decides wchar_t should be 16 bits.
Alternately, substitute all the wchar_t in Kuhn's code for UTF32?
http://llvm-reviews.chandlerc.com/D1121
More information about the llvm-commits
mailing list