[PATCH] UTF-8 support for clang-format.

Wed Jun 5 06:00:51 PDT 2013


================
Comment at: unittests/Format/FormatTest.cpp:4890-4891
@@ +4889,4 @@
+               getLLVMStyleWithColumns(35));
+  verifyFormat("\"一 二 三 四 五 六 七 八 九 十\"",
+               getLLVMStyleWithColumns(21));
+  verifyFormat("// Однажды в студёную зимнюю пору...",
----------------
Daniel Jasper wrote:
> Dmitri Gribenko wrote:
> > OK, if we decide to count codepoints for now, please adjust the string length then -- it is currently computed as if a double-width character actually occupies 2 columns (but the code being tested counts it as one).
> > 
> Well, I think this is not at all what the test was meant to be testing. The point is that the string contains 22 bytes .. And IMO, it would also be 22 columns with 2-column characters. So it does in fact test the right thing.
> 
> But I also don't see harm in using getLLVMStyleWithColumns(12). If anything it makes the test stricter..
These characters are separated by spaces (to ensure clang-format can break the literal if it wants to), so the string contains 21 code points. The test is correct.

================
Comment at: unittests/Format/FormatTest.cpp:4931
@@ +4930,3 @@
+
+TEST_F(FormatTest, SplitsUTF8BlockComments) {
+  EXPECT_EQ("/* Гляжу,\n"
----------------
Daniel Jasper wrote:
> If I am correct, the chinese letters are just numbers, I hope the russian characters don't mean anything offensive ;-)...
Test cases in Russian aren't offensive in any way. I guess, Dmitry Gribenko would notice, if they were.

================
Comment at: lib/Format/Utils.h:1
@@ +1,2 @@
+//===--- Utils.h - Format C++ code ----------------------------------------===//
+//
----------------
Manuel Klimek wrote:
> Daniel Jasper wrote:
> > Please don't call this "Utils", this is far too generic. How about "Encodings"? I think hex/octal escape sequences are also a kind of encoding ..
> +1 to not have utils. Alternatively, create two headers, one for encodings, and one for string literal related functions, and name them appropriately.
Utils.h -> Encoding.h with proper changes in all related names.

================
Comment at: lib/Format/FormatToken.h:96
@@ -94,3 +95,3 @@
   /// with the token.
   unsigned TokenLength;
 
----------------
Manuel Klimek wrote:
> Daniel Jasper wrote:
> > How about we make these slightly easier to understand and shorter?
> > 
> > What are the remaining usages of TokenLength? Would it make sense to rename that to "ByteCount"? And would it then make sense to rename CodePointCount to "TokenLength"? Or even just "Length" as we are in a class ..Token?
> I think I'd like to keep CodePointCount as it makes it really obvious we're not counting bytes. I'm also in favor of renaming TokenLength to ByteCount though, so the duality is more clear.
Good idea. Renaming TokenLength -> ByteCount discover a few more wrong usages, and led to cleaning up some code in TokenAnnotator.


http://llvm-reviews.chandlerc.com/D918