[PATCH] UTF-8 support for clang-format.

Tue Jun 4 23:11:07 PDT 2013

  I agree that code points are the right thing to use (at least for now). There is one key advantage:

  As we are only breaking strings, not joining them, clang-format will rarely do the wrong thing with correctly formatted code. Currently, if we encounter a Unicode character, we end up breaking the string too early. This affects basically any long or multiline comment or string. With this patch, authors using double-width characters won't feel the joy of clang-format automatically breaking up their strings (in the right place), but once they have broken them manually, clang-format will still do the right thing. To be fair, clang-format might still do the wrong thing in situations like:

  SomeFunction("string with double-width characters would bring this close to column 80", AnotherParameter);

  However, I suspect those to be quite rare. And I agree with James that accounting for display width might be a dangerous road to follow. After all, double-width characters are not always double-width. I have seen font renderers using 1.5 columns, and then what?
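
  To make the byte vs. code point distinction concrete, here is a minimal sketch. The helper countCodePoints is hypothetical and not taken from the patch; it assumes the input is valid UTF-8 and simply skips continuation bytes (those of the form 10xxxxxx).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper for illustration only; not the patch's implementation.
// Counts UTF-8 code points by counting every byte that is not a
// continuation byte (continuation bytes match the bit pattern 10xxxxxx).
// Assumes Text is valid UTF-8; malformed input is not detected.
std::size_t countCodePoints(const std::string &Text) {
  std::size_t Count = 0;
  for (unsigned char C : Text)
    if ((C & 0xC0) != 0x80)
      ++Count;
  return Count;
}
```

  For the Cyrillic word used in the unit test, the byte length is 10 while the code point count is 5, so measuring in bytes makes the string look twice as wide as it is. Neither number captures display width for double-width characters, which is the remaining gap described above.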


================
Comment at: lib/Format/Utils.h:1
@@ +1,2 @@
+//===--- Utils.h - Format C++ code ----------------------------------------===//
+//
----------------
Please don't call this "Utils"; that is far too generic. How about "Encodings"? I think hex/octal escape sequences are also a kind of encoding.

================
Comment at: unittests/Format/FormatTest.cpp:4931
@@ +4930,3 @@
+
+TEST_F(FormatTest, SplitsUTF8BlockComments) {
+  EXPECT_EQ("/* Гляжу,\n"
----------------
If I am correct, the Chinese characters are just numbers. I hope the Russian characters don't mean anything offensive ;-)

================
Comment at: lib/Format/FormatToken.h:96
@@ -94,3 +95,3 @@
   /// with the token.
   unsigned TokenLength;
 
----------------
How about we make these slightly easier to understand and shorter?

What are the remaining usages of TokenLength? Would it make sense to rename that to "ByteCount"? And would it then make sense to rename CodePointCount to "TokenLength"? Or even just "Length", as we are in a class named ...Token?


http://llvm-reviews.chandlerc.com/D918