[PATCH] Handle zero-width and double-width characters in string literals and comments.

Thu Sep 5 01:58:14 PDT 2013


================
Comment at: lib/Format/Encoding.h:81
@@ +80,3 @@
+    TotalWidth += Width;
+    TotalWidth += 8 - (TotalWidth + StartColumn) % 8;
+    Text = Text.substr(TabPos + 1);
----------------
Note that tabs are already configurable. When UseTabs is true, we assume IndentWidth == tab width in chars.

================
Comment at: lib/Format/Encoding.h:75
@@ +74,3 @@
+  for (;;) {
+    StringRef::size_type TabPos = Text.find('\t');
+    if (TabPos == StringRef::npos)
----------------
I think the reuse of "Text" here is surprising. I'd probably create a variable "Tail" and work from that.

================
Comment at: lib/Format/Encoding.h:63
@@ +62,3 @@
+  if (Encoding == Encoding_UTF8) {
+    int ContentWidth = llvm::sys::unicode::columnWidthUTF8(Text);
+    if (ContentWidth >= 0)
----------------
Wait, wat? getCodePointCount already has a utf-8 special case. We only support utf-8. So why not just return Text.size() if this is not UTF-8?

================
Comment at: lib/Format/BreakableToken.cpp:337
@@ -330,4 +336,3 @@
     unsigned LineIndex, unsigned Offset, StringRef::size_type Length) const {
-  return getContentStartColumn(LineIndex, Offset) +
-         encoding::getCodePointCount(Lines[LineIndex].substr(Offset, Length),
-                                     Encoding) +
+  unsigned ContentStartColumn = getContentStartColumn(LineIndex, Offset);
+  return ContentStartColumn +
----------------
Any specific reason for outlining this? It doesn't look like it adds much... (not that I care too much ;)

================
Comment at: lib/Format/BreakableToken.cpp:53
@@ -54,1 +52,3 @@
+       NumChars < MaxSplit && MaxSplitBytes < Text.size();) {
+    unsigned NumBytes =
         encoding::getCodePointNumBytes(Text[MaxSplitBytes], Encoding);
----------------
I find it slightly confusing that NumBytes refers to *one* code point, while NumChars refers to the total number of characters.
How about naming NumBytes -> BytesForChar, or BytesInChar, or something like that?


http://llvm-reviews.chandlerc.com/D1601