[PATCH] Handle zero-width and double-width characters in string literals and comments.

Thu Sep 5 06:07:23 PDT 2013


================
Comment at: lib/Format/BreakableToken.cpp:337
@@ -330,4 +336,3 @@
     unsigned LineIndex, unsigned Offset, StringRef::size_type Length) const {
-  return getContentStartColumn(LineIndex, Offset) +
-         encoding::getCodePointCount(Lines[LineIndex].substr(Offset, Length),
-                                     Encoding) +
+  unsigned ContentStartColumn = getContentStartColumn(LineIndex, Offset);
+  return ContentStartColumn +
----------------
Manuel Klimek wrote:
> Any specific reason for outlining this? It doesn't look like it adds much... (not that I care too much ;)
If you mean getContentStartColumn, it is used in two different methods. If you mean the local ContentStartColumn variable, it's also used twice in this method.

================
Comment at: lib/Format/BreakableToken.cpp:53
@@ -54,1 +52,3 @@
+       NumChars < MaxSplit && MaxSplitBytes < Text.size();) {
+    unsigned NumBytes =
         encoding::getCodePointNumBytes(Text[MaxSplitBytes], Encoding);
----------------
Manuel Klimek wrote:
> I find it slightly confusing that NumBytes refers to *one* code point, while NumChars refers to the total number of characters.
> How about naming NumBytes -> BytesForChar, or BytesInChar, or something like that?
Done.

================
Comment at: lib/Format/Encoding.h:63
@@ +62,3 @@
+  if (Encoding == Encoding_UTF8) {
+    int ContentWidth = llvm::sys::unicode::columnWidthUTF8(Text);
+    if (ContentWidth >= 0)
----------------
Manuel Klimek wrote:
> Wait, wat? getCodePointCount already has a utf-8 special case. We only support utf-8. So why not just return Text.size() if this is not UTF-8?
Err, I've missed this, as it became obvious only once I moved columnWidth close to getCodePointCount ;)

================
Comment at: lib/Format/Encoding.h:75
@@ +74,3 @@
+  for (;;) {
+    StringRef::size_type TabPos = Text.find('\t');
+    if (TabPos == StringRef::npos)
----------------
Manuel Klimek wrote:
> I think the reuse of "Text" here is surprising. I'd probably create a variable "Tail" and work from that.
If you think it's important, sure. Let it be Tail.

================
Comment at: lib/Format/Encoding.h:81
@@ +80,3 @@
+    TotalWidth += Width;
+    TotalWidth += 8 - (TotalWidth + StartColumn) % 8;
+    Text = Text.substr(TabPos + 1);
----------------
Manuel Klimek wrote:
> Note that tabs are already configurable. When UseTabs is true, we assume IndentWidth == tab width in chars.
This is a wrong way to handle tabs. Usually tab width is 8 characters, but when it's configurable, it's a separate option from the indentation width. I've added FormatStyle::TabWidth to address this, and corresponding tests.


http://llvm-reviews.chandlerc.com/D1601