r183312 - UTF-8 support for clang-format.

Fri Jun 7 08:46:10 PDT 2013

A bit more information: the issue that tracks support for UTF-8 BOM is
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33415. The patch seems to be
present in GCC from version 4.4.0 on. LLVM claims to support GCC from 3.4
(with numerous exceptions):
http://llvm.org/docs/GettingStarted.html#software

So I'd better not introduce a change that would further disappoint the poor
guys who still have to deal with older GCC versions.

I'll leave the new tests #ifdefed out until we move them to a file-based test.
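
For reference, the UTF-8 BOM discussed here is just the three bytes EF BB BF
(U+FEFF encoded as UTF-8) at the very start of a file. A minimal sketch of
detecting and skipping it follows. The helper names are made up for
illustration; they are not part of Clang or clang-format, and this is not a
copy of what any compiler actually does internally.

```cpp
#include <cassert>
#include <string>

// UTF-8 encoding of U+FEFF (the byte order mark): EF BB BF.
static const char Utf8Bom[] = "\xEF\xBB\xBF";

// Hypothetical helper: true if Text begins with the UTF-8 BOM.
inline bool startsWithUtf8Bom(const std::string &Text) {
  // compare() clamps the length to the string size, so short inputs are safe.
  return Text.compare(0, 3, Utf8Bom) == 0;
}

// Hypothetical helper: returns Text with a leading BOM removed, if present.
inline std::string stripUtf8Bom(const std::string &Text) {
  return startsWithUtf8Bom(Text) ? Text.substr(3) : Text;
}

// Small self-check exercising both helpers.
inline void checkBomHelpers() {
  assert(startsWithUtf8Bom("\xEF\xBB\xBFint x;"));
  assert(!startsWithUtf8Bom("int x;"));
  assert(stripUtf8Bom("\xEF\xBB\xBFint x;") == "int x;");
}
```

A file without the prefix passes through unchanged, so a check like this is
harmless for sources that were saved without a BOM.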

On Fri, Jun 7, 2013 at 3:52 PM, Alexander Kornienko <alexfh at google.com> wrote:

> On Fri, Jun 7, 2013 at 6:46 AM, Nico Weber <thakis at chromium.org> wrote:
>
>> On Thu, Jun 6, 2013 at 4:49 PM, Alexander Kornienko <alexfh at google.com> wrote:
>>
>>> On Thu, Jun 6, 2013 at 1:11 AM, NAKAMURA Takumi <geek4civic at gmail.com> wrote:
>>>
>>>> I wonder whether the source file could contain UTF-8 characters.
>>>>
>>>
>>> It's implementation-defined behavior. Apparently GCC and Clang handle
>>> this correctly.
>>>
>>>
>>>> In fact, without a BOM, MS cl.exe misdetects the charset: it uses the
>>>> system codepage (932) rather than the current codepage (65001).
>>>>
>>>
>>> Seems like adding a UTF-8 BOM is the only way to force MSVC to treat a
>>> source file as UTF-8. But this is not supported by GCC and Clang, AFAIK.
>>>
>>
>> clang's Lexer::InitLexer() skips BOMs.
>>
>
> Sounds interesting. And here
> <http://stackoverflow.com/questions/7899795/is-it-possible-to-get-gcc-to-compile-utf-8-with-bom-source-files>
> they say that GCC also supports this. I've checked with Clang trunk and GCC
> 4.6.3, and it works. Then are there any reasons not to just add a UTF-8 BOM?
>
>
>>
>>
>>>
>>>> Could you get rid of raw utf8 characters and encode them in literals?
>>>> FYI, I can see Cyrillic and CJK :)
>>>>
>>>
>>> There's a plan to make some of our tests file-based instead of unit
>>> tests. I think the UTF-8 tests are the first candidates for this. As UTF-8
>>> support is not the most important thing for Windows builds of clang-format,
>>> I'd leave the new tests just #ifdefed out for now. BTW, thanks for doing
>>> this.
>>>
>>>
>>>>
>>>> ...Takumi
>>>>
>>>> 2013/6/5 Alexander Kornienko <alexfh at google.com>:
>>>> > Author: alexfh
>>>> > Date: Wed Jun  5 09:09:10 2013
>>>> > New Revision: 183312
>>>> >
>>>> > URL: http://llvm.org/viewvc/llvm-project?rev=183312&view=rev
>>>> > Log:
>>>> > UTF-8 support for clang-format.
>>>> >
>>>> > Summary:
>>>> > Detect if the file is valid UTF-8, and if this is the case, count code
>>>> > points instead of just using number of bytes in all (hopefully) places, where
>>>> > number of columns is needed. In particular, use the new
>>>> > FormatToken.CodePointCount instead of TokenLength where appropriate.
>>>> > Changed BreakableToken implementations to respect utf-8 character boundaries
>>>> > when in utf-8 mode.
>>>> >
>>>> > Reviewers: klimek, djasper
>>>> >
>>>> > Reviewed By: djasper
>>>> >
>>>> > CC: cfe-commits, rsmith, gribozavr
>>>> >
>>>> > Differential Revision: http://llvm-reviews.chandlerc.com/D918
>>>> >
>>>> > Added:
>>>> >     cfe/trunk/lib/Format/Encoding.h
>>>> > Modified:
>>>> >     cfe/trunk/lib/Format/BreakableToken.cpp
>>>> >     cfe/trunk/lib/Format/BreakableToken.h
>>>> >     cfe/trunk/lib/Format/Format.cpp
>>>> >     cfe/trunk/lib/Format/FormatToken.h
>>>> >     cfe/trunk/lib/Format/TokenAnnotator.cpp
>>>> >     cfe/trunk/lib/Format/TokenAnnotator.h
>>>> >     cfe/trunk/unittests/Format/FormatTest.cpp
>>>> >
>>>> > Modified: cfe/trunk/lib/Format/BreakableToken.cpp
>>>> > URL: http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/Format/BreakableToken.cpp?rev=183312&r1=183311&r2=183312&view=diff
>>>> > ==============================================================================
>>>> > --- cfe/trunk/lib/Format/BreakableToken.cpp (original)
>>>> > +++ cfe/trunk/lib/Format/BreakableToken.cpp Wed Jun  5 09:09:10 2013
>>>> > @@ -25,66 +25,22 @@ namespace clang {
>>>> >  namespace format {
>>>> >  namespace {
>>>> >
>>>> > -// FIXME: Move helper string functions to where it makes sense.
>>>> > -
>>>> > -unsigned getOctalLength(StringRef Text) {
>>>> > -  unsigned I = 1;
>>>> > -  while (I < Text.size() && I < 4 && (Text[I] >= '0' && Text[I] <= '7')) {
>>>> > -    ++I;
>>>> > -  }
>>>> > -  return I;
>>>> > -}
>>>> > -
>>>> > -unsigned getHexLength(StringRef Text) {
>>>> > -  unsigned I = 2; // Point after '\x'.
>>>> > -  while (I < Text.size() && ((Text[I] >= '0' && Text[I] <= '9') ||
>>>> > -                             (Text[I] >= 'a' && Text[I] <= 'f') ||
>>>> > -                             (Text[I] >= 'A' && Text[I] <= 'F'))) {
>>>> > -    ++I;
>>>> > -  }
>>>> > -  return I;
>>>> > -}
>>>> > -
>>>> > -unsigned getEscapeSequenceLength(StringRef Text) {
>>>> > -  assert(Text[0] == '\\');
>>>> > -  if (Text.size() < 2)
>>>> > -    return 1;
>>>> > -
>>>> > -  switch (Text[1]) {
>>>> > -  case 'u':
>>>> > -    return 6;
>>>> > -  case 'U':
>>>> > -    return 10;
>>>> > -  case 'x':
>>>> > -    return getHexLength(Text);
>>>> > -  default:
>>>> > -    if (Text[1] >= '0' && Text[1] <= '7')
>>>> > -      return getOctalLength(Text);
>>>> > -    return 2;
>>>> > -  }
>>>> > -}
>>>> > -
>>>> > -StringRef::size_type getStartOfCharacter(StringRef Text,
>>>> > -                                         StringRef::size_type Offset) {
>>>> > -  StringRef::size_type NextEscape = Text.find('\\');
>>>> > -  while (NextEscape != StringRef::npos && NextEscape < Offset) {
>>>> > -    StringRef::size_type SequenceLength =
>>>> > -        getEscapeSequenceLength(Text.substr(NextEscape));
>>>> > -    if (Offset < NextEscape + SequenceLength)
>>>> > -      return NextEscape;
>>>> > -    NextEscape = Text.find('\\', NextEscape + SequenceLength);
>>>> > -  }
>>>> > -  return Offset;
>>>> > -}
>>>> > -
>>>> >  BreakableToken::Split getCommentSplit(StringRef Text,
>>>> >                                        unsigned ContentStartColumn,
>>>> > -                                      unsigned ColumnLimit) {
>>>> > +                                      unsigned ColumnLimit,
>>>> > +                                      encoding::Encoding Encoding) {
>>>> >    if (ColumnLimit <= ContentStartColumn + 1)
>>>> >      return BreakableToken::Split(StringRef::npos, 0);
>>>> >
>>>> >    unsigned MaxSplit = ColumnLimit - ContentStartColumn + 1;
>>>> > -  StringRef::size_type SpaceOffset = Text.rfind(' ', MaxSplit);
>>>> > +  unsigned MaxSplitBytes = 0;
>>>> > +
>>>> > +  for (unsigned NumChars = 0;
>>>> > +       NumChars < MaxSplit && MaxSplitBytes < Text.size(); ++NumChars)
>>>> > +    MaxSplitBytes +=
>>>> > +        encoding::getCodePointNumBytes(Text[MaxSplitBytes], Encoding);
>>>> > +
>>>> > +  StringRef::size_type SpaceOffset = Text.rfind(' ', MaxSplitBytes);
>>>> >    if (SpaceOffset == StringRef::npos ||
>>>> >        // Don't break at leading whitespace.
>>>> >        Text.find_last_not_of(' ', SpaceOffset) == StringRef::npos) {
>>>> > @@ -95,7 +51,7 @@ BreakableToken::Split getCommentSplit(St
>>>> >        // If the comment is only whitespace, we cannot split.
>>>> >        return BreakableToken::Split(StringRef::npos, 0);
>>>> >      SpaceOffset =
>>>> > -        Text.find(' ', std::max<unsigned>(MaxSplit, FirstNonWhitespace));
>>>> > +        Text.find(' ', std::max<unsigned>(MaxSplitBytes, FirstNonWhitespace));
>>>> >    }
>>>> >    if (SpaceOffset != StringRef::npos && SpaceOffset != 0) {
>>>> >      StringRef BeforeCut = Text.substr(0, SpaceOffset).rtrim();
>>>> > @@ -108,25 +64,48 @@ BreakableToken::Split getCommentSplit(St
>>>> >
>>>> >  BreakableToken::Split getStringSplit(StringRef Text,
>>>> >                                       unsigned ContentStartColumn,
>>>> > -                                     unsigned ColumnLimit) {
>>>> > -
>>>> > -  if (ColumnLimit <= ContentStartColumn)
>>>> > -    return BreakableToken::Split(StringRef::npos, 0);
>>>> > -  unsigned MaxSplit = ColumnLimit - ContentStartColumn;
>>>> > +                                     unsigned ColumnLimit,
>>>> > +                                     encoding::Encoding Encoding) {
>>>> >    // FIXME: Reduce unit test case.
>>>> >    if (Text.empty())
>>>> >      return BreakableToken::Split(StringRef::npos, 0);
>>>> > -  MaxSplit = std::min<unsigned>(MaxSplit, Text.size() - 1);
>>>> > -  StringRef::size_type SpaceOffset = Text.rfind(' ', MaxSplit);
>>>> > -  if (SpaceOffset != StringRef::npos && SpaceOffset != 0)
>>>> > +  if (ColumnLimit <= ContentStartColumn)
>>>> > +    return BreakableToken::Split(StringRef::npos, 0);
>>>> > +  unsigned MaxSplit =
>>>> > +      std::min<unsigned>(ColumnLimit - ContentStartColumn,
>>>> > +                         encoding::getCodePointCount(Text, Encoding) - 1);
>>>> > +  StringRef::size_type SpaceOffset = 0;
>>>> > +  StringRef::size_type SlashOffset = 0;
>>>> > +  StringRef::size_type SplitPoint = 0;
>>>> > +  for (unsigned Chars = 0;;) {
>>>> > +    unsigned Advance;
>>>> > +    if (Text[0] == '\\') {
>>>> > +      Advance = encoding::getEscapeSequenceLength(Text);
>>>> > +      Chars += Advance;
>>>> > +    } else {
>>>> > +      Advance = encoding::getCodePointNumBytes(Text[0], Encoding);
>>>> > +      Chars += 1;
>>>> > +    }
>>>> > +
>>>> > +    if (Chars > MaxSplit)
>>>> > +      break;
>>>> > +
>>>> > +    if (Text[0] == ' ')
>>>> > +      SpaceOffset = SplitPoint;
>>>> > +    if (Text[0] == '/')
>>>> > +      SlashOffset = SplitPoint;
>>>> > +
>>>> > +    SplitPoint += Advance;
>>>> > +    Text = Text.substr(Advance);
>>>> > +  }
>>>> > +
>>>> > +  if (SpaceOffset != 0)
>>>> >      return BreakableToken::Split(SpaceOffset + 1, 0);
>>>> > -  StringRef::size_type SlashOffset = Text.rfind('/', MaxSplit);
>>>> > -  if (SlashOffset != StringRef::npos && SlashOffset != 0)
>>>> > +  if (SlashOffset != 0)
>>>> >      return BreakableToken::Split(SlashOffset + 1, 0);
>>>> > -  StringRef::size_type SplitPoint = getStartOfCharacter(Text, MaxSplit);
>>>> > -  if (SplitPoint == StringRef::npos || SplitPoint == 0)
>>>> > -    return BreakableToken::Split(StringRef::npos, 0);
>>>> > -  return BreakableToken::Split(SplitPoint, 0);
>>>> > +  if (SplitPoint != 0)
>>>> > +    return BreakableToken::Split(SplitPoint, 0);
>>>> > +  return BreakableToken::Split(StringRef::npos, 0);
>>>> >  }
>>>> >
>>>> >  } // namespace
>>>> > @@ -136,8 +115,8 @@ unsigned BreakableSingleLineToken::getLi
>>>> >  unsigned
>>>> >  BreakableSingleLineToken::getLineLengthAfterSplit(unsigned LineIndex,
>>>> >                                                    unsigned TailOffset) const {
>>>> > -  return StartColumn + Prefix.size() + Postfix.size() + Line.size() -
>>>> > -         TailOffset;
>>>> > +  return StartColumn + Prefix.size() + Postfix.size() +
>>>> > +         encoding::getCodePointCount(Line.substr(TailOffset), Encoding);
>>>> >  }
>>>> >
>>>> >  void BreakableSingleLineToken::insertBreak(unsigned LineIndex,
>>>> > @@ -152,8 +131,9 @@ void BreakableSingleLineToken::insertBre
>>>> >  BreakableSingleLineToken::BreakableSingleLineToken(const FormatToken &Tok,
>>>> >                                                     unsigned StartColumn,
>>>> >                                                     StringRef Prefix,
>>>> > -                                                   StringRef Postfix)
>>>> > -    : BreakableToken(Tok), StartColumn(StartColumn), Prefix(Prefix),
>>>> > +                                                   StringRef Postfix,
>>>> > +                                                   encoding::Encoding Encoding)
>>>> > +    : BreakableToken(Tok, Encoding), StartColumn(StartColumn), Prefix(Prefix),
>>>> >        Postfix(Postfix) {
>>>> >    assert(Tok.TokenText.startswith(Prefix) && Tok.TokenText.endswith(Postfix));
>>>> >    Line = Tok.TokenText.substr(
>>>> > @@ -161,13 +141,15 @@ BreakableSingleLineToken::BreakableSingl
>>>> >  }
>>>> >
>>>> >  BreakableStringLiteral::BreakableStringLiteral(const FormatToken &Tok,
>>>> > -                                               unsigned StartColumn)
>>>> > -    : BreakableSingleLineToken(Tok, StartColumn, "\"", "\"") {}
>>>> > +                                               unsigned StartColumn,
>>>> > +                                               encoding::Encoding Encoding)
>>>> > +    : BreakableSingleLineToken(Tok, StartColumn, "\"", "\"", Encoding) {}
>>>> >
>>>> >  BreakableToken::Split
>>>> >  BreakableStringLiteral::getSplit(unsigned LineIndex, unsigned TailOffset,
>>>> >                                   unsigned ColumnLimit) const {
>>>> > -  return getStringSplit(Line.substr(TailOffset), StartColumn + 2, ColumnLimit);
>>>> > +  return getStringSplit(Line.substr(TailOffset), StartColumn + 2, ColumnLimit,
>>>> > +                        Encoding);
>>>> >  }
>>>> >
>>>> >  static StringRef getLineCommentPrefix(StringRef Comment) {
>>>> > @@ -179,23 +161,23 @@ static StringRef getLineCommentPrefix(St
>>>> >  }
>>>> >
>>>> >  BreakableLineComment::BreakableLineComment(const FormatToken &Token,
>>>> > -                                           unsigned StartColumn)
>>>> > +                                           unsigned StartColumn,
>>>> > +                                           encoding::Encoding Encoding)
>>>> >      : BreakableSingleLineToken(Token, StartColumn,
>>>> > -                               getLineCommentPrefix(Token.TokenText), "") {}
>>>> > +                               getLineCommentPrefix(Token.TokenText), "",
>>>> > +                               Encoding) {}
>>>> >
>>>> >  BreakableToken::Split
>>>> >  BreakableLineComment::getSplit(unsigned LineIndex, unsigned TailOffset,
>>>> >                                 unsigned ColumnLimit) const {
>>>> >    return getCommentSplit(Line.substr(TailOffset), StartColumn + Prefix.size(),
>>>> > -                         ColumnLimit);
>>>> > +                         ColumnLimit, Encoding);
>>>> >  }
>>>> >
>>>> > -BreakableBlockComment::BreakableBlockComment(const FormatStyle &Style,
>>>> > -                                             const FormatToken &Token,
>>>> > -                                             unsigned StartColumn,
>>>> > -                                             unsigned OriginalStartColumn,
>>>> > -                                             bool FirstInLine)
>>>> > -    : BreakableToken(Token) {
>>>> > +BreakableBlockComment::BreakableBlockComment(
>>>> > +    const FormatStyle &Style, const FormatToken &Token, unsigned StartColumn,
>>>> > +    unsigned OriginalStartColumn, bool FirstInLine, encoding::Encoding Encoding)
>>>> > +    : BreakableToken(Token, Encoding) {
>>>> >    StringRef TokenText(Token.TokenText);
>>>> >    assert(TokenText.startswith("/*") && TokenText.endswith("*/"));
>>>> >    TokenText.substr(2, TokenText.size() - 4).split(Lines, "\n");
>>>> > @@ -290,7 +272,8 @@ unsigned
>>>> >  BreakableBlockComment::getLineLengthAfterSplit(unsigned LineIndex,
>>>> >                                                 unsigned TailOffset) const {
>>>> >    return getContentStartColumn(LineIndex, TailOffset) +
>>>> > -         (Lines[LineIndex].size() - TailOffset) +
>>>> > +         encoding::getCodePointCount(Lines[LineIndex].substr(TailOffset),
>>>> > +                                     Encoding) +
>>>> >           // The last line gets a "*/" postfix.
>>>> >           (LineIndex + 1 == Lines.size() ? 2 : 0);
>>>> >  }
>>>> > @@ -300,7 +283,7 @@ BreakableBlockComment::getSplit(unsigned
>>>> >                                  unsigned ColumnLimit) const {
>>>> >    return getCommentSplit(Lines[LineIndex].substr(TailOffset),
>>>> >                           getContentStartColumn(LineIndex, TailOffset),
>>>> > -                         ColumnLimit);
>>>> > +                         ColumnLimit, Encoding);
>>>> >  }
>>>> >
>>>> >  void BreakableBlockComment::insertBreak(unsigned LineIndex, unsigned TailOffset,
>>>> >
>>>> > Modified: cfe/trunk/lib/Format/BreakableToken.h
>>>> > URL: http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/Format/BreakableToken.h?rev=183312&r1=183311&r2=183312&view=diff
>>>> > ==============================================================================
>>>> > --- cfe/trunk/lib/Format/BreakableToken.h (original)
>>>> > +++ cfe/trunk/lib/Format/BreakableToken.h Wed Jun  5 09:09:10 2013
>>>> > @@ -17,6 +17,7 @@
>>>> >  #ifndef LLVM_CLANG_FORMAT_BREAKABLETOKEN_H
>>>> >  #define LLVM_CLANG_FORMAT_BREAKABLETOKEN_H
>>>> >
>>>> > +#include "Encoding.h"
>>>> >  #include "TokenAnnotator.h"
>>>> >  #include "WhitespaceManager.h"
>>>> >  #include <utility>
>>>> > @@ -65,9 +66,11 @@ public:
>>>> >                                         WhitespaceManager &Whitespaces) {}
>>>> >
>>>> >  protected:
>>>> > -  BreakableToken(const FormatToken &Tok) : Tok(Tok) {}
>>>> > +  BreakableToken(const FormatToken &Tok, encoding::Encoding Encoding)
>>>> > +      : Tok(Tok), Encoding(Encoding) {}
>>>> >
>>>> >    const FormatToken &Tok;
>>>> > +  encoding::Encoding Encoding;
>>>> >  };
>>>> >
>>>> >  /// \brief Base class for single line tokens that can be broken.
>>>> > @@ -83,7 +86,8 @@ public:
>>>> >
>>>> >  protected:
>>>> >    BreakableSingleLineToken(const FormatToken &Tok, unsigned StartColumn,
>>>> > -                           StringRef Prefix, StringRef Postfix);
>>>> > +                           StringRef Prefix, StringRef Postfix,
>>>> > +                           encoding::Encoding Encoding);
>>>> >
>>>> >    // The column in which the token starts.
>>>> >    unsigned StartColumn;
>>>> > @@ -101,7 +105,8 @@ public:
>>>> >    ///
>>>> >    /// \p StartColumn specifies the column in which the token will start
>>>> >    /// after formatting.
>>>> > -  BreakableStringLiteral(const FormatToken &Tok, unsigned StartColumn);
>>>> > +  BreakableStringLiteral(const FormatToken &Tok, unsigned StartColumn,
>>>> > +                         encoding::Encoding Encoding);
>>>> >
>>>> >    virtual Split getSplit(unsigned LineIndex, unsigned TailOffset,
>>>> >                           unsigned ColumnLimit) const;
>>>> > @@ -113,7 +118,8 @@ public:
>>>> >    ///
>>>> >    /// \p StartColumn specifies the column in which the comment will start
>>>> >    /// after formatting.
>>>> > -  BreakableLineComment(const FormatToken &Token, unsigned StartColumn);
>>>> > +  BreakableLineComment(const FormatToken &Token, unsigned StartColumn,
>>>> > +                       encoding::Encoding Encoding);
>>>> >
>>>> >    virtual Split getSplit(unsigned LineIndex, unsigned TailOffset,
>>>> >                           unsigned ColumnLimit) const;
>>>> > @@ -129,7 +135,7 @@ public:
>>>> >    /// If the comment starts a line after formatting, set \p FirstInLine to true.
>>>> >    BreakableBlockComment(const FormatStyle &Style, const FormatToken &Token,
>>>> >                          unsigned StartColumn, unsigned OriginaStartColumn,
>>>> > -                        bool FirstInLine);
>>>> > +                        bool FirstInLine, encoding::Encoding Encoding);
>>>> >
>>>> >    virtual unsigned getLineCount() const;
>>>> >    virtual unsigned getLineLengthAfterSplit(unsigned LineIndex,
>>>> >
>>>> > Added: cfe/trunk/lib/Format/Encoding.h
>>>> > URL: http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/Format/Encoding.h?rev=183312&view=auto
>>>> > ==============================================================================
>>>> > --- cfe/trunk/lib/Format/Encoding.h (added)
>>>> > +++ cfe/trunk/lib/Format/Encoding.h Wed Jun  5 09:09:10 2013
>>>> > @@ -0,0 +1,114 @@
>>>> > +//===--- Encoding.h - Format C++ code -------------------------------------===//
>>>> > +//
>>>> > +//                     The LLVM Compiler Infrastructure
>>>> > +//
>>>> > +// This file is distributed under the University of Illinois Open Source
>>>> > +// License. See LICENSE.TXT for details.
>>>> > +//
>>>> > +//===----------------------------------------------------------------------===//
>>>> > +///
>>>> > +/// \file
>>>> > +/// \brief Contains functions for text encoding manipulation. Supports UTF-8,
>>>> > +/// 8-bit encodings and escape sequences in C++ string literals.
>>>> > +///
>>>> > +//===----------------------------------------------------------------------===//
>>>> > +
>>>> > +#ifndef LLVM_CLANG_FORMAT_ENCODING_H
>>>> > +#define LLVM_CLANG_FORMAT_ENCODING_H
>>>> > +
>>>> > +#include "clang/Basic/LLVM.h"
>>>> > +#include "llvm/Support/ConvertUTF.h"
>>>> > +
>>>> > +namespace clang {
>>>> > +namespace format {
>>>> > +namespace encoding {
>>>> > +
>>>> > +enum Encoding {
>>>> > +  Encoding_UTF8,
>>>> > +  Encoding_Unknown // We treat all other encodings as 8-bit encodings.
>>>> > +};
>>>> > +
>>>> > +/// \brief Detects encoding of the Text. If the Text can be decoded using UTF-8,
>>>> > +/// it is considered UTF8, otherwise we treat it as some 8-bit encoding.
>>>> > +inline Encoding detectEncoding(StringRef Text) {
>>>> > +  const UTF8 *Ptr = reinterpret_cast<const UTF8 *>(Text.begin());
>>>> > +  const UTF8 *BufEnd = reinterpret_cast<const UTF8 *>(Text.end());
>>>> > +  if (::isLegalUTF8String(&Ptr, BufEnd))
>>>> > +    return Encoding_UTF8;
>>>> > +  return Encoding_Unknown;
>>>> > +}
>>>> > +
>>>> > +inline unsigned getCodePointCountUTF8(StringRef Text) {
>>>> > +  unsigned CodePoints = 0;
>>>> > +  for (size_t i = 0, e = Text.size(); i < e; i += getNumBytesForUTF8(Text[i])) {
>>>> > +    ++CodePoints;
>>>> > +  }
>>>> > +  return CodePoints;
>>>> > +}
>>>> > +
>>>> > +/// \brief Gets the number of code points in the Text using the specified
>>>> > +/// Encoding.
>>>> > +inline unsigned getCodePointCount(StringRef Text, Encoding Encoding) {
>>>> > +  switch (Encoding) {
>>>> > +    case Encoding_UTF8:
>>>> > +      return getCodePointCountUTF8(Text);
>>>> > +    default:
>>>> > +      return Text.size();
>>>> > +  }
>>>> > +}
>>>> > +
>>>> > +/// \brief Gets the number of bytes in a sequence representing a single
>>>> > +/// codepoint and starting with FirstChar in the specified Encoding.
>>>> > +inline unsigned getCodePointNumBytes(char FirstChar, Encoding Encoding) {
>>>> > +  switch (Encoding) {
>>>> > +    case Encoding_UTF8:
>>>> > +      return getNumBytesForUTF8(FirstChar);
>>>> > +    default:
>>>> > +      return 1;
>>>> > +  }
>>>> > +}
>>>> > +
>>>> > +inline bool isOctDigit(char c) {
>>>> > +  return '0' <= c && c <= '7';
>>>> > +}
>>>> > +
>>>> > +inline bool isHexDigit(char c) {
>>>> > +  return ('0' <= c && c <= '9') || ('a' <= c && c <= 'f') ||
>>>> > +         ('A' <= c && c <= 'F');
>>>> > +}
>>>> > +
>>>> > +/// \brief Gets the length of an escape sequence inside a C++ string literal.
>>>> > +/// Text should span from the beginning of the escape sequence (starting with a
>>>> > +/// backslash) to the end of the string literal.
>>>> > +inline unsigned getEscapeSequenceLength(StringRef Text) {
>>>> > +  assert(Text[0] == '\\');
>>>> > +  if (Text.size() < 2)
>>>> > +    return 1;
>>>> > +
>>>> > +  switch (Text[1]) {
>>>> > +  case 'u':
>>>> > +    return 6;
>>>> > +  case 'U':
>>>> > +    return 10;
>>>> > +  case 'x': {
>>>> > +    unsigned I = 2; // Point after '\x'.
>>>> > +    while (I < Text.size() && isHexDigit(Text[I]))
>>>> > +      ++I;
>>>> > +    return I;
>>>> > +  }
>>>> > +  default:
>>>> > +    if (isOctDigit(Text[1])) {
>>>> > +      unsigned I = 1;
>>>> > +      while (I < Text.size() && I < 4 && isOctDigit(Text[I]))
>>>> > +        ++I;
>>>> > +      return I;
>>>> > +    }
>>>> > +    return 2;
>>>> > +  }
>>>> > +}
>>>> > +
>>>> > +} // namespace encoding
>>>> > +} // namespace format
>>>> > +} // namespace clang
>>>> > +
>>>> > +#endif // LLVM_CLANG_FORMAT_ENCODING_H
>>>> >
>>>> > Modified: cfe/trunk/lib/Format/Format.cpp
>>>> > URL: http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/Format/Format.cpp?rev=183312&r1=183311&r2=183312&view=diff
>>>> > ==============================================================================
>>>> > --- cfe/trunk/lib/Format/Format.cpp (original)
>>>> > +++ cfe/trunk/lib/Format/Format.cpp Wed Jun  5 09:09:10 2013
>>>> > @@ -243,10 +243,11 @@ public:
>>>> >    UnwrappedLineFormatter(const FormatStyle &Style, SourceManager &SourceMgr,
>>>> >                           const AnnotatedLine &Line, unsigned FirstIndent,
>>>> >                           const FormatToken *RootToken,
>>>> > -                         WhitespaceManager &Whitespaces)
>>>> > +                         WhitespaceManager &Whitespaces,
>>>> > +                         encoding::Encoding Encoding)
>>>> >        : Style(Style), SourceMgr(SourceMgr), Line(Line),
>>>> >          FirstIndent(FirstIndent), RootToken(RootToken),
>>>> > -        Whitespaces(Whitespaces), Count(0) {}
>>>> > +        Whitespaces(Whitespaces), Count(0), Encoding(Encoding) {}
>>>> >
>>>> >    /// \brief Formats an \c UnwrappedLine.
>>>> >    void format(const AnnotatedLine *NextLine) {
>>>> > @@ -484,7 +485,7 @@ private:
>>>> >                                   State.NextToken->WhitespaceRange.getEnd()) -
>>>> >                               SourceMgr.getSpellingColumnNumber(
>>>> >                                   State.NextToken->WhitespaceRange.getBegin());
>>>> > -      State.Column += WhitespaceLength + State.NextToken->TokenLength;
>>>> > +      State.Column += WhitespaceLength + State.NextToken->CodePointCount;
>>>> >        State.NextToken = State.NextToken->Next;
>>>> >        return 0;
>>>> >      }
>>>> > @@ -520,11 +521,11 @@ private:
>>>> >                    Line.StartsDefinition)) {
>>>> >          State.Column = State.Stack.back().Indent;
>>>> >        } else if (Current.Type == TT_ObjCSelectorName) {
>>>> > -        if (State.Stack.back().ColonPos > Current.TokenLength) {
>>>> > -          State.Column = State.Stack.back().ColonPos - Current.TokenLength;
>>>> > +        if (State.Stack.back().ColonPos > Current.CodePointCount) {
>>>> > +          State.Column = State.Stack.back().ColonPos - Current.CodePointCount;
>>>> >          } else {
>>>> >            State.Column = State.Stack.back().Indent;
>>>> > -          State.Stack.back().ColonPos = State.Column + Current.TokenLength;
>>>> > +          State.Stack.back().ColonPos = State.Column + Current.CodePointCount;
>>>> >          }
>>>> >        } else if (Current.Type == TT_StartOfName ||
>>>> >                   Previous.isOneOf(tok::coloncolon, tok::equal) ||
>>>> > @@ -560,7 +561,7 @@ private:
>>>> >        State.Stack.back().LastSpace = State.Column;
>>>> >        if (Current.isOneOf(tok::arrow, tok::period) &&
>>>> >            Current.Type != TT_DesignatedInitializerPeriod)
>>>> > -        State.Stack.back().LastSpace += Current.TokenLength;
>>>> > +        State.Stack.back().LastSpace += Current.CodePointCount;
>>>> >        State.StartOfLineLevel = State.ParenLevel;
>>>> >        State.LowestCallLevel = State.ParenLevel;
>>>> >
>>>> > @@ -595,8 +596,8 @@ private:
>>>> >          State.Stack.back().VariablePos = State.Column;
>>>> >          // Move over * and & if they are bound to the variable name.
>>>> >          const FormatToken *Tok = &Previous;
>>>> > -        while (Tok && State.Stack.back().VariablePos >= Tok->TokenLength) {
>>>> > -          State.Stack.back().VariablePos -= Tok->TokenLength;
>>>> > +        while (Tok && State.Stack.back().VariablePos >= Tok->CodePointCount) {
>>>> > +          State.Stack.back().VariablePos -= Tok->CodePointCount;
>>>> >            if (Tok->SpacesRequiredBefore != 0)
>>>> >              break;
>>>> >            Tok = Tok->Previous;
>>>> > @@ -614,12 +615,12 @@ private:
>>>> >        if (Current.Type == TT_ObjCSelectorName &&
>>>> >            State.Stack.back().ColonPos == 0) {
>>>> >          if (State.Stack.back().Indent + Current.LongestObjCSelectorName >
>>>> > -            State.Column + Spaces + Current.TokenLength)
>>>> > +            State.Column + Spaces + Current.CodePointCount)
>>>> >            State.Stack.back().ColonPos =
>>>> >                State.Stack.back().Indent + Current.LongestObjCSelectorName;
>>>> >          else
>>>> >            State.Stack.back().ColonPos =
>>>> > -              State.Column + Spaces + Current.TokenLength;
>>>> > +              State.Column + Spaces + Current.CodePointCount;
>>>> >        }
>>>> >
>>>> >        if (Previous.opensScope() && Previous.Type != TT_ObjCMethodExpr &&
>>>> > @@ -671,7 +672,8 @@ private:
>>>> >        State.LowestCallLevel = std::min(State.LowestCallLevel, State.ParenLevel);
>>>> >        if (Line.Type == LT_BuilderTypeCall && State.ParenLevel == 0)
>>>> >          State.Stack.back().StartOfFunctionCall =
>>>> > -            Current.LastInChainOfCalls ? 0 : State.Column + Current.TokenLength;
>>>> > +            Current.LastInChainOfCalls ? 0
>>>> > +                                       : State.Column + Current.CodePointCount;
>>>> >      }
>>>> >      if (Current.Type == TT_CtorInitializerColon) {
>>>> >        // Indent 2 from the column, so:
>>>> > @@ -779,7 +781,7 @@ private:
>>>> >        State.StartOfStringLiteral = 0;
>>>> >      }
>>>> >
>>>> > -    State.Column += Current.TokenLength;
>>>> > +    State.Column += Current.CodePointCount;
>>>> >
>>>> >      State.NextToken = State.NextToken->Next;
>>>> >
>>>> > @@ -798,7 +800,7 @@ private:
>>>> >                                  bool DryRun) {
>>>> >      unsigned UnbreakableTailLength = Current.UnbreakableTailLength;
>>>> >      llvm::OwningPtr<BreakableToken> Token;
>>>> > -    unsigned StartColumn = State.Column - Current.TokenLength;
>>>> > +    unsigned StartColumn = State.Column - Current.CodePointCount;
>>>> >      unsigned OriginalStartColumn =
>>>> >          SourceMgr.getSpellingColumnNumber(Current.getStartOfNonWhitespace()) -
>>>> >          1;
>>>> > @@ -811,15 +813,16 @@ private:
>>>> >        if (!LiteralData || *LiteralData != '"')
>>>> >          return 0;
>>>> >
>>>> > -      Token.reset(new BreakableStringLiteral(Current, StartColumn));
>>>> > +      Token.reset(new BreakableStringLiteral(Current, StartColumn, Encoding));
>>>> >      } else if (Current.Type == TT_BlockComment) {
>>>> >        BreakableBlockComment *BBC = new BreakableBlockComment(
>>>> > -          Style, Current, StartColumn, OriginalStartColumn, !Current.Previous);
>>>> > +          Style, Current, StartColumn, OriginalStartColumn, !Current.Previous,
>>>> > +          Encoding);
>>>> >        Token.reset(BBC);
>>>> >      } else if (Current.Type == TT_LineComment &&
>>>> >                 (Current.Previous == NULL ||
>>>> >                  Current.Previous->Type != TT_ImplicitStringLiteral)) {
>>>> > -      Token.reset(new BreakableLineComment(Current, StartColumn));
>>>> > +      Token.reset(new BreakableLineComment(Current, StartColumn, Encoding));
>>>> >      } else {
>>>> >        return 0;
>>>> >      }
>>>> > @@ -837,27 +840,27 @@ private:
>>>> >                                         Whitespaces);
>>>> >        }
>>>> >        unsigned TailOffset = 0;
>>>> > -      unsigned RemainingTokenLength =
>>>> > +      unsigned RemainingTokenColumns =
>>>> >            Token->getLineLengthAfterSplit(LineIndex, TailOffset);
>>>> > -      while (RemainingTokenLength > RemainingSpace) {
>>>> > +      while (RemainingTokenColumns > RemainingSpace) {
>>>> >          BreakableToken::Split Split =
>>>> >              Token->getSplit(LineIndex, TailOffset, getColumnLimit());
>>>> >          if (Split.first == StringRef::npos)
>>>> >            break;
>>>> >          assert(Split.first != 0);
>>>> > -        unsigned NewRemainingTokenLength = Token->getLineLengthAfterSplit(
>>>> > +        unsigned NewRemainingTokenColumns = Token->getLineLengthAfterSplit(
>>>> >              LineIndex, TailOffset + Split.first + Split.second);
>>>> > -        assert(NewRemainingTokenLength < RemainingTokenLength);
>>>> > +        assert(NewRemainingTokenColumns < RemainingTokenColumns);
>>>> >          if (!DryRun) {
>>>> >            Token->insertBreak(LineIndex, TailOffset, Split,
>>>> Line.InPPDirective,
>>>> >                               Whitespaces);
>>>> >          }
>>>> >          TailOffset += Split.first + Split.second;
>>>> > -        RemainingTokenLength = NewRemainingTokenLength;
>>>> > +        RemainingTokenColumns = NewRemainingTokenColumns;
>>>> >          Penalty += Style.PenaltyExcessCharacter;
>>>> >          BreakInserted = true;
>>>> >        }
>>>> > -      PositionAfterLastLineInToken = RemainingTokenLength;
>>>> > +      PositionAfterLastLineInToken = RemainingTokenColumns;
>>>> >      }
>>>> >
>>>> >      if (BreakInserted) {
>>>> > @@ -1080,13 +1083,16 @@ private:
>>>> >    // Increasing count of \c StateNode items we have created. This is
>>>> used
>>>> >    // to create a deterministic order independent of the container.
>>>> >    unsigned Count;
>>>> > +  encoding::Encoding Encoding;
>>>> >  };
>>>> >
>>>> >  class FormatTokenLexer {
>>>> >  public:
>>>> > -  FormatTokenLexer(Lexer &Lex, SourceManager &SourceMgr)
>>>> > +  FormatTokenLexer(Lexer &Lex, SourceManager &SourceMgr,
>>>> > +                   encoding::Encoding Encoding)
>>>> >        : FormatTok(NULL), GreaterStashed(false),
>>>> TrailingWhitespace(0), Lex(Lex),
>>>> > -        SourceMgr(SourceMgr), IdentTable(Lex.getLangOpts()) {
>>>> > +        SourceMgr(SourceMgr), IdentTable(Lex.getLangOpts()),
>>>> > +        Encoding(Encoding) {
>>>> >      Lex.SetKeepWhitespaceMode(true);
>>>> >    }
>>>> >
>>>> > @@ -1111,7 +1117,8 @@ private:
>>>> >            FormatTok->Tok.getLocation().getLocWithOffset(1);
>>>> >        FormatTok->WhitespaceRange =
>>>> >            SourceRange(GreaterLocation, GreaterLocation);
>>>> > -      FormatTok->TokenLength = 1;
>>>> > +      FormatTok->ByteCount = 1;
>>>> > +      FormatTok->CodePointCount = 1;
>>>> >        GreaterStashed = false;
>>>> >        return FormatTok;
>>>> >      }
>>>> > @@ -1146,12 +1153,12 @@ private:
>>>> >      }
>>>> >
>>>> >      // Now FormatTok is the next non-whitespace token.
>>>> > -    FormatTok->TokenLength = Text.size();
>>>> > +    FormatTok->ByteCount = Text.size();
>>>> >
>>>> >      TrailingWhitespace = 0;
>>>> >      if (FormatTok->Tok.is(tok::comment)) {
>>>> >        TrailingWhitespace = Text.size() - Text.rtrim().size();
>>>> > -      FormatTok->TokenLength -= TrailingWhitespace;
>>>> > +      FormatTok->ByteCount -= TrailingWhitespace;
>>>> >      }
>>>> >
>>>> >      // In case the token starts with escaped newlines, we want to
>>>> > @@ -1164,7 +1171,7 @@ private:
>>>> >      while (i + 1 < Text.size() && Text[i] == '\\' && Text[i + 1] ==
>>>> '\n') {
>>>> >        // FIXME: ++FormatTok->NewlinesBefore is missing...
>>>> >        WhitespaceLength += 2;
>>>> > -      FormatTok->TokenLength -= 2;
>>>> > +      FormatTok->ByteCount -= 2;
>>>> >        i += 2;
>>>> >      }
>>>> >
>>>> > @@ -1176,15 +1183,19 @@ private:
>>>> >
>>>> >      if (FormatTok->Tok.is(tok::greatergreater)) {
>>>> >        FormatTok->Tok.setKind(tok::greater);
>>>> > -      FormatTok->TokenLength = 1;
>>>> > +      FormatTok->ByteCount = 1;
>>>> >        GreaterStashed = true;
>>>> >      }
>>>> >
>>>> > +    unsigned EncodingExtraBytes =
>>>> > +        Text.size() - encoding::getCodePointCount(Text, Encoding);
>>>> > +    FormatTok->CodePointCount = FormatTok->ByteCount -
>>>> EncodingExtraBytes;
>>>> > +
>>>> >      FormatTok->WhitespaceRange = SourceRange(
>>>> >          WhitespaceStart,
>>>> WhitespaceStart.getLocWithOffset(WhitespaceLength));
>>>> >      FormatTok->TokenText = StringRef(
>>>> >
>>>>  SourceMgr.getCharacterData(FormatTok->getStartOfNonWhitespace()),
>>>> > -        FormatTok->TokenLength);
>>>> > +        FormatTok->ByteCount);
>>>> >      return FormatTok;
>>>> >    }
>>>> >
>>>> > @@ -1194,6 +1205,7 @@ private:
>>>> >    Lexer &Lex;
>>>> >    SourceManager &SourceMgr;
>>>> >    IdentifierTable IdentTable;
>>>> > +  encoding::Encoding Encoding;
>>>> >    llvm::SpecificBumpPtrAllocator<FormatToken> Allocator;
>>>> >    SmallVector<FormatToken *, 16> Tokens;
>>>> >
>>>> > @@ -1209,17 +1221,22 @@ public:
>>>> >    Formatter(const FormatStyle &Style, Lexer &Lex, SourceManager
>>>> &SourceMgr,
>>>> >              const std::vector<CharSourceRange> &Ranges)
>>>> >        : Style(Style), Lex(Lex), SourceMgr(SourceMgr),
>>>> > -        Whitespaces(SourceMgr, Style), Ranges(Ranges) {}
>>>> > +        Whitespaces(SourceMgr, Style), Ranges(Ranges),
>>>> > +        Encoding(encoding::detectEncoding(Lex.getBuffer())) {
>>>> > +    DEBUG(llvm::dbgs()
>>>> > +          << "File encoding: "
>>>> > +          << (Encoding == encoding::Encoding_UTF8 ? "UTF8" :
>>>> "unknown")
>>>> > +          << "\n");
>>>> > +  }
>>>> >
>>>> >    virtual ~Formatter() {}
>>>> >
>>>> >    tooling::Replacements format() {
>>>> > -    FormatTokenLexer Tokens(Lex, SourceMgr);
>>>> > +    FormatTokenLexer Tokens(Lex, SourceMgr, Encoding);
>>>> >
>>>> >      UnwrappedLineParser Parser(Style, Tokens.lex(), *this);
>>>> >      bool StructuralError = Parser.parse();
>>>> > -    TokenAnnotator Annotator(Style, SourceMgr, Lex,
>>>> > -                             Tokens.getIdentTable().get("in"));
>>>> > +    TokenAnnotator Annotator(Style,
>>>> Tokens.getIdentTable().get("in"));
>>>> >      for (unsigned i = 0, e = AnnotatedLines.size(); i != e; ++i) {
>>>> >        Annotator.annotate(AnnotatedLines[i]);
>>>> >      }
>>>> > @@ -1290,7 +1307,7 @@ public:
>>>> >                1;
>>>> >          }
>>>> >          UnwrappedLineFormatter Formatter(Style, SourceMgr, TheLine,
>>>> Indent,
>>>> > -                                         TheLine.First, Whitespaces);
>>>> > +                                         TheLine.First, Whitespaces,
>>>> Encoding);
>>>> >          Formatter.format(I + 1 != E ? &*(I + 1) : NULL);
>>>> >          IndentForLevel[TheLine.Level] = LevelIndent;
>>>> >          PreviousLineWasTouched = true;
>>>> > @@ -1556,7 +1573,7 @@ private:
>>>> >      CharSourceRange LineRange = CharSourceRange::getCharRange(
>>>> >          First->WhitespaceRange.getBegin().getLocWithOffset(
>>>> >              First->LastNewlineOffset),
>>>> > -        Last->Tok.getLocation().getLocWithOffset(Last->TokenLength -
>>>> 1));
>>>> > +        Last->Tok.getLocation().getLocWithOffset(Last->ByteCount -
>>>> 1));
>>>> >      return touchesRanges(LineRange);
>>>> >    }
>>>> >
>>>> > @@ -1616,6 +1633,8 @@ private:
>>>> >    WhitespaceManager Whitespaces;
>>>> >    std::vector<CharSourceRange> Ranges;
>>>> >    std::vector<AnnotatedLine> AnnotatedLines;
>>>> > +
>>>> > +  encoding::Encoding Encoding;
>>>> >  };
>>>> >
>>>> >  tooling::Replacements reformat(const FormatStyle &Style, Lexer &Lex,
>>>> >
>>>> > Modified: cfe/trunk/lib/Format/FormatToken.h
>>>> > URL:
>>>> http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/Format/FormatToken.h?rev=183312&r1=183311&r2=183312&view=diff
>>>> >
>>>> ==============================================================================
>>>> > --- cfe/trunk/lib/Format/FormatToken.h (original)
>>>> > +++ cfe/trunk/lib/Format/FormatToken.h Wed Jun  5 09:09:10 2013
>>>> > @@ -61,11 +61,12 @@ enum TokenType {
>>>> >  struct FormatToken {
>>>> >    FormatToken()
>>>> >        : NewlinesBefore(0), HasUnescapedNewline(false),
>>>> LastNewlineOffset(0),
>>>> > -        TokenLength(0), IsFirst(false), MustBreakBefore(false),
>>>> > -        Type(TT_Unknown), SpacesRequiredBefore(0),
>>>> CanBreakBefore(false),
>>>> > -        ClosesTemplateDeclaration(false), ParameterCount(0),
>>>> TotalLength(0),
>>>> > -        UnbreakableTailLength(0), BindingStrength(0),
>>>> SplitPenalty(0),
>>>> > -        LongestObjCSelectorName(0), FakeRParens(0),
>>>> LastInChainOfCalls(false),
>>>> > +        ByteCount(0), CodePointCount(0), IsFirst(false),
>>>> > +        MustBreakBefore(false), Type(TT_Unknown),
>>>> SpacesRequiredBefore(0),
>>>> > +        CanBreakBefore(false), ClosesTemplateDeclaration(false),
>>>> > +        ParameterCount(0), TotalLength(0), UnbreakableTailLength(0),
>>>> > +        BindingStrength(0), SplitPenalty(0),
>>>> LongestObjCSelectorName(0),
>>>> > +        FakeRParens(0), LastInChainOfCalls(false),
>>>> >          PartOfMultiVariableDeclStmt(false), MatchingParen(NULL),
>>>> Previous(NULL),
>>>> >          Next(NULL) {}
>>>> >
>>>> > @@ -89,10 +90,14 @@ struct FormatToken {
>>>> >    /// whitespace (relative to \c WhiteSpaceStart). 0 if there is no
>>>> '\n'.
>>>> >    unsigned LastNewlineOffset;
>>>> >
>>>> > -  /// \brief The length of the non-whitespace parts of the token.
>>>> This is
>>>> > -  /// necessary because we need to handle escaped newlines that are
>>>> stored
>>>> > +  /// \brief The number of bytes of the non-whitespace parts of the
>>>> token. This
>>>> > +  /// is necessary because we need to handle escaped newlines that
>>>> are stored
>>>> >    /// with the token.
>>>> > -  unsigned TokenLength;
>>>> > +  unsigned ByteCount;
>>>> > +
>>>> > +  /// \brief The length of the non-whitespace parts of the token in
>>>> CodePoints.
>>>> > +  /// We need this to correctly measure number of columns a token
>>>> spans.
>>>> > +  unsigned CodePointCount;
>>>> >
>>>> >    /// \brief Indicates that this is the first token.
>>>> >    bool IsFirst;
>>>> >
>>>> > Modified: cfe/trunk/lib/Format/TokenAnnotator.cpp
>>>> > URL:
>>>> http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/Format/TokenAnnotator.cpp?rev=183312&r1=183311&r2=183312&view=diff
>>>> >
>>>> ==============================================================================
>>>> > --- cfe/trunk/lib/Format/TokenAnnotator.cpp (original)
>>>> > +++ cfe/trunk/lib/Format/TokenAnnotator.cpp Wed Jun  5 09:09:10 2013
>>>> > @@ -15,7 +15,6 @@
>>>> >
>>>> >  #include "TokenAnnotator.h"
>>>> >  #include "clang/Basic/SourceManager.h"
>>>> > -#include "clang/Lex/Lexer.h"
>>>> >  #include "llvm/Support/Debug.h"
>>>> >
>>>> >  namespace clang {
>>>> > @@ -28,10 +27,9 @@ namespace format {
>>>> >  /// into template parameter lists.
>>>> >  class AnnotatingParser {
>>>> >  public:
>>>> > -  AnnotatingParser(SourceManager &SourceMgr, Lexer &Lex,
>>>> AnnotatedLine &Line,
>>>> > -                   IdentifierInfo &Ident_in)
>>>> > -      : SourceMgr(SourceMgr), Lex(Lex), Line(Line),
>>>> CurrentToken(Line.First),
>>>> > -        KeywordVirtualFound(false), NameFound(false),
>>>> Ident_in(Ident_in) {
>>>> > +  AnnotatingParser(AnnotatedLine &Line, IdentifierInfo &Ident_in)
>>>> > +      : Line(Line), CurrentToken(Line.First),
>>>> KeywordVirtualFound(false),
>>>> > +        NameFound(false), Ident_in(Ident_in) {
>>>> >      Contexts.push_back(Context(tok::unknown, 1, /*IsExpression=*/
>>>> false));
>>>> >    }
>>>> >
>>>> > @@ -295,9 +293,11 @@ private:
>>>> >                   Line.First->Type == TT_ObjCMethodSpecifier) {
>>>> >          Tok->Type = TT_ObjCMethodExpr;
>>>> >          Tok->Previous->Type = TT_ObjCSelectorName;
>>>> > -        if (Tok->Previous->TokenLength >
>>>> > -            Contexts.back().LongestObjCSelectorName)
>>>> > -          Contexts.back().LongestObjCSelectorName =
>>>> Tok->Previous->TokenLength;
>>>> > +        if (Tok->Previous->CodePointCount >
>>>> > +            Contexts.back().LongestObjCSelectorName) {
>>>> > +          Contexts.back().LongestObjCSelectorName =
>>>> > +              Tok->Previous->CodePointCount;
>>>> > +        }
>>>> >          if (Contexts.back().FirstObjCSelectorName == NULL)
>>>> >            Contexts.back().FirstObjCSelectorName = Tok->Previous;
>>>> >        } else if (Contexts.back().ColonIsForRangeExpr) {
>>>> > @@ -602,9 +602,7 @@ private:
>>>> >        } else if (Current.isBinaryOperator()) {
>>>> >          Current.Type = TT_BinaryOperator;
>>>> >        } else if (Current.is(tok::comment)) {
>>>> > -        std::string Data(
>>>> > -            Lexer::getSpelling(Current.Tok, SourceMgr,
>>>> Lex.getLangOpts()));
>>>> > -        if (StringRef(Data).startswith("//"))
>>>> > +        if (Current.TokenText.startswith("//"))
>>>> >            Current.Type = TT_LineComment;
>>>> >          else
>>>> >            Current.Type = TT_BlockComment;
>>>> > @@ -748,23 +746,19 @@ private:
>>>> >      case tok::kw_wchar_t:
>>>> >      case tok::kw_bool:
>>>> >      case tok::kw___underlying_type:
>>>> > -      return true;
>>>> >      case tok::annot_typename:
>>>> >      case tok::kw_char16_t:
>>>> >      case tok::kw_char32_t:
>>>> >      case tok::kw_typeof:
>>>> >      case tok::kw_decltype:
>>>> > -      return Lex.getLangOpts().CPlusPlus;
>>>> > +      return true;
>>>> >      default:
>>>> > -      break;
>>>> > +      return false;
>>>> >      }
>>>> > -    return false;
>>>> >    }
>>>> >
>>>> >    SmallVector<Context, 8> Contexts;
>>>> >
>>>> > -  SourceManager &SourceMgr;
>>>> > -  Lexer &Lex;
>>>> >    AnnotatedLine &Line;
>>>> >    FormatToken *CurrentToken;
>>>> >    bool KeywordVirtualFound;
>>>> > @@ -866,7 +860,7 @@ private:
>>>> >  };
>>>> >
>>>> >  void TokenAnnotator::annotate(AnnotatedLine &Line) {
>>>> > -  AnnotatingParser Parser(SourceMgr, Lex, Line, Ident_in);
>>>> > +  AnnotatingParser Parser(Line, Ident_in);
>>>> >    Line.Type = Parser.parseLine();
>>>> >    if (Line.Type == LT_Invalid)
>>>> >      return;
>>>> > @@ -886,7 +880,7 @@ void TokenAnnotator::annotate(AnnotatedL
>>>> >  }
>>>> >
>>>> >  void TokenAnnotator::calculateFormattingInformation(AnnotatedLine
>>>> &Line) {
>>>> > -  Line.First->TotalLength = Line.First->TokenLength;
>>>> > +  Line.First->TotalLength = Line.First->CodePointCount;
>>>> >    if (!Line.First->Next)
>>>> >      return;
>>>> >    FormatToken *Current = Line.First->Next;
>>>> > @@ -920,7 +914,7 @@ void TokenAnnotator::calculateFormatting
>>>> >        Current->TotalLength = Current->Previous->TotalLength +
>>>> Style.ColumnLimit;
>>>> >      else
>>>> >        Current->TotalLength =
>>>> > -          Current->Previous->TotalLength + Current->TokenLength +
>>>> > +          Current->Previous->TotalLength + Current->CodePointCount +
>>>> >            Current->SpacesRequiredBefore;
>>>> >      // FIXME: Only calculate this if CanBreakBefore is true once
>>>> static
>>>> >      // initializers etc. are sorted out.
>>>> > @@ -947,7 +941,7 @@ void TokenAnnotator::calculateUnbreakabl
>>>> >        UnbreakableTailLength = 0;
>>>> >      } else {
>>>> >        UnbreakableTailLength +=
>>>> > -          Current->TokenLength + Current->SpacesRequiredBefore;
>>>> > +          Current->CodePointCount + Current->SpacesRequiredBefore;
>>>> >      }
>>>> >      Current = Current->Previous;
>>>> >    }
>>>> > @@ -1015,8 +1009,7 @@ unsigned TokenAnnotator::splitPenalty(co
>>>> >
>>>> >    if (Right.is(tok::lessless)) {
>>>> >      if (Left.is(tok::string_literal)) {
>>>> > -      StringRef Content =
>>>> > -          StringRef(Left.Tok.getLiteralData(), Left.TokenLength);
>>>> > +      StringRef Content = Left.TokenText;
>>>> >        Content = Content.drop_back(1).drop_front(1).trim();
>>>> >        if (Content.size() > 1 &&
>>>> >            (Content.back() == ':' || Content.back() == '='))
>>>> >
>>>> > Modified: cfe/trunk/lib/Format/TokenAnnotator.h
>>>> > URL:
>>>> http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/Format/TokenAnnotator.h?rev=183312&r1=183311&r2=183312&view=diff
>>>> >
>>>> ==============================================================================
>>>> > --- cfe/trunk/lib/Format/TokenAnnotator.h (original)
>>>> > +++ cfe/trunk/lib/Format/TokenAnnotator.h Wed Jun  5 09:09:10 2013
>>>> > @@ -21,7 +21,6 @@
>>>> >  #include <string>
>>>> >
>>>> >  namespace clang {
>>>> > -class Lexer;
>>>> >  class SourceManager;
>>>> >
>>>> >  namespace format {
>>>> > @@ -71,10 +70,8 @@ public:
>>>> >  /// \c UnwrappedLine.
>>>> >  class TokenAnnotator {
>>>> >  public:
>>>> > -  TokenAnnotator(const FormatStyle &Style, SourceManager &SourceMgr,
>>>> Lexer &Lex,
>>>> > -                 IdentifierInfo &Ident_in)
>>>> > -      : Style(Style), SourceMgr(SourceMgr), Lex(Lex),
>>>> Ident_in(Ident_in) {
>>>> > -  }
>>>> > +  TokenAnnotator(const FormatStyle &Style, IdentifierInfo &Ident_in)
>>>> > +      : Style(Style), Ident_in(Ident_in) {}
>>>> >
>>>> >    void annotate(AnnotatedLine &Line);
>>>> >    void calculateFormattingInformation(AnnotatedLine &Line);
>>>> > @@ -95,8 +92,6 @@ private:
>>>> >    void calculateUnbreakableTailLengths(AnnotatedLine &Line);
>>>> >
>>>> >    const FormatStyle &Style;
>>>> > -  SourceManager &SourceMgr;
>>>> > -  Lexer &Lex;
>>>> >
>>>> >    // Contextual keywords:
>>>> >    IdentifierInfo &Ident_in;
>>>> >
>>>> > Modified: cfe/trunk/unittests/Format/FormatTest.cpp
>>>> > URL:
>>>> http://llvm.org/viewvc/llvm-project/cfe/trunk/unittests/Format/FormatTest.cpp?rev=183312&r1=183311&r2=183312&view=diff
>>>> >
>>>> ==============================================================================
>>>> > --- cfe/trunk/unittests/Format/FormatTest.cpp (original)
>>>> > +++ cfe/trunk/unittests/Format/FormatTest.cpp Wed Jun  5 09:09:10 2013
>>>> > @@ -4873,5 +4873,80 @@ TEST_F(FormatTest, ConfigurationRoundTri
>>>> >    EXPECT_EQ(Style, ParsedStyle);
>>>> >  }
>>>> >
>>>> > +TEST_F(FormatTest, WorksFor8bitEncodings) {
>>>> > +  EXPECT_EQ("\"\xce\xe4\xed\xe0\xe6\xe4\xfb \xe2 \"\n"
>>>> > +            "\"\xf1\xf2\xf3\xe4\xb8\xed\xf3\xfe \"\n"
>>>> > +            "\"\xe7\xe8\xec\xed\xfe\xfe \"\n"
>>>> > +            "\"\xef\xee\xf0\xf3...\"",
>>>> > +            format("\"\xce\xe4\xed\xe0\xe6\xe4\xfb \xe2 "
>>>> > +                   "\xf1\xf2\xf3\xe4\xb8\xed\xf3\xfe \xe7\xe8\xec\xed\xfe\xfe "
>>>> > +                   "\xef\xee\xf0\xf3...\"",
>>>> > +                   getLLVMStyleWithColumns(12)));
>>>> > +}
>>>> > +
>>>> > +TEST_F(FormatTest, CountsUTF8CharactersProperly) {
>>>> > +  verifyFormat("\"Однажды в студёную зимнюю пору...\"",
>>>> > +               getLLVMStyleWithColumns(35));
>>>> > +  verifyFormat("\"一 二 三 四 五 六 七 八 九 十\"",
>>>> > +               getLLVMStyleWithColumns(21));
>>>> > +  verifyFormat("// Однажды в студёную зимнюю пору...",
>>>> > +               getLLVMStyleWithColumns(36));
>>>> > +  verifyFormat("// 一 二 三 四 五 六 七 八 九 十",
>>>> > +               getLLVMStyleWithColumns(22));
>>>> > +  verifyFormat("/* Однажды в студёную зимнюю пору... */",
>>>> > +               getLLVMStyleWithColumns(39));
>>>> > +  verifyFormat("/* 一 二 三 四 五 六 七 八 九 十 */",
>>>> > +               getLLVMStyleWithColumns(25));
>>>> > +}
>>>> > +
>>>> > +TEST_F(FormatTest, SplitsUTF8Strings) {
>>>> > +  EXPECT_EQ(
>>>> > +      "\"Однажды, в \"\n"
>>>> > +      "\"студёную \"\n"
>>>> > +      "\"зимнюю \"\n"
>>>> > +      "\"пору,\"",
>>>> > +      format("\"Однажды, в студёную зимнюю пору,\"",
>>>> > +             getLLVMStyleWithColumns(13)));
>>>> > +  EXPECT_EQ("\"一 二 三 四 \"\n"
>>>> > +            "\"五 六 七 八 \"\n"
>>>> > +            "\"九 十\"",
>>>> > +            format("\"一 二 三 四 五 六 七 八 九 十\"",
>>>> > +                   getLLVMStyleWithColumns(10)));
>>>> > +}
>>>> > +
>>>> > +TEST_F(FormatTest, SplitsUTF8LineComments) {
>>>> > +  EXPECT_EQ("// Я из лесу\n"
>>>> > +            "// вышел; был\n"
>>>> > +            "// сильный\n"
>>>> > +            "// мороз.",
>>>> > +            format("// Я из лесу вышел; был сильный мороз.",
>>>> > +                   getLLVMStyleWithColumns(13)));
>>>> > +  EXPECT_EQ("// 一二三\n"
>>>> > +            "// 四五六七\n"
>>>> > +            "// 八\n"
>>>> > +            "// 九 十",
>>>> > +            format("// 一二三 四五六七 八  九 十", getLLVMStyleWithColumns(6)));
>>>> > +}
>>>> > +
>>>> > +TEST_F(FormatTest, SplitsUTF8BlockComments) {
>>>> > +  EXPECT_EQ("/* Гляжу,\n"
>>>> > +            " * поднимается\n"
>>>> > +            " * медленно в\n"
>>>> > +            " * гору\n"
>>>> > +            " * Лошадка,\n"
>>>> > +            " * везущая\n"
>>>> > +            " * хворосту\n"
>>>> > +            " * воз. */",
>>>> > +            format("/* Гляжу, поднимается медленно в гору\n"
>>>> > +                   " * Лошадка, везущая хворосту воз. */",
>>>> > +                   getLLVMStyleWithColumns(13)));
>>>> > +  EXPECT_EQ("/* 一二三\n"
>>>> > +            " * 四五六七\n"
>>>> > +            " * 八\n"
>>>> > +            " * 九 十\n"
>>>> > +            " */",
>>>> > +            format("/* 一二三 四五六七 八  九 十 */", getLLVMStyleWithColumns(6)));
>>>> > +}
>>>> > +
>>>> >  } // end namespace tooling
>>>> >  } // end namespace clang
>>>> >
>>>> >
>>>> > _______________________________________________
>>>> > cfe-commits mailing list
>>>> > cfe-commits at cs.uiuc.edu
>>>> > http://lists.cs.uiuc.edu/mailman/listinfo/cfe-commits
>>>>
>>>
>>>
>>
>
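
A note on the ByteCount / CodePointCount split in the quoted patch: the
lexer keeps the byte length for source ranges and derives the column width
by subtracting the "extra" bytes that multi-byte UTF-8 sequences occupy.
Below is a minimal sketch of that arithmetic. The helper names are
hypothetical (this is not encoding::getCodePointCount from the patch), and
it assumes well-formed UTF-8 and that any stripped whitespace or escaped
newlines are plain ASCII.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for a code point counter. In well-formed UTF-8
// every code point has exactly one byte that is NOT a continuation byte
// (continuation bytes have the bit pattern 10xxxxxx).
inline std::size_t countCodePointsUtf8(const std::string &Text) {
  std::size_t Count = 0;
  for (unsigned char C : Text)
    if ((C & 0xC0) != 0x80)
      ++Count;
  return Count;
}

// Same shape as the computation in the patch: Text is the full lexed
// text, ByteCount is the length left after trimming ASCII-only parts.
// The bytes beyond one per code point are subtracted from ByteCount.
inline std::size_t codePointCountForToken(std::size_t ByteCount,
                                          const std::string &Text) {
  std::size_t ExtraBytes = Text.size() - countCodePointsUtf8(Text);
  assert(ByteCount <= Text.size() && ExtraBytes <= ByteCount);
  return ByteCount - ExtraBytes;
}
```

For a comment token whose text is "// " followed by four two-byte Cyrillic
letters and two trailing spaces, Text.size() is 13, ByteCount after
trimming is 11, and the result is 7. Note that this counts code points, not
display cells, so it treats a CJK character as one column, which is also
what the column limits in the tests above assume.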

-- 
Alexander Kornienko | Software Engineer | alexfh at google.com | +49 151 221
77 957
Google Germany GmbH | Dienerstr. 12 | 80331 München