r183312 - UTF-8 support for clang-format.

Jordan Rose jordan_rose at apple.com
Fri Jun 7 09:39:50 PDT 2013


What's wrong with using escapes? (Other than readability... or is that it?)

On Jun 7, 2013, at 8:46 , Alexander Kornienko <alexfh at google.com> wrote:

> A bit more information: the issue that tracks support for UTF-8 BOM is http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33415. The patch seems to be present in GCC from version 4.4.0 on. LLVM claims to support GCC from 3.4 (with numerous exceptions): http://llvm.org/docs/GettingStarted.html#software
> 
> So I'd rather not introduce a change that would further disappoint the poor people who still have to deal with older GCC versions.
> 
> I'll leave it #ifdefed out until we move it to a file-based test.
> 
> 
> 
> On Fri, Jun 7, 2013 at 3:52 PM, Alexander Kornienko <alexfh at google.com> wrote:
> On Fri, Jun 7, 2013 at 6:46 AM, Nico Weber <thakis at chromium.org> wrote:
> On Thu, Jun 6, 2013 at 4:49 PM, Alexander Kornienko <alexfh at google.com> wrote:
> On Thu, Jun 6, 2013 at 1:11 AM, NAKAMURA Takumi <geek4civic at gmail.com> wrote:
> I wonder whether the source file may contain UTF-8 characters.
> 
> It's implementation-defined behavior. Apparently GCC and Clang handle this correctly.
>  
> In fact, without a BOM, MS cl.exe misdetects the charset: it uses the
> system codepage (932) rather than the current codepage (65001).
>  
> Seems like adding a UTF-8 BOM is the only way to force MSVC to treat a source file as UTF-8. But this is not supported by GCC and Clang, AFAIK.
> 
> clang's Lexer::InitLexer() skips BOMs.
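
For readers unfamiliar with the BOM being discussed: a UTF-8 byte order mark is the three-byte sequence EF BB BF at the very start of a file. A minimal sketch of skipping it could look like the following. The helper name stripUtf8Bom is hypothetical, and this is not how clang's Lexer does it; it only illustrates the idea.

```cpp
#include <string>

// Hypothetical helper, not part of clang: returns Text without a leading
// UTF-8 byte order mark (bytes EF BB BF) if one is present.
inline std::string stripUtf8Bom(const std::string &Text) {
  static const char Bom[] = "\xEF\xBB\xBF";
  // compare() only looks at the first three bytes of Text (or fewer if
  // Text is shorter), so a short or empty input is simply returned as-is.
  if (Text.compare(0, 3, Bom) == 0)
    return Text.substr(3);
  return Text;
}
```

Input that does not start with the full three-byte mark comes back unchanged, so the helper is safe to call on any buffer.
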
> 
> Sounds interesting. And here they say that GCC also supports this. I've checked with Clang trunk and GCC 4.6.3, and it works. So is there any reason not to just add a UTF-8 BOM?
>  
>  
> 
> Could you get rid of raw utf8 characters and encode them in literals?
> FYI, I can see Cyrillic and CJK :)
> 
> There's a plan to make some of our tests file-based instead of unit tests. I think the UTF-8 tests are the first candidates for this. As UTF-8 support is not the most important thing for Windows builds of clang-format, I'd leave the new tests #ifdefed out for now. BTW, thanks for doing this.
>  
> 
> ...Takumi
> 
> 2013/6/5 Alexander Kornienko <alexfh at google.com>:
> > Author: alexfh
> > Date: Wed Jun  5 09:09:10 2013
> > New Revision: 183312
> >
> > URL: http://llvm.org/viewvc/llvm-project?rev=183312&view=rev
> > Log:
> > UTF-8 support for clang-format.
> >
> > Summary:
> > Detect if the file is valid UTF-8, and if this is the case, count code
> > points instead of just using number of bytes in all (hopefully) places, where
> > number of columns is needed. In particular, use the new
> > FormatToken.CodePointCount instead of TokenLength where appropriate.
> > Changed BreakableToken implementations to respect utf-8 character boundaries
> > when in utf-8 mode.
> >
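
To make the "count code points instead of bytes" part of the summary concrete, here is a small standalone sketch. It is not the code from the patch: the name countCodePointsUtf8 is made up, the input is assumed to be valid UTF-8 already, and it counts non-continuation bytes rather than stepping by lead-byte length as the patch does. The non-ASCII input is spelled with escapes so the example does not depend on the source file's encoding.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in Text, which is assumed to be
// valid UTF-8. Continuation bytes have the form 10xxxxxx; every other byte
// starts a new code point.
inline unsigned countCodePointsUtf8(const std::string &Text) {
  unsigned Count = 0;
  for (std::size_t I = 0; I < Text.size(); ++I) {
    if ((static_cast<unsigned char>(Text[I]) & 0xC0) != 0x80)
      ++Count;
  }
  return Count;
}
```

For example, the two Cyrillic letters encoded as "\xD0\x9F\xD1\x80" occupy four bytes but count as two code points. Note that a code point count is still only an approximation of display columns (wide CJK characters and combining marks are not accounted for).
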
> > Reviewers: klimek, djasper
> >
> > Reviewed By: djasper
> >
> > CC: cfe-commits, rsmith, gribozavr
> >
> > Differential Revision: http://llvm-reviews.chandlerc.com/D918
> >
> > Added:
> >     cfe/trunk/lib/Format/Encoding.h
> > Modified:
> >     cfe/trunk/lib/Format/BreakableToken.cpp
> >     cfe/trunk/lib/Format/BreakableToken.h
> >     cfe/trunk/lib/Format/Format.cpp
> >     cfe/trunk/lib/Format/FormatToken.h
> >     cfe/trunk/lib/Format/TokenAnnotator.cpp
> >     cfe/trunk/lib/Format/TokenAnnotator.h
> >     cfe/trunk/unittests/Format/FormatTest.cpp
> >
> > Modified: cfe/trunk/lib/Format/BreakableToken.cpp
> > URL: http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/Format/BreakableToken.cpp?rev=183312&r1=183311&r2=183312&view=diff
> > ==============================================================================
> > --- cfe/trunk/lib/Format/BreakableToken.cpp (original)
> > +++ cfe/trunk/lib/Format/BreakableToken.cpp Wed Jun  5 09:09:10 2013
> > @@ -25,66 +25,22 @@ namespace clang {
> >  namespace format {
> >  namespace {
> >
> > -// FIXME: Move helper string functions to where it makes sense.
> > -
> > -unsigned getOctalLength(StringRef Text) {
> > -  unsigned I = 1;
> > -  while (I < Text.size() && I < 4 && (Text[I] >= '0' && Text[I] <= '7')) {
> > -    ++I;
> > -  }
> > -  return I;
> > -}
> > -
> > -unsigned getHexLength(StringRef Text) {
> > -  unsigned I = 2; // Point after '\x'.
> > -  while (I < Text.size() && ((Text[I] >= '0' && Text[I] <= '9') ||
> > -                             (Text[I] >= 'a' && Text[I] <= 'f') ||
> > -                             (Text[I] >= 'A' && Text[I] <= 'F'))) {
> > -    ++I;
> > -  }
> > -  return I;
> > -}
> > -
> > -unsigned getEscapeSequenceLength(StringRef Text) {
> > -  assert(Text[0] == '\\');
> > -  if (Text.size() < 2)
> > -    return 1;
> > -
> > -  switch (Text[1]) {
> > -  case 'u':
> > -    return 6;
> > -  case 'U':
> > -    return 10;
> > -  case 'x':
> > -    return getHexLength(Text);
> > -  default:
> > -    if (Text[1] >= '0' && Text[1] <= '7')
> > -      return getOctalLength(Text);
> > -    return 2;
> > -  }
> > -}
> > -
> > -StringRef::size_type getStartOfCharacter(StringRef Text,
> > -                                         StringRef::size_type Offset) {
> > -  StringRef::size_type NextEscape = Text.find('\\');
> > -  while (NextEscape != StringRef::npos && NextEscape < Offset) {
> > -    StringRef::size_type SequenceLength =
> > -        getEscapeSequenceLength(Text.substr(NextEscape));
> > -    if (Offset < NextEscape + SequenceLength)
> > -      return NextEscape;
> > -    NextEscape = Text.find('\\', NextEscape + SequenceLength);
> > -  }
> > -  return Offset;
> > -}
> > -
> >  BreakableToken::Split getCommentSplit(StringRef Text,
> >                                        unsigned ContentStartColumn,
> > -                                      unsigned ColumnLimit) {
> > +                                      unsigned ColumnLimit,
> > +                                      encoding::Encoding Encoding) {
> >    if (ColumnLimit <= ContentStartColumn + 1)
> >      return BreakableToken::Split(StringRef::npos, 0);
> >
> >    unsigned MaxSplit = ColumnLimit - ContentStartColumn + 1;
> > -  StringRef::size_type SpaceOffset = Text.rfind(' ', MaxSplit);
> > +  unsigned MaxSplitBytes = 0;
> > +
> > +  for (unsigned NumChars = 0;
> > +       NumChars < MaxSplit && MaxSplitBytes < Text.size(); ++NumChars)
> > +    MaxSplitBytes +=
> > +        encoding::getCodePointNumBytes(Text[MaxSplitBytes], Encoding);
> > +
> > +  StringRef::size_type SpaceOffset = Text.rfind(' ', MaxSplitBytes);
> >    if (SpaceOffset == StringRef::npos ||
> >        // Don't break at leading whitespace.
> >        Text.find_last_not_of(' ', SpaceOffset) == StringRef::npos) {
> > @@ -95,7 +51,7 @@ BreakableToken::Split getCommentSplit(St
> >        // If the comment is only whitespace, we cannot split.
> >        return BreakableToken::Split(StringRef::npos, 0);
> >      SpaceOffset =
> > -        Text.find(' ', std::max<unsigned>(MaxSplit, FirstNonWhitespace));
> > +        Text.find(' ', std::max<unsigned>(MaxSplitBytes, FirstNonWhitespace));
> >    }
> >    if (SpaceOffset != StringRef::npos && SpaceOffset != 0) {
> >      StringRef BeforeCut = Text.substr(0, SpaceOffset).rtrim();
> > @@ -108,25 +64,48 @@ BreakableToken::Split getCommentSplit(St
> >
> >  BreakableToken::Split getStringSplit(StringRef Text,
> >                                       unsigned ContentStartColumn,
> > -                                     unsigned ColumnLimit) {
> > -
> > -  if (ColumnLimit <= ContentStartColumn)
> > -    return BreakableToken::Split(StringRef::npos, 0);
> > -  unsigned MaxSplit = ColumnLimit - ContentStartColumn;
> > +                                     unsigned ColumnLimit,
> > +                                     encoding::Encoding Encoding) {
> >    // FIXME: Reduce unit test case.
> >    if (Text.empty())
> >      return BreakableToken::Split(StringRef::npos, 0);
> > -  MaxSplit = std::min<unsigned>(MaxSplit, Text.size() - 1);
> > -  StringRef::size_type SpaceOffset = Text.rfind(' ', MaxSplit);
> > -  if (SpaceOffset != StringRef::npos && SpaceOffset != 0)
> > +  if (ColumnLimit <= ContentStartColumn)
> > +    return BreakableToken::Split(StringRef::npos, 0);
> > +  unsigned MaxSplit =
> > +      std::min<unsigned>(ColumnLimit - ContentStartColumn,
> > +                         encoding::getCodePointCount(Text, Encoding) - 1);
> > +  StringRef::size_type SpaceOffset = 0;
> > +  StringRef::size_type SlashOffset = 0;
> > +  StringRef::size_type SplitPoint = 0;
> > +  for (unsigned Chars = 0;;) {
> > +    unsigned Advance;
> > +    if (Text[0] == '\\') {
> > +      Advance = encoding::getEscapeSequenceLength(Text);
> > +      Chars += Advance;
> > +    } else {
> > +      Advance = encoding::getCodePointNumBytes(Text[0], Encoding);
> > +      Chars += 1;
> > +    }
> > +
> > +    if (Chars > MaxSplit)
> > +      break;
> > +
> > +    if (Text[0] == ' ')
> > +      SpaceOffset = SplitPoint;
> > +    if (Text[0] == '/')
> > +      SlashOffset = SplitPoint;
> > +
> > +    SplitPoint += Advance;
> > +    Text = Text.substr(Advance);
> > +  }
> > +
> > +  if (SpaceOffset != 0)
> >      return BreakableToken::Split(SpaceOffset + 1, 0);
> > -  StringRef::size_type SlashOffset = Text.rfind('/', MaxSplit);
> > -  if (SlashOffset != StringRef::npos && SlashOffset != 0)
> > +  if (SlashOffset != 0)
> >      return BreakableToken::Split(SlashOffset + 1, 0);
> > -  StringRef::size_type SplitPoint = getStartOfCharacter(Text, MaxSplit);
> > -  if (SplitPoint == StringRef::npos || SplitPoint == 0)
> > -    return BreakableToken::Split(StringRef::npos, 0);
> > -  return BreakableToken::Split(SplitPoint, 0);
> > +  if (SplitPoint != 0)
> > +    return BreakableToken::Split(SplitPoint, 0);
> > +  return BreakableToken::Split(StringRef::npos, 0);
> >  }
> >
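
The MaxSplitBytes loop in getCommentSplit above converts a limit expressed in characters into a byte offset that falls on a character boundary. A standalone sketch of that conversion, under some assumptions: utf8SequenceLength is a simplified, hypothetical stand-in for getNumBytesForUTF8 (its treatment of invalid lead bytes is my choice, not LLVM's), and the final clamp for truncated input is an addition that the patch does not have.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for getNumBytesForUTF8: length of a UTF-8 sequence
// judged by its first byte. Invalid lead bytes are treated as length 1.
inline unsigned utf8SequenceLength(char FirstChar) {
  unsigned char C = static_cast<unsigned char>(FirstChar);
  if (C < 0x80)
    return 1;
  if ((C & 0xE0) == 0xC0)
    return 2;
  if ((C & 0xF0) == 0xE0)
    return 3;
  if ((C & 0xF8) == 0xF0)
    return 4;
  return 1;
}

// Returns the byte offset reached after at most MaxCodePoints code points,
// never exceeding Text.size() (this guards against a truncated sequence).
inline std::size_t byteOffsetForCodePoints(const std::string &Text,
                                           unsigned MaxCodePoints) {
  std::size_t Bytes = 0;
  for (unsigned N = 0; N < MaxCodePoints && Bytes < Text.size(); ++N)
    Bytes += utf8SequenceLength(Text[Bytes]);
  return Bytes < Text.size() ? Bytes : Text.size();
}
```

With the input "\xD0\x9F\xD1\x80 ab" (two Cyrillic letters, a space, "ab"), a limit of 2 code points yields byte offset 4, so a later rfind(' ', Offset) never starts in the middle of a multi-byte character.
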
> >  } // namespace
> > @@ -136,8 +115,8 @@ unsigned BreakableSingleLineToken::getLi
> >  unsigned
> >  BreakableSingleLineToken::getLineLengthAfterSplit(unsigned LineIndex,
> >                                                    unsigned TailOffset) const {
> > -  return StartColumn + Prefix.size() + Postfix.size() + Line.size() -
> > -         TailOffset;
> > +  return StartColumn + Prefix.size() + Postfix.size() +
> > +         encoding::getCodePointCount(Line.substr(TailOffset), Encoding);
> >  }
> >
> >  void BreakableSingleLineToken::insertBreak(unsigned LineIndex,
> > @@ -152,8 +131,9 @@ void BreakableSingleLineToken::insertBre
> >  BreakableSingleLineToken::BreakableSingleLineToken(const FormatToken &Tok,
> >                                                     unsigned StartColumn,
> >                                                     StringRef Prefix,
> > -                                                   StringRef Postfix)
> > -    : BreakableToken(Tok), StartColumn(StartColumn), Prefix(Prefix),
> > +                                                   StringRef Postfix,
> > +                                                   encoding::Encoding Encoding)
> > +    : BreakableToken(Tok, Encoding), StartColumn(StartColumn), Prefix(Prefix),
> >        Postfix(Postfix) {
> >    assert(Tok.TokenText.startswith(Prefix) && Tok.TokenText.endswith(Postfix));
> >    Line = Tok.TokenText.substr(
> > @@ -161,13 +141,15 @@ BreakableSingleLineToken::BreakableSingl
> >  }
> >
> >  BreakableStringLiteral::BreakableStringLiteral(const FormatToken &Tok,
> > -                                               unsigned StartColumn)
> > -    : BreakableSingleLineToken(Tok, StartColumn, "\"", "\"") {}
> > +                                               unsigned StartColumn,
> > +                                               encoding::Encoding Encoding)
> > +    : BreakableSingleLineToken(Tok, StartColumn, "\"", "\"", Encoding) {}
> >
> >  BreakableToken::Split
> >  BreakableStringLiteral::getSplit(unsigned LineIndex, unsigned TailOffset,
> >                                   unsigned ColumnLimit) const {
> > -  return getStringSplit(Line.substr(TailOffset), StartColumn + 2, ColumnLimit);
> > +  return getStringSplit(Line.substr(TailOffset), StartColumn + 2, ColumnLimit,
> > +                        Encoding);
> >  }
> >
> >  static StringRef getLineCommentPrefix(StringRef Comment) {
> > @@ -179,23 +161,23 @@ static StringRef getLineCommentPrefix(St
> >  }
> >
> >  BreakableLineComment::BreakableLineComment(const FormatToken &Token,
> > -                                           unsigned StartColumn)
> > +                                           unsigned StartColumn,
> > +                                           encoding::Encoding Encoding)
> >      : BreakableSingleLineToken(Token, StartColumn,
> > -                               getLineCommentPrefix(Token.TokenText), "") {}
> > +                               getLineCommentPrefix(Token.TokenText), "",
> > +                               Encoding) {}
> >
> >  BreakableToken::Split
> >  BreakableLineComment::getSplit(unsigned LineIndex, unsigned TailOffset,
> >                                 unsigned ColumnLimit) const {
> >    return getCommentSplit(Line.substr(TailOffset), StartColumn + Prefix.size(),
> > -                         ColumnLimit);
> > +                         ColumnLimit, Encoding);
> >  }
> >
> > -BreakableBlockComment::BreakableBlockComment(const FormatStyle &Style,
> > -                                             const FormatToken &Token,
> > -                                             unsigned StartColumn,
> > -                                             unsigned OriginalStartColumn,
> > -                                             bool FirstInLine)
> > -    : BreakableToken(Token) {
> > +BreakableBlockComment::BreakableBlockComment(
> > +    const FormatStyle &Style, const FormatToken &Token, unsigned StartColumn,
> > +    unsigned OriginalStartColumn, bool FirstInLine, encoding::Encoding Encoding)
> > +    : BreakableToken(Token, Encoding) {
> >    StringRef TokenText(Token.TokenText);
> >    assert(TokenText.startswith("/*") && TokenText.endswith("*/"));
> >    TokenText.substr(2, TokenText.size() - 4).split(Lines, "\n");
> > @@ -290,7 +272,8 @@ unsigned
> >  BreakableBlockComment::getLineLengthAfterSplit(unsigned LineIndex,
> >                                                 unsigned TailOffset) const {
> >    return getContentStartColumn(LineIndex, TailOffset) +
> > -         (Lines[LineIndex].size() - TailOffset) +
> > +         encoding::getCodePointCount(Lines[LineIndex].substr(TailOffset),
> > +                                     Encoding) +
> >           // The last line gets a "*/" postfix.
> >           (LineIndex + 1 == Lines.size() ? 2 : 0);
> >  }
> > @@ -300,7 +283,7 @@ BreakableBlockComment::getSplit(unsigned
> >                                  unsigned ColumnLimit) const {
> >    return getCommentSplit(Lines[LineIndex].substr(TailOffset),
> >                           getContentStartColumn(LineIndex, TailOffset),
> > -                         ColumnLimit);
> > +                         ColumnLimit, Encoding);
> >  }
> >
> >  void BreakableBlockComment::insertBreak(unsigned LineIndex, unsigned TailOffset,
> >
> > Modified: cfe/trunk/lib/Format/BreakableToken.h
> > URL: http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/Format/BreakableToken.h?rev=183312&r1=183311&r2=183312&view=diff
> > ==============================================================================
> > --- cfe/trunk/lib/Format/BreakableToken.h (original)
> > +++ cfe/trunk/lib/Format/BreakableToken.h Wed Jun  5 09:09:10 2013
> > @@ -17,6 +17,7 @@
> >  #ifndef LLVM_CLANG_FORMAT_BREAKABLETOKEN_H
> >  #define LLVM_CLANG_FORMAT_BREAKABLETOKEN_H
> >
> > +#include "Encoding.h"
> >  #include "TokenAnnotator.h"
> >  #include "WhitespaceManager.h"
> >  #include <utility>
> > @@ -65,9 +66,11 @@ public:
> >                                         WhitespaceManager &Whitespaces) {}
> >
> >  protected:
> > -  BreakableToken(const FormatToken &Tok) : Tok(Tok) {}
> > +  BreakableToken(const FormatToken &Tok, encoding::Encoding Encoding)
> > +      : Tok(Tok), Encoding(Encoding) {}
> >
> >    const FormatToken &Tok;
> > +  encoding::Encoding Encoding;
> >  };
> >
> >  /// \brief Base class for single line tokens that can be broken.
> > @@ -83,7 +86,8 @@ public:
> >
> >  protected:
> >    BreakableSingleLineToken(const FormatToken &Tok, unsigned StartColumn,
> > -                           StringRef Prefix, StringRef Postfix);
> > +                           StringRef Prefix, StringRef Postfix,
> > +                           encoding::Encoding Encoding);
> >
> >    // The column in which the token starts.
> >    unsigned StartColumn;
> > @@ -101,7 +105,8 @@ public:
> >    ///
> >    /// \p StartColumn specifies the column in which the token will start
> >    /// after formatting.
> > -  BreakableStringLiteral(const FormatToken &Tok, unsigned StartColumn);
> > +  BreakableStringLiteral(const FormatToken &Tok, unsigned StartColumn,
> > +                         encoding::Encoding Encoding);
> >
> >    virtual Split getSplit(unsigned LineIndex, unsigned TailOffset,
> >                           unsigned ColumnLimit) const;
> > @@ -113,7 +118,8 @@ public:
> >    ///
> >    /// \p StartColumn specifies the column in which the comment will start
> >    /// after formatting.
> > -  BreakableLineComment(const FormatToken &Token, unsigned StartColumn);
> > +  BreakableLineComment(const FormatToken &Token, unsigned StartColumn,
> > +                       encoding::Encoding Encoding);
> >
> >    virtual Split getSplit(unsigned LineIndex, unsigned TailOffset,
> >                           unsigned ColumnLimit) const;
> > @@ -129,7 +135,7 @@ public:
> >    /// If the comment starts a line after formatting, set \p FirstInLine to true.
> >    BreakableBlockComment(const FormatStyle &Style, const FormatToken &Token,
> >                          unsigned StartColumn, unsigned OriginaStartColumn,
> > -                        bool FirstInLine);
> > +                        bool FirstInLine, encoding::Encoding Encoding);
> >
> >    virtual unsigned getLineCount() const;
> >    virtual unsigned getLineLengthAfterSplit(unsigned LineIndex,
> >
> > Added: cfe/trunk/lib/Format/Encoding.h
> > URL: http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/Format/Encoding.h?rev=183312&view=auto
> > ==============================================================================
> > --- cfe/trunk/lib/Format/Encoding.h (added)
> > +++ cfe/trunk/lib/Format/Encoding.h Wed Jun  5 09:09:10 2013
> > @@ -0,0 +1,114 @@
> > +//===--- Encoding.h - Format C++ code -------------------------------------===//
> > +//
> > +//                     The LLVM Compiler Infrastructure
> > +//
> > +// This file is distributed under the University of Illinois Open Source
> > +// License. See LICENSE.TXT for details.
> > +//
> > +//===----------------------------------------------------------------------===//
> > +///
> > +/// \file
> > +/// \brief Contains functions for text encoding manipulation. Supports UTF-8,
> > +/// 8-bit encodings and escape sequences in C++ string literals.
> > +///
> > +//===----------------------------------------------------------------------===//
> > +
> > +#ifndef LLVM_CLANG_FORMAT_ENCODING_H
> > +#define LLVM_CLANG_FORMAT_ENCODING_H
> > +
> > +#include "clang/Basic/LLVM.h"
> > +#include "llvm/Support/ConvertUTF.h"
> > +
> > +namespace clang {
> > +namespace format {
> > +namespace encoding {
> > +
> > +enum Encoding {
> > +  Encoding_UTF8,
> > +  Encoding_Unknown // We treat all other encodings as 8-bit encodings.
> > +};
> > +
> > +/// \brief Detects encoding of the Text. If the Text can be decoded using UTF-8,
> > +/// it is considered UTF8, otherwise we treat it as some 8-bit encoding.
> > +inline Encoding detectEncoding(StringRef Text) {
> > +  const UTF8 *Ptr = reinterpret_cast<const UTF8 *>(Text.begin());
> > +  const UTF8 *BufEnd = reinterpret_cast<const UTF8 *>(Text.end());
> > +  if (::isLegalUTF8String(&Ptr, BufEnd))
> > +    return Encoding_UTF8;
> > +  return Encoding_Unknown;
> > +}
> > +
> > +inline unsigned getCodePointCountUTF8(StringRef Text) {
> > +  unsigned CodePoints = 0;
> > +  for (size_t i = 0, e = Text.size(); i < e; i += getNumBytesForUTF8(Text[i])) {
> > +    ++CodePoints;
> > +  }
> > +  return CodePoints;
> > +}
> > +
> > +/// \brief Gets the number of code points in the Text using the specified
> > +/// Encoding.
> > +inline unsigned getCodePointCount(StringRef Text, Encoding Encoding) {
> > +  switch (Encoding) {
> > +    case Encoding_UTF8:
> > +      return getCodePointCountUTF8(Text);
> > +    default:
> > +      return Text.size();
> > +  }
> > +}
> > +
> > +/// \brief Gets the number of bytes in a sequence representing a single
> > +/// codepoint and starting with FirstChar in the specified Encoding.
> > +inline unsigned getCodePointNumBytes(char FirstChar, Encoding Encoding) {
> > +  switch (Encoding) {
> > +    case Encoding_UTF8:
> > +      return getNumBytesForUTF8(FirstChar);
> > +    default:
> > +      return 1;
> > +  }
> > +}
> > +
> > +inline bool isOctDigit(char c) {
> > +  return '0' <= c && c <= '7';
> > +}
> > +
> > +inline bool isHexDigit(char c) {
> > +  return ('0' <= c && c <= '9') || ('a' <= c && c <= 'f') ||
> > +         ('A' <= c && c <= 'F');
> > +}
> > +
> > +/// \brief Gets the length of an escape sequence inside a C++ string literal.
> > +/// Text should span from the beginning of the escape sequence (starting with a
> > +/// backslash) to the end of the string literal.
> > +inline unsigned getEscapeSequenceLength(StringRef Text) {
> > +  assert(Text[0] == '\\');
> > +  if (Text.size() < 2)
> > +    return 1;
> > +
> > +  switch (Text[1]) {
> > +  case 'u':
> > +    return 6;
> > +  case 'U':
> > +    return 10;
> > +  case 'x': {
> > +    unsigned I = 2; // Point after '\x'.
> > +    while (I < Text.size() && isHexDigit(Text[I]))
> > +      ++I;
> > +    return I;
> > +  }
> > +  default:
> > +    if (isOctDigit(Text[1])) {
> > +      unsigned I = 1;
> > +      while (I < Text.size() && I < 4 && isOctDigit(Text[I]))
> > +        ++I;
> > +      return I;
> > +    }
> > +    return 2;
> > +  }
> > +}
> > +
> > +} // namespace encoding
> > +} // namespace format
> > +} // namespace clang
> > +
> > +#endif // LLVM_CLANG_FORMAT_ENCODING_H
> >
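
Since the escape question came up: the getEscapeSequenceLength helper in Encoding.h above is what keeps a string split from landing inside an escape. Below is a standalone sketch of the same rules, free of LLVM types so it can be tried in isolation. The names are hypothetical, and unlike the patch it clamps the result to the remaining text instead of asserting on the leading backslash.

```cpp
#include <cstddef>
#include <string>

inline bool isOctalDigit(char C) { return '0' <= C && C <= '7'; }

inline bool isHexadecimalDigit(char C) {
  return ('0' <= C && C <= '9') || ('a' <= C && C <= 'f') ||
         ('A' <= C && C <= 'F');
}

// Hypothetical standalone variant: Text is expected to start with a
// backslash. Returns the length of the escape sequence in source
// characters, clamped to Text.size() (the clamp is not in the patch).
inline std::size_t escapeSequenceLength(const std::string &Text) {
  if (Text.size() < 2)
    return Text.size();

  std::size_t Length = 2; // Default: backslash plus one character.
  switch (Text[1]) {
  case 'u':
    Length = 6; // \uXXXX
    break;
  case 'U':
    Length = 10; // \UXXXXXXXX
    break;
  case 'x': // \x followed by any number of hex digits.
    while (Length < Text.size() && isHexadecimalDigit(Text[Length]))
      ++Length;
    break;
  default:
    if (isOctalDigit(Text[1])) { // Backslash plus up to three octal digits.
      Length = 1;
      while (Length < Text.size() && Length < 4 && isOctalDigit(Text[Length]))
        ++Length;
    }
    break;
  }
  return Length < Text.size() ? Length : Text.size();
}
```

Here the argument is the source spelling of the literal, so the C++ string "\\x41g" stands for the four source characters \x41 followed by g, and the function reports a length of 4 for it.
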
> > Modified: cfe/trunk/lib/Format/Format.cpp
> > URL: http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/Format/Format.cpp?rev=183312&r1=183311&r2=183312&view=diff
> > ==============================================================================
> > --- cfe/trunk/lib/Format/Format.cpp (original)
> > +++ cfe/trunk/lib/Format/Format.cpp Wed Jun  5 09:09:10 2013
> > @@ -243,10 +243,11 @@ public:
> >    UnwrappedLineFormatter(const FormatStyle &Style, SourceManager &SourceMgr,
> >                           const AnnotatedLine &Line, unsigned FirstIndent,
> >                           const FormatToken *RootToken,
> > -                         WhitespaceManager &Whitespaces)
> > +                         WhitespaceManager &Whitespaces,
> > +                         encoding::Encoding Encoding)
> >        : Style(Style), SourceMgr(SourceMgr), Line(Line),
> >          FirstIndent(FirstIndent), RootToken(RootToken),
> > -        Whitespaces(Whitespaces), Count(0) {}
> > +        Whitespaces(Whitespaces), Count(0), Encoding(Encoding) {}
> >
> >    /// \brief Formats an \c UnwrappedLine.
> >    void format(const AnnotatedLine *NextLine) {
> > @@ -484,7 +485,7 @@ private:
> >                                   State.NextToken->WhitespaceRange.getEnd()) -
> >                               SourceMgr.getSpellingColumnNumber(
> >                                   State.NextToken->WhitespaceRange.getBegin());
> > -      State.Column += WhitespaceLength + State.NextToken->TokenLength;
> > +      State.Column += WhitespaceLength + State.NextToken->CodePointCount;
> >        State.NextToken = State.NextToken->Next;
> >        return 0;
> >      }
> > @@ -520,11 +521,11 @@ private:
> >                    Line.StartsDefinition)) {
> >          State.Column = State.Stack.back().Indent;
> >        } else if (Current.Type == TT_ObjCSelectorName) {
> > -        if (State.Stack.back().ColonPos > Current.TokenLength) {
> > -          State.Column = State.Stack.back().ColonPos - Current.TokenLength;
> > +        if (State.Stack.back().ColonPos > Current.CodePointCount) {
> > +          State.Column = State.Stack.back().ColonPos - Current.CodePointCount;
> >          } else {
> >            State.Column = State.Stack.back().Indent;
> > -          State.Stack.back().ColonPos = State.Column + Current.TokenLength;
> > +          State.Stack.back().ColonPos = State.Column + Current.CodePointCount;
> >          }
> >        } else if (Current.Type == TT_StartOfName ||
> >                   Previous.isOneOf(tok::coloncolon, tok::equal) ||
> > @@ -560,7 +561,7 @@ private:
> >        State.Stack.back().LastSpace = State.Column;
> >        if (Current.isOneOf(tok::arrow, tok::period) &&
> >            Current.Type != TT_DesignatedInitializerPeriod)
> > -        State.Stack.back().LastSpace += Current.TokenLength;
> > +        State.Stack.back().LastSpace += Current.CodePointCount;
> >        State.StartOfLineLevel = State.ParenLevel;
> >        State.LowestCallLevel = State.ParenLevel;
> >
> > @@ -595,8 +596,8 @@ private:
> >          State.Stack.back().VariablePos = State.Column;
> >          // Move over * and & if they are bound to the variable name.
> >          const FormatToken *Tok = &Previous;
> > -        while (Tok && State.Stack.back().VariablePos >= Tok->TokenLength) {
> > -          State.Stack.back().VariablePos -= Tok->TokenLength;
> > +        while (Tok && State.Stack.back().VariablePos >= Tok->CodePointCount) {
> > +          State.Stack.back().VariablePos -= Tok->CodePointCount;
> >            if (Tok->SpacesRequiredBefore != 0)
> >              break;
> >            Tok = Tok->Previous;
> > @@ -614,12 +615,12 @@ private:
> >        if (Current.Type == TT_ObjCSelectorName &&
> >            State.Stack.back().ColonPos == 0) {
> >          if (State.Stack.back().Indent + Current.LongestObjCSelectorName >
> > -            State.Column + Spaces + Current.TokenLength)
> > +            State.Column + Spaces + Current.CodePointCount)
> >            State.Stack.back().ColonPos =
> >                State.Stack.back().Indent + Current.LongestObjCSelectorName;
> >          else
> >            State.Stack.back().ColonPos =
> > -              State.Column + Spaces + Current.TokenLength;
> > +              State.Column + Spaces + Current.CodePointCount;
> >        }
> >
> >        if (Previous.opensScope() && Previous.Type != TT_ObjCMethodExpr &&
> > @@ -671,7 +672,8 @@ private:
> >        State.LowestCallLevel = std::min(State.LowestCallLevel, State.ParenLevel);
> >        if (Line.Type == LT_BuilderTypeCall && State.ParenLevel == 0)
> >          State.Stack.back().StartOfFunctionCall =
> > -            Current.LastInChainOfCalls ? 0 : State.Column + Current.TokenLength;
> > +            Current.LastInChainOfCalls ? 0
> > +                                       : State.Column + Current.CodePointCount;
> >      }
> >      if (Current.Type == TT_CtorInitializerColon) {
> >        // Indent 2 from the column, so:
> > @@ -779,7 +781,7 @@ private:
> >        State.StartOfStringLiteral = 0;
> >      }
> >
> > -    State.Column += Current.TokenLength;
> > +    State.Column += Current.CodePointCount;
> >
> >      State.NextToken = State.NextToken->Next;
> >
> > @@ -798,7 +800,7 @@ private:
> >                                  bool DryRun) {
> >      unsigned UnbreakableTailLength = Current.UnbreakableTailLength;
> >      llvm::OwningPtr<BreakableToken> Token;
> > -    unsigned StartColumn = State.Column - Current.TokenLength;
> > +    unsigned StartColumn = State.Column - Current.CodePointCount;
> >      unsigned OriginalStartColumn =
> >          SourceMgr.getSpellingColumnNumber(Current.getStartOfNonWhitespace()) -
> >          1;
> > @@ -811,15 +813,16 @@ private:
> >        if (!LiteralData || *LiteralData != '"')
> >          return 0;
> >
> > -      Token.reset(new BreakableStringLiteral(Current, StartColumn));
> > +      Token.reset(new BreakableStringLiteral(Current, StartColumn, Encoding));
> >      } else if (Current.Type == TT_BlockComment) {
> >        BreakableBlockComment *BBC = new BreakableBlockComment(
> > -          Style, Current, StartColumn, OriginalStartColumn, !Current.Previous);
> > +          Style, Current, StartColumn, OriginalStartColumn, !Current.Previous,
> > +          Encoding);
> >        Token.reset(BBC);
> >      } else if (Current.Type == TT_LineComment &&
> >                 (Current.Previous == NULL ||
> >                  Current.Previous->Type != TT_ImplicitStringLiteral)) {
> > -      Token.reset(new BreakableLineComment(Current, StartColumn));
> > +      Token.reset(new BreakableLineComment(Current, StartColumn, Encoding));
> >      } else {
> >        return 0;
> >      }
> > @@ -837,27 +840,27 @@ private:
> >                                         Whitespaces);
> >        }
> >        unsigned TailOffset = 0;
> > -      unsigned RemainingTokenLength =
> > +      unsigned RemainingTokenColumns =
> >            Token->getLineLengthAfterSplit(LineIndex, TailOffset);
> > -      while (RemainingTokenLength > RemainingSpace) {
> > +      while (RemainingTokenColumns > RemainingSpace) {
> >          BreakableToken::Split Split =
> >              Token->getSplit(LineIndex, TailOffset, getColumnLimit());
> >          if (Split.first == StringRef::npos)
> >            break;
> >          assert(Split.first != 0);
> > -        unsigned NewRemainingTokenLength = Token->getLineLengthAfterSplit(
> > +        unsigned NewRemainingTokenColumns = Token->getLineLengthAfterSplit(
> >              LineIndex, TailOffset + Split.first + Split.second);
> > -        assert(NewRemainingTokenLength < RemainingTokenLength);
> > +        assert(NewRemainingTokenColumns < RemainingTokenColumns);
> >          if (!DryRun) {
> >            Token->insertBreak(LineIndex, TailOffset, Split, Line.InPPDirective,
> >                               Whitespaces);
> >          }
> >          TailOffset += Split.first + Split.second;
> > -        RemainingTokenLength = NewRemainingTokenLength;
> > +        RemainingTokenColumns = NewRemainingTokenColumns;
> >          Penalty += Style.PenaltyExcessCharacter;
> >          BreakInserted = true;
> >        }
> > -      PositionAfterLastLineInToken = RemainingTokenLength;
> > +      PositionAfterLastLineInToken = RemainingTokenColumns;
> >      }
> >
> >      if (BreakInserted) {
> > @@ -1080,13 +1083,16 @@ private:
> >    // Increasing count of \c StateNode items we have created. This is used
> >    // to create a deterministic order independent of the container.
> >    unsigned Count;
> > +  encoding::Encoding Encoding;
> >  };
> >
> >  class FormatTokenLexer {
> >  public:
> > -  FormatTokenLexer(Lexer &Lex, SourceManager &SourceMgr)
> > +  FormatTokenLexer(Lexer &Lex, SourceManager &SourceMgr,
> > +                   encoding::Encoding Encoding)
> >        : FormatTok(NULL), GreaterStashed(false), TrailingWhitespace(0), Lex(Lex),
> > -        SourceMgr(SourceMgr), IdentTable(Lex.getLangOpts()) {
> > +        SourceMgr(SourceMgr), IdentTable(Lex.getLangOpts()),
> > +        Encoding(Encoding) {
> >      Lex.SetKeepWhitespaceMode(true);
> >    }
> >
> > @@ -1111,7 +1117,8 @@ private:
> >            FormatTok->Tok.getLocation().getLocWithOffset(1);
> >        FormatTok->WhitespaceRange =
> >            SourceRange(GreaterLocation, GreaterLocation);
> > -      FormatTok->TokenLength = 1;
> > +      FormatTok->ByteCount = 1;
> > +      FormatTok->CodePointCount = 1;
> >        GreaterStashed = false;
> >        return FormatTok;
> >      }
> > @@ -1146,12 +1153,12 @@ private:
> >      }
> >
> >      // Now FormatTok is the next non-whitespace token.
> > -    FormatTok->TokenLength = Text.size();
> > +    FormatTok->ByteCount = Text.size();
> >
> >      TrailingWhitespace = 0;
> >      if (FormatTok->Tok.is(tok::comment)) {
> >        TrailingWhitespace = Text.size() - Text.rtrim().size();
> > -      FormatTok->TokenLength -= TrailingWhitespace;
> > +      FormatTok->ByteCount -= TrailingWhitespace;
> >      }
> >
> >      // In case the token starts with escaped newlines, we want to
> > @@ -1164,7 +1171,7 @@ private:
> >      while (i + 1 < Text.size() && Text[i] == '\\' && Text[i + 1] == '\n') {
> >        // FIXME: ++FormatTok->NewlinesBefore is missing...
> >        WhitespaceLength += 2;
> > -      FormatTok->TokenLength -= 2;
> > +      FormatTok->ByteCount -= 2;
> >        i += 2;
> >      }
> >
> > @@ -1176,15 +1183,19 @@ private:
> >
> >      if (FormatTok->Tok.is(tok::greatergreater)) {
> >        FormatTok->Tok.setKind(tok::greater);
> > -      FormatTok->TokenLength = 1;
> > +      FormatTok->ByteCount = 1;
> >        GreaterStashed = true;
> >      }
> >
> > +    unsigned EncodingExtraBytes =
> > +        Text.size() - encoding::getCodePointCount(Text, Encoding);
> > +    FormatTok->CodePointCount = FormatTok->ByteCount - EncodingExtraBytes;
> > +
> >      FormatTok->WhitespaceRange = SourceRange(
> >          WhitespaceStart, WhitespaceStart.getLocWithOffset(WhitespaceLength));
> >      FormatTok->TokenText = StringRef(
> >          SourceMgr.getCharacterData(FormatTok->getStartOfNonWhitespace()),
> > -        FormatTok->TokenLength);
> > +        FormatTok->ByteCount);
> >      return FormatTok;
> >    }
> >
> > @@ -1194,6 +1205,7 @@ private:
> >    Lexer &Lex;
> >    SourceManager &SourceMgr;
> >    IdentifierTable IdentTable;
> > +  encoding::Encoding Encoding;
> >    llvm::SpecificBumpPtrAllocator<FormatToken> Allocator;
> >    SmallVector<FormatToken *, 16> Tokens;
> >
> > @@ -1209,17 +1221,22 @@ public:
> >    Formatter(const FormatStyle &Style, Lexer &Lex, SourceManager &SourceMgr,
> >              const std::vector<CharSourceRange> &Ranges)
> >        : Style(Style), Lex(Lex), SourceMgr(SourceMgr),
> > -        Whitespaces(SourceMgr, Style), Ranges(Ranges) {}
> > +        Whitespaces(SourceMgr, Style), Ranges(Ranges),
> > +        Encoding(encoding::detectEncoding(Lex.getBuffer())) {
> > +    DEBUG(llvm::dbgs()
> > +          << "File encoding: "
> > +          << (Encoding == encoding::Encoding_UTF8 ? "UTF8" : "unknown")
> > +          << "\n");
> > +  }
> >
> >    virtual ~Formatter() {}
> >
> >    tooling::Replacements format() {
> > -    FormatTokenLexer Tokens(Lex, SourceMgr);
> > +    FormatTokenLexer Tokens(Lex, SourceMgr, Encoding);
> >
> >      UnwrappedLineParser Parser(Style, Tokens.lex(), *this);
> >      bool StructuralError = Parser.parse();
> > -    TokenAnnotator Annotator(Style, SourceMgr, Lex,
> > -                             Tokens.getIdentTable().get("in"));
> > +    TokenAnnotator Annotator(Style, Tokens.getIdentTable().get("in"));
> >      for (unsigned i = 0, e = AnnotatedLines.size(); i != e; ++i) {
> >        Annotator.annotate(AnnotatedLines[i]);
> >      }
> > @@ -1290,7 +1307,7 @@ public:
> >                1;
> >          }
> >          UnwrappedLineFormatter Formatter(Style, SourceMgr, TheLine, Indent,
> > -                                         TheLine.First, Whitespaces);
> > +                                         TheLine.First, Whitespaces, Encoding);
> >          Formatter.format(I + 1 != E ? &*(I + 1) : NULL);
> >          IndentForLevel[TheLine.Level] = LevelIndent;
> >          PreviousLineWasTouched = true;
> > @@ -1556,7 +1573,7 @@ private:
> >      CharSourceRange LineRange = CharSourceRange::getCharRange(
> >          First->WhitespaceRange.getBegin().getLocWithOffset(
> >              First->LastNewlineOffset),
> > -        Last->Tok.getLocation().getLocWithOffset(Last->TokenLength - 1));
> > +        Last->Tok.getLocation().getLocWithOffset(Last->ByteCount - 1));
> >      return touchesRanges(LineRange);
> >    }
> >
> > @@ -1616,6 +1633,8 @@ private:
> >    WhitespaceManager Whitespaces;
> >    std::vector<CharSourceRange> Ranges;
> >    std::vector<AnnotatedLine> AnnotatedLines;
> > +
> > +  encoding::Encoding Encoding;
> >  };
> >
> >  tooling::Replacements reformat(const FormatStyle &Style, Lexer &Lex,
> >
> > Modified: cfe/trunk/lib/Format/FormatToken.h
> > URL: http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/Format/FormatToken.h?rev=183312&r1=183311&r2=183312&view=diff
> > ==============================================================================
> > --- cfe/trunk/lib/Format/FormatToken.h (original)
> > +++ cfe/trunk/lib/Format/FormatToken.h Wed Jun  5 09:09:10 2013
> > @@ -61,11 +61,12 @@ enum TokenType {
> >  struct FormatToken {
> >    FormatToken()
> >        : NewlinesBefore(0), HasUnescapedNewline(false), LastNewlineOffset(0),
> > -        TokenLength(0), IsFirst(false), MustBreakBefore(false),
> > -        Type(TT_Unknown), SpacesRequiredBefore(0), CanBreakBefore(false),
> > -        ClosesTemplateDeclaration(false), ParameterCount(0), TotalLength(0),
> > -        UnbreakableTailLength(0), BindingStrength(0), SplitPenalty(0),
> > -        LongestObjCSelectorName(0), FakeRParens(0), LastInChainOfCalls(false),
> > +        ByteCount(0), CodePointCount(0), IsFirst(false),
> > +        MustBreakBefore(false), Type(TT_Unknown), SpacesRequiredBefore(0),
> > +        CanBreakBefore(false), ClosesTemplateDeclaration(false),
> > +        ParameterCount(0), TotalLength(0), UnbreakableTailLength(0),
> > +        BindingStrength(0), SplitPenalty(0), LongestObjCSelectorName(0),
> > +        FakeRParens(0), LastInChainOfCalls(false),
> >          PartOfMultiVariableDeclStmt(false), MatchingParen(NULL), Previous(NULL),
> >          Next(NULL) {}
> >
> > @@ -89,10 +90,14 @@ struct FormatToken {
> >    /// whitespace (relative to \c WhiteSpaceStart). 0 if there is no '\n'.
> >    unsigned LastNewlineOffset;
> >
> > -  /// \brief The length of the non-whitespace parts of the token. This is
> > -  /// necessary because we need to handle escaped newlines that are stored
> > +  /// \brief The number of bytes of the non-whitespace parts of the token. This
> > +  /// is necessary because we need to handle escaped newlines that are stored
> >    /// with the token.
> > -  unsigned TokenLength;
> > +  unsigned ByteCount;
> > +
> > +  /// \brief The length of the non-whitespace parts of the token in CodePoints.
> > +  /// We need this to correctly measure number of columns a token spans.
> > +  unsigned CodePointCount;
> >
> >    /// \brief Indicates that this is the first token.
> >    bool IsFirst;
> >
> > Modified: cfe/trunk/lib/Format/TokenAnnotator.cpp
> > URL: http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/Format/TokenAnnotator.cpp?rev=183312&r1=183311&r2=183312&view=diff
> > ==============================================================================
> > --- cfe/trunk/lib/Format/TokenAnnotator.cpp (original)
> > +++ cfe/trunk/lib/Format/TokenAnnotator.cpp Wed Jun  5 09:09:10 2013
> > @@ -15,7 +15,6 @@
> >
> >  #include "TokenAnnotator.h"
> >  #include "clang/Basic/SourceManager.h"
> > -#include "clang/Lex/Lexer.h"
> >  #include "llvm/Support/Debug.h"
> >
> >  namespace clang {
> > @@ -28,10 +27,9 @@ namespace format {
> >  /// into template parameter lists.
> >  class AnnotatingParser {
> >  public:
> > -  AnnotatingParser(SourceManager &SourceMgr, Lexer &Lex, AnnotatedLine &Line,
> > -                   IdentifierInfo &Ident_in)
> > -      : SourceMgr(SourceMgr), Lex(Lex), Line(Line), CurrentToken(Line.First),
> > -        KeywordVirtualFound(false), NameFound(false), Ident_in(Ident_in) {
> > +  AnnotatingParser(AnnotatedLine &Line, IdentifierInfo &Ident_in)
> > +      : Line(Line), CurrentToken(Line.First), KeywordVirtualFound(false),
> > +        NameFound(false), Ident_in(Ident_in) {
> >      Contexts.push_back(Context(tok::unknown, 1, /*IsExpression=*/ false));
> >    }
> >
> > @@ -295,9 +293,11 @@ private:
> >                   Line.First->Type == TT_ObjCMethodSpecifier) {
> >          Tok->Type = TT_ObjCMethodExpr;
> >          Tok->Previous->Type = TT_ObjCSelectorName;
> > -        if (Tok->Previous->TokenLength >
> > -            Contexts.back().LongestObjCSelectorName)
> > -          Contexts.back().LongestObjCSelectorName = Tok->Previous->TokenLength;
> > +        if (Tok->Previous->CodePointCount >
> > +            Contexts.back().LongestObjCSelectorName) {
> > +          Contexts.back().LongestObjCSelectorName =
> > +              Tok->Previous->CodePointCount;
> > +        }
> >          if (Contexts.back().FirstObjCSelectorName == NULL)
> >            Contexts.back().FirstObjCSelectorName = Tok->Previous;
> >        } else if (Contexts.back().ColonIsForRangeExpr) {
> > @@ -602,9 +602,7 @@ private:
> >        } else if (Current.isBinaryOperator()) {
> >          Current.Type = TT_BinaryOperator;
> >        } else if (Current.is(tok::comment)) {
> > -        std::string Data(
> > -            Lexer::getSpelling(Current.Tok, SourceMgr, Lex.getLangOpts()));
> > -        if (StringRef(Data).startswith("//"))
> > +        if (Current.TokenText.startswith("//"))
> >            Current.Type = TT_LineComment;
> >          else
> >            Current.Type = TT_BlockComment;
> > @@ -748,23 +746,19 @@ private:
> >      case tok::kw_wchar_t:
> >      case tok::kw_bool:
> >      case tok::kw___underlying_type:
> > -      return true;
> >      case tok::annot_typename:
> >      case tok::kw_char16_t:
> >      case tok::kw_char32_t:
> >      case tok::kw_typeof:
> >      case tok::kw_decltype:
> > -      return Lex.getLangOpts().CPlusPlus;
> > +      return true;
> >      default:
> > -      break;
> > +      return false;
> >      }
> > -    return false;
> >    }
> >
> >    SmallVector<Context, 8> Contexts;
> >
> > -  SourceManager &SourceMgr;
> > -  Lexer &Lex;
> >    AnnotatedLine &Line;
> >    FormatToken *CurrentToken;
> >    bool KeywordVirtualFound;
> > @@ -866,7 +860,7 @@ private:
> >  };
> >
> >  void TokenAnnotator::annotate(AnnotatedLine &Line) {
> > -  AnnotatingParser Parser(SourceMgr, Lex, Line, Ident_in);
> > +  AnnotatingParser Parser(Line, Ident_in);
> >    Line.Type = Parser.parseLine();
> >    if (Line.Type == LT_Invalid)
> >      return;
> > @@ -886,7 +880,7 @@ void TokenAnnotator::annotate(AnnotatedL
> >  }
> >
> >  void TokenAnnotator::calculateFormattingInformation(AnnotatedLine &Line) {
> > -  Line.First->TotalLength = Line.First->TokenLength;
> > +  Line.First->TotalLength = Line.First->CodePointCount;
> >    if (!Line.First->Next)
> >      return;
> >    FormatToken *Current = Line.First->Next;
> > @@ -920,7 +914,7 @@ void TokenAnnotator::calculateFormatting
> >        Current->TotalLength = Current->Previous->TotalLength + Style.ColumnLimit;
> >      else
> >        Current->TotalLength =
> > -          Current->Previous->TotalLength + Current->TokenLength +
> > +          Current->Previous->TotalLength + Current->CodePointCount +
> >            Current->SpacesRequiredBefore;
> >      // FIXME: Only calculate this if CanBreakBefore is true once static
> >      // initializers etc. are sorted out.
> > @@ -947,7 +941,7 @@ void TokenAnnotator::calculateUnbreakabl
> >        UnbreakableTailLength = 0;
> >      } else {
> >        UnbreakableTailLength +=
> > -          Current->TokenLength + Current->SpacesRequiredBefore;
> > +          Current->CodePointCount + Current->SpacesRequiredBefore;
> >      }
> >      Current = Current->Previous;
> >    }
> > @@ -1015,8 +1009,7 @@ unsigned TokenAnnotator::splitPenalty(co
> >
> >    if (Right.is(tok::lessless)) {
> >      if (Left.is(tok::string_literal)) {
> > -      StringRef Content =
> > -          StringRef(Left.Tok.getLiteralData(), Left.TokenLength);
> > +      StringRef Content = Left.TokenText;
> >        Content = Content.drop_back(1).drop_front(1).trim();
> >        if (Content.size() > 1 &&
> >            (Content.back() == ':' || Content.back() == '='))
> >
> > Modified: cfe/trunk/lib/Format/TokenAnnotator.h
> > URL: http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/Format/TokenAnnotator.h?rev=183312&r1=183311&r2=183312&view=diff
> > ==============================================================================
> > --- cfe/trunk/lib/Format/TokenAnnotator.h (original)
> > +++ cfe/trunk/lib/Format/TokenAnnotator.h Wed Jun  5 09:09:10 2013
> > @@ -21,7 +21,6 @@
> >  #include <string>
> >
> >  namespace clang {
> > -class Lexer;
> >  class SourceManager;
> >
> >  namespace format {
> > @@ -71,10 +70,8 @@ public:
> >  /// \c UnwrappedLine.
> >  class TokenAnnotator {
> >  public:
> > -  TokenAnnotator(const FormatStyle &Style, SourceManager &SourceMgr, Lexer &Lex,
> > -                 IdentifierInfo &Ident_in)
> > -      : Style(Style), SourceMgr(SourceMgr), Lex(Lex), Ident_in(Ident_in) {
> > -  }
> > +  TokenAnnotator(const FormatStyle &Style, IdentifierInfo &Ident_in)
> > +      : Style(Style), Ident_in(Ident_in) {}
> >
> >    void annotate(AnnotatedLine &Line);
> >    void calculateFormattingInformation(AnnotatedLine &Line);
> > @@ -95,8 +92,6 @@ private:
> >    void calculateUnbreakableTailLengths(AnnotatedLine &Line);
> >
> >    const FormatStyle &Style;
> > -  SourceManager &SourceMgr;
> > -  Lexer &Lex;
> >
> >    // Contextual keywords:
> >    IdentifierInfo &Ident_in;
> >
> > Modified: cfe/trunk/unittests/Format/FormatTest.cpp
> > URL: http://llvm.org/viewvc/llvm-project/cfe/trunk/unittests/Format/FormatTest.cpp?rev=183312&r1=183311&r2=183312&view=diff
> > ==============================================================================
> > --- cfe/trunk/unittests/Format/FormatTest.cpp (original)
> > +++ cfe/trunk/unittests/Format/FormatTest.cpp Wed Jun  5 09:09:10 2013
> > @@ -4873,5 +4873,80 @@ TEST_F(FormatTest, ConfigurationRoundTri
> >    EXPECT_EQ(Style, ParsedStyle);
> >  }
> >
> > +TEST_F(FormatTest, WorksFor8bitEncodings) {
> > +  EXPECT_EQ("\"\xce\xe4\xed\xe0\xe6\xe4\xfb \xe2 \"\n"
> > +            "\"\xf1\xf2\xf3\xe4\xb8\xed\xf3\xfe \"\n"
> > +            "\"\xe7\xe8\xec\xed\xfe\xfe \"\n"
> > +            "\"\xef\xee\xf0\xf3...\"",
> > +            format("\"\xce\xe4\xed\xe0\xe6\xe4\xfb \xe2 "
> > +                   "\xf1\xf2\xf3\xe4\xb8\xed\xf3\xfe \xe7\xe8\xec\xed\xfe\xfe "
> > +                   "\xef\xee\xf0\xf3...\"",
> > +                   getLLVMStyleWithColumns(12)));
> > +}
> > +
> > +TEST_F(FormatTest, CountsUTF8CharactersProperly) {
> > +  verifyFormat("\"Однажды в студёную зимнюю пору...\"",
> > +               getLLVMStyleWithColumns(35));
> > +  verifyFormat("\"一 二 三 四 五 六 七 八 九 十\"",
> > +               getLLVMStyleWithColumns(21));
> > +  verifyFormat("// Однажды в студёную зимнюю пору...",
> > +               getLLVMStyleWithColumns(36));
> > +  verifyFormat("// 一 二 三 四 五 六 七 八 九 十",
> > +               getLLVMStyleWithColumns(22));
> > +  verifyFormat("/* Однажды в студёную зимнюю пору... */",
> > +               getLLVMStyleWithColumns(39));
> > +  verifyFormat("/* 一 二 三 四 五 六 七 八 九 十 */",
> > +               getLLVMStyleWithColumns(25));
> > +}
> > +
> > +TEST_F(FormatTest, SplitsUTF8Strings) {
> > +  EXPECT_EQ(
> > +      "\"Однажды, в \"\n"
> > +      "\"студёную \"\n"
> > +      "\"зимнюю \"\n"
> > +      "\"пору,\"",
> > +      format("\"Однажды, в студёную зимнюю пору,\"",
> > +             getLLVMStyleWithColumns(13)));
> > +  EXPECT_EQ("\"一 二 三 四 \"\n"
> > +            "\"五 六 七 八 \"\n"
> > +            "\"九 十\"",
> > +            format("\"一 二 三 四 五 六 七 八 九 十\"",
> > +                   getLLVMStyleWithColumns(10)));
> > +}
> > +
> > +TEST_F(FormatTest, SplitsUTF8LineComments) {
> > +  EXPECT_EQ("// Я из лесу\n"
> > +            "// вышел; был\n"
> > +            "// сильный\n"
> > +            "// мороз.",
> > +            format("// Я из лесу вышел; был сильный мороз.",
> > +                   getLLVMStyleWithColumns(13)));
> > +  EXPECT_EQ("// 一二三\n"
> > +            "// 四五六七\n"
> > +            "// 八\n"
> > +            "// 九 十",
> > +            format("// 一二三 四五六七 八  九 十", getLLVMStyleWithColumns(6)));
> > +}
> > +
> > +TEST_F(FormatTest, SplitsUTF8BlockComments) {
> > +  EXPECT_EQ("/* Гляжу,\n"
> > +            " * поднимается\n"
> > +            " * медленно в\n"
> > +            " * гору\n"
> > +            " * Лошадка,\n"
> > +            " * везущая\n"
> > +            " * хворосту\n"
> > +            " * воз. */",
> > +            format("/* Гляжу, поднимается медленно в гору\n"
> > +                   " * Лошадка, везущая хворосту воз. */",
> > +                   getLLVMStyleWithColumns(13)));
> > +  EXPECT_EQ("/* 一二三\n"
> > +            " * 四五六七\n"
> > +            " * 八\n"
> > +            " * 九 十\n"
> > +            " */",
> > +            format("/* 一二三 四五六七 八  九 十 */", getLLVMStyleWithColumns(6)));
> > +}
> > +
> >  } // end namespace tooling
> >  } // end namespace clang
> >
> >
> > _______________________________________________
> > cfe-commits mailing list
> > cfe-commits at cs.uiuc.edu
> > http://lists.cs.uiuc.edu/mailman/listinfo/cfe-commits
> 
> 
> -- 
> Alexander Kornienko | Software Engineer | alexfh at google.com | +49 151 221 77 957
> Google Germany GmbH | Dienerstr. 12 | 80331 München

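For readers skimming the quoted patch: the key change is that FormatToken now carries both ByteCount and CodePointCount, and the lexer derives the latter by subtracting the "extra" bytes of multi-byte sequences from the byte count. The body of encoding::getCodePointCount is not part of the quoted diff, so the following is only a minimal sketch of the idea, assuming well-formed UTF-8 input. The names countCodePoints and codePointCountFromBytes are hypothetical and are not clang-format APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not the clang-format implementation): counts the code
// points in Text, assuming Text is well-formed UTF-8.
std::size_t countCodePoints(const std::string &Text) {
  std::size_t Count = 0;
  for (unsigned char C : Text) {
    // Continuation bytes look like 10xxxxxx; any other byte starts a new
    // code point (ASCII or the lead byte of a multi-byte sequence).
    if ((C & 0xC0) != 0x80)
      ++Count;
  }
  return Count;
}

// Mirrors the arithmetic in the quoted lexer hunk: the code point count is
// the (possibly trimmed) byte count minus the extra bytes in Text.
// Assumption: any bytes already trimmed from ByteCount were single-byte.
std::size_t codePointCountFromBytes(const std::string &Text,
                                    std::size_t ByteCount) {
  std::size_t ExtraBytes = Text.size() - countCodePoints(Text);
  return ByteCount - ExtraBytes;
}
```

Note that a code point count is still only an approximation of display width: East Asian wide characters usually occupy two terminal columns, and combining marks occupy none.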