<div dir="ltr"><div>On Fri, Jun 7, 2013 at 6:52 AM, Alexander Kornienko <span dir="ltr"><<a href="mailto:alexfh@google.com" target="_blank">alexfh@google.com</a>></span> wrote:<br></div><div class="gmail_extra"><div class="gmail_quote">

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="im">On Fri, Jun 7, 2013 at 6:46 AM, Nico Weber <span dir="ltr"><<a href="mailto:thakis@chromium.org" target="_blank">thakis@chromium.org</a>></span> wrote:<br>

</div><div class="gmail_extra"><div class="gmail_quote"><div class="im">
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div dir="ltr"><div>On Thu, Jun 6, 2013 at 4:49 PM, Alexander Kornienko <span dir="ltr"><<a href="mailto:alexfh@google.com" target="_blank">alexfh@google.com</a>></span> wrote:<br>


</div><div class="gmail_extra"><div class="gmail_quote"><div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div dir="ltr"><div>On Thu, Jun 6, 2013 at 1:11 AM, NAKAMURA Takumi <span dir="ltr"><<a href="mailto:geek4civic@gmail.com" target="_blank">geek4civic@gmail.com</a>></span> wrote:<br>



</div><div class="gmail_extra"><div class="gmail_quote"><div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">I wonder whether the source file can contain raw UTF-8 characters.<br>




</blockquote><div><br></div></div><div>It's implementation-defined behavior. Apparently GCC and Clang handle this correctly.</div><div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">





In fact, without a BOM, MS cl.exe misdetects the charset: it uses the system<br>
code page (932) rather than the current code page (65001).<br></blockquote><div> </div></div><div>It seems that adding a UTF-8 BOM is the only way to force MSVC to treat a source file as UTF-8. But this is not supported by GCC and Clang, AFAIK.</div>



</div></div></div></blockquote><div><br></div></div><div>clang's Lexer::InitLexer() skips BOMs.</div></div></div></div></blockquote><div><br></div></div><div>Sounds interesting. And <a href="http://stackoverflow.com/questions/7899795/is-it-possible-to-get-gcc-to-compile-utf-8-with-bom-source-files" target="_blank">here</a> they say that GCC also supports this. I've checked with Clang trunk and GCC 4.6.3, and it works. So is there any reason not to just add a UTF-8 BOM?</div>

</div></div></div></blockquote><div><br></div><div style>The Unicode Standard 6.0 Core Spec says that use of a BOM is neither required nor recommended for UTF-8 (p.30). Treating the BOM as magic bytes indicating that the file is UTF-8 encoded seems too Microsoft-specific. So I guess adding a BOM unconditionally may not be a good idea.</div>
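<div><br></div><div>For reference, the "magic bytes" in question are the three-byte UTF-8 encoding of U+FEFF: EF BB BF. Below is a minimal sketch of how a tool might detect and skip them. The helper names (hasUtf8Bom, stripUtf8Bom, checkBomHelpers) are hypothetical and are not taken from clang, GCC, or clang-format; this only illustrates the idea and is not how any of those tools actually implement it.</div>

```cpp
#include <cassert>
#include <string>

// Hypothetical helpers for illustration; not part of clang or clang-format.

// Returns true if Text begins with the three-byte UTF-8 encoding of U+FEFF.
inline bool hasUtf8Bom(const std::string &Text) {
  return Text.size() >= 3 &&
         static_cast<unsigned char>(Text[0]) == 0xEF &&
         static_cast<unsigned char>(Text[1]) == 0xBB &&
         static_cast<unsigned char>(Text[2]) == 0xBF;
}

// Returns Text without a leading BOM; input without a BOM is returned as is.
inline std::string stripUtf8Bom(const std::string &Text) {
  return hasUtf8Bom(Text) ? Text.substr(3) : Text;
}

// Small self-check showing the intended behavior.
inline void checkBomHelpers() {
  // The literal is split so the hex escape cannot absorb following characters.
  const std::string WithBom = std::string("\xEF\xBB\xBF") + "int x;";
  assert(hasUtf8Bom(WithBom));
  assert(stripUtf8Bom(WithBom) == "int x;");
  assert(!hasUtf8Bom("int x;"));
}
```

<div>Note that the absence of a BOM says nothing about the encoding, which is why a content-based check (such as the detectEncoding() function in the quoted commit) is still needed for files without one.</div>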

<div style><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><div class="h5"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">
<div><div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">
<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><div>
<div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">
Could you get rid of raw utf8 characters and encode them in literals?<br>
FYI, I can see Cyrillic and CJK :)<br></blockquote></div><div><div><br></div><div>There's a plan to make some of our tests file-based instead of unit tests. I think the UTF-8 tests are the first candidates for this. As UTF-8 support is not the most important thing for Windows builds of clang-format, I'd just leave the new tests #ifdefed out for now. BTW, thanks for doing this.</div>




</div><div><div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">
<br>
...Takumi<br>
<br>
2013/6/5 Alexander Kornienko <<a href="mailto:alexfh@google.com" target="_blank">alexfh@google.com</a>>:<br>
<div><div>> Author: alexfh<br>
> Date: Wed Jun  5 09:09:10 2013<br>
> New Revision: 183312<br>
><br>
> URL: <a href="http://llvm.org/viewvc/llvm-project?rev=183312&view=rev" target="_blank">http://llvm.org/viewvc/llvm-project?rev=183312&view=rev</a><br>
> Log:<br>
> UTF-8 support for clang-format.<br>
><br>
> Summary:<br>
> Detect if the file is valid UTF-8, and if this is the case, count code<br>
> points instead of just using number of bytes in all (hopefully) places, where<br>
> number of columns is needed. In particular, use the new<br>
> FormatToken.CodePointCount instead of TokenLength where appropriate.<br>
> Changed BreakableToken implementations to respect utf-8 character boundaries<br>
> when in utf-8 mode.<br>
><br>
> Reviewers: klimek, djasper<br>
><br>
> Reviewed By: djasper<br>
><br>
> CC: cfe-commits, rsmith, gribozavr<br>
><br>
> Differential Revision: <a href="http://llvm-reviews.chandlerc.com/D918" target="_blank">http://llvm-reviews.chandlerc.com/D918</a><br>
><br>
> Added:<br>
>     cfe/trunk/lib/Format/Encoding.h<br>
> Modified:<br>
>     cfe/trunk/lib/Format/BreakableToken.cpp<br>
>     cfe/trunk/lib/Format/BreakableToken.h<br>
>     cfe/trunk/lib/Format/Format.cpp<br>
>     cfe/trunk/lib/Format/FormatToken.h<br>
>     cfe/trunk/lib/Format/TokenAnnotator.cpp<br>
>     cfe/trunk/lib/Format/TokenAnnotator.h<br>
>     cfe/trunk/unittests/Format/FormatTest.cpp<br>
><br>
> Modified: cfe/trunk/lib/Format/BreakableToken.cpp<br>
> URL: <a href="http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/Format/BreakableToken.cpp?rev=183312&r1=183311&r2=183312&view=diff" target="_blank">http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/Format/BreakableToken.cpp?rev=183312&r1=183311&r2=183312&view=diff</a><br>





> ==============================================================================<br>
> --- cfe/trunk/lib/Format/BreakableToken.cpp (original)<br>
> +++ cfe/trunk/lib/Format/BreakableToken.cpp Wed Jun  5 09:09:10 2013<br>
> @@ -25,66 +25,22 @@ namespace clang {<br>
>  namespace format {<br>
>  namespace {<br>
><br>
> -// FIXME: Move helper string functions to where it makes sense.<br>
> -<br>
> -unsigned getOctalLength(StringRef Text) {<br>
> -  unsigned I = 1;<br>
> -  while (I < Text.size() && I < 4 && (Text[I] >= '0' && Text[I] <= '7')) {<br>
> -    ++I;<br>
> -  }<br>
> -  return I;<br>
> -}<br>
> -<br>
> -unsigned getHexLength(StringRef Text) {<br>
> -  unsigned I = 2; // Point after '\x'.<br>
> -  while (I < Text.size() && ((Text[I] >= '0' && Text[I] <= '9') ||<br>
> -                             (Text[I] >= 'a' && Text[I] <= 'f') ||<br>
> -                             (Text[I] >= 'A' && Text[I] <= 'F'))) {<br>
> -    ++I;<br>
> -  }<br>
> -  return I;<br>
> -}<br>
> -<br>
> -unsigned getEscapeSequenceLength(StringRef Text) {<br>
> -  assert(Text[0] == '\\');<br>
> -  if (Text.size() < 2)<br>
> -    return 1;<br>
> -<br>
> -  switch (Text[1]) {<br>
> -  case 'u':<br>
> -    return 6;<br>
> -  case 'U':<br>
> -    return 10;<br>
> -  case 'x':<br>
> -    return getHexLength(Text);<br>
> -  default:<br>
> -    if (Text[1] >= '0' && Text[1] <= '7')<br>
> -      return getOctalLength(Text);<br>
> -    return 2;<br>
> -  }<br>
> -}<br>
> -<br>
> -StringRef::size_type getStartOfCharacter(StringRef Text,<br>
> -                                         StringRef::size_type Offset) {<br>
> -  StringRef::size_type NextEscape = Text.find('\\');<br>
> -  while (NextEscape != StringRef::npos && NextEscape < Offset) {<br>
> -    StringRef::size_type SequenceLength =<br>
> -        getEscapeSequenceLength(Text.substr(NextEscape));<br>
> -    if (Offset < NextEscape + SequenceLength)<br>
> -      return NextEscape;<br>
> -    NextEscape = Text.find('\\', NextEscape + SequenceLength);<br>
> -  }<br>
> -  return Offset;<br>
> -}<br>
> -<br>
>  BreakableToken::Split getCommentSplit(StringRef Text,<br>
>                                        unsigned ContentStartColumn,<br>
> -                                      unsigned ColumnLimit) {<br>
> +                                      unsigned ColumnLimit,<br>
> +                                      encoding::Encoding Encoding) {<br>
>    if (ColumnLimit <= ContentStartColumn + 1)<br>
>      return BreakableToken::Split(StringRef::npos, 0);<br>
><br>
>    unsigned MaxSplit = ColumnLimit - ContentStartColumn + 1;<br>
> -  StringRef::size_type SpaceOffset = Text.rfind(' ', MaxSplit);<br>
> +  unsigned MaxSplitBytes = 0;<br>
> +<br>
> +  for (unsigned NumChars = 0;<br>
> +       NumChars < MaxSplit && MaxSplitBytes < Text.size(); ++NumChars)<br>
> +    MaxSplitBytes +=<br>
> +        encoding::getCodePointNumBytes(Text[MaxSplitBytes], Encoding);<br>
> +<br>
> +  StringRef::size_type SpaceOffset = Text.rfind(' ', MaxSplitBytes);<br>
>    if (SpaceOffset == StringRef::npos ||<br>
>        // Don't break at leading whitespace.<br>
>        Text.find_last_not_of(' ', SpaceOffset) == StringRef::npos) {<br>
> @@ -95,7 +51,7 @@ BreakableToken::Split getCommentSplit(St<br>
>        // If the comment is only whitespace, we cannot split.<br>
>        return BreakableToken::Split(StringRef::npos, 0);<br>
>      SpaceOffset =<br>
> -        Text.find(' ', std::max<unsigned>(MaxSplit, FirstNonWhitespace));<br>
> +        Text.find(' ', std::max<unsigned>(MaxSplitBytes, FirstNonWhitespace));<br>
>    }<br>
>    if (SpaceOffset != StringRef::npos && SpaceOffset != 0) {<br>
>      StringRef BeforeCut = Text.substr(0, SpaceOffset).rtrim();<br>
> @@ -108,25 +64,48 @@ BreakableToken::Split getCommentSplit(St<br>
><br>
>  BreakableToken::Split getStringSplit(StringRef Text,<br>
>                                       unsigned ContentStartColumn,<br>
> -                                     unsigned ColumnLimit) {<br>
> -<br>
> -  if (ColumnLimit <= ContentStartColumn)<br>
> -    return BreakableToken::Split(StringRef::npos, 0);<br>
> -  unsigned MaxSplit = ColumnLimit - ContentStartColumn;<br>
> +                                     unsigned ColumnLimit,<br>
> +                                     encoding::Encoding Encoding) {<br>
>    // FIXME: Reduce unit test case.<br>
>    if (Text.empty())<br>
>      return BreakableToken::Split(StringRef::npos, 0);<br>
> -  MaxSplit = std::min<unsigned>(MaxSplit, Text.size() - 1);<br>
> -  StringRef::size_type SpaceOffset = Text.rfind(' ', MaxSplit);<br>
> -  if (SpaceOffset != StringRef::npos && SpaceOffset != 0)<br>
> +  if (ColumnLimit <= ContentStartColumn)<br>
> +    return BreakableToken::Split(StringRef::npos, 0);<br>
> +  unsigned MaxSplit =<br>
> +      std::min<unsigned>(ColumnLimit - ContentStartColumn,<br>
> +                         encoding::getCodePointCount(Text, Encoding) - 1);<br>
> +  StringRef::size_type SpaceOffset = 0;<br>
> +  StringRef::size_type SlashOffset = 0;<br>
> +  StringRef::size_type SplitPoint = 0;<br>
> +  for (unsigned Chars = 0;;) {<br>
> +    unsigned Advance;<br>
> +    if (Text[0] == '\\') {<br>
> +      Advance = encoding::getEscapeSequenceLength(Text);<br>
> +      Chars += Advance;<br>
> +    } else {<br>
> +      Advance = encoding::getCodePointNumBytes(Text[0], Encoding);<br>
> +      Chars += 1;<br>
> +    }<br>
> +<br>
> +    if (Chars > MaxSplit)<br>
> +      break;<br>
> +<br>
> +    if (Text[0] == ' ')<br>
> +      SpaceOffset = SplitPoint;<br>
> +    if (Text[0] == '/')<br>
> +      SlashOffset = SplitPoint;<br>
> +<br>
> +    SplitPoint += Advance;<br>
> +    Text = Text.substr(Advance);<br>
> +  }<br>
> +<br>
> +  if (SpaceOffset != 0)<br>
>      return BreakableToken::Split(SpaceOffset + 1, 0);<br>
> -  StringRef::size_type SlashOffset = Text.rfind('/', MaxSplit);<br>
> -  if (SlashOffset != StringRef::npos && SlashOffset != 0)<br>
> +  if (SlashOffset != 0)<br>
>      return BreakableToken::Split(SlashOffset + 1, 0);<br>
> -  StringRef::size_type SplitPoint = getStartOfCharacter(Text, MaxSplit);<br>
> -  if (SplitPoint == StringRef::npos || SplitPoint == 0)<br>
> -    return BreakableToken::Split(StringRef::npos, 0);<br>
> -  return BreakableToken::Split(SplitPoint, 0);<br>
> +  if (SplitPoint != 0)<br>
> +    return BreakableToken::Split(SplitPoint, 0);<br>
> +  return BreakableToken::Split(StringRef::npos, 0);<br>
>  }<br>
><br>
>  } // namespace<br>
> @@ -136,8 +115,8 @@ unsigned BreakableSingleLineToken::getLi<br>
>  unsigned<br>
>  BreakableSingleLineToken::getLineLengthAfterSplit(unsigned LineIndex,<br>
>                                                    unsigned TailOffset) const {<br>
> -  return StartColumn + Prefix.size() + Postfix.size() + Line.size() -<br>
> -         TailOffset;<br>
> +  return StartColumn + Prefix.size() + Postfix.size() +<br>
> +         encoding::getCodePointCount(Line.substr(TailOffset), Encoding);<br>
>  }<br>
><br>
>  void BreakableSingleLineToken::insertBreak(unsigned LineIndex,<br>
> @@ -152,8 +131,9 @@ void BreakableSingleLineToken::insertBre<br>
>  BreakableSingleLineToken::BreakableSingleLineToken(const FormatToken &Tok,<br>
>                                                     unsigned StartColumn,<br>
>                                                     StringRef Prefix,<br>
> -                                                   StringRef Postfix)<br>
> -    : BreakableToken(Tok), StartColumn(StartColumn), Prefix(Prefix),<br>
> +                                                   StringRef Postfix,<br>
> +                                                   encoding::Encoding Encoding)<br>
> +    : BreakableToken(Tok, Encoding), StartColumn(StartColumn), Prefix(Prefix),<br>
>        Postfix(Postfix) {<br>
>    assert(Tok.TokenText.startswith(Prefix) && Tok.TokenText.endswith(Postfix));<br>
>    Line = Tok.TokenText.substr(<br>
> @@ -161,13 +141,15 @@ BreakableSingleLineToken::BreakableSingl<br>
>  }<br>
><br>
>  BreakableStringLiteral::BreakableStringLiteral(const FormatToken &Tok,<br>
> -                                               unsigned StartColumn)<br>
> -    : BreakableSingleLineToken(Tok, StartColumn, "\"", "\"") {}<br>
> +                                               unsigned StartColumn,<br>
> +                                               encoding::Encoding Encoding)<br>
> +    : BreakableSingleLineToken(Tok, StartColumn, "\"", "\"", Encoding) {}<br>
><br>
>  BreakableToken::Split<br>
>  BreakableStringLiteral::getSplit(unsigned LineIndex, unsigned TailOffset,<br>
>                                   unsigned ColumnLimit) const {<br>
> -  return getStringSplit(Line.substr(TailOffset), StartColumn + 2, ColumnLimit);<br>
> +  return getStringSplit(Line.substr(TailOffset), StartColumn + 2, ColumnLimit,<br>
> +                        Encoding);<br>
>  }<br>
><br>
>  static StringRef getLineCommentPrefix(StringRef Comment) {<br>
> @@ -179,23 +161,23 @@ static StringRef getLineCommentPrefix(St<br>
>  }<br>
><br>
>  BreakableLineComment::BreakableLineComment(const FormatToken &Token,<br>
> -                                           unsigned StartColumn)<br>
> +                                           unsigned StartColumn,<br>
> +                                           encoding::Encoding Encoding)<br>
>      : BreakableSingleLineToken(Token, StartColumn,<br>
> -                               getLineCommentPrefix(Token.TokenText), "") {}<br>
> +                               getLineCommentPrefix(Token.TokenText), "",<br>
> +                               Encoding) {}<br>
><br>
>  BreakableToken::Split<br>
>  BreakableLineComment::getSplit(unsigned LineIndex, unsigned TailOffset,<br>
>                                 unsigned ColumnLimit) const {<br>
>    return getCommentSplit(Line.substr(TailOffset), StartColumn + Prefix.size(),<br>
> -                         ColumnLimit);<br>
> +                         ColumnLimit, Encoding);<br>
>  }<br>
><br>
> -BreakableBlockComment::BreakableBlockComment(const FormatStyle &Style,<br>
> -                                             const FormatToken &Token,<br>
> -                                             unsigned StartColumn,<br>
> -                                             unsigned OriginalStartColumn,<br>
> -                                             bool FirstInLine)<br>
> -    : BreakableToken(Token) {<br>
> +BreakableBlockComment::BreakableBlockComment(<br>
> +    const FormatStyle &Style, const FormatToken &Token, unsigned StartColumn,<br>
> +    unsigned OriginalStartColumn, bool FirstInLine, encoding::Encoding Encoding)<br>
> +    : BreakableToken(Token, Encoding) {<br>
>    StringRef TokenText(Token.TokenText);<br>
>    assert(TokenText.startswith("/*") && TokenText.endswith("*/"));<br>
>    TokenText.substr(2, TokenText.size() - 4).split(Lines, "\n");<br>
> @@ -290,7 +272,8 @@ unsigned<br>
>  BreakableBlockComment::getLineLengthAfterSplit(unsigned LineIndex,<br>
>                                                 unsigned TailOffset) const {<br>
>    return getContentStartColumn(LineIndex, TailOffset) +<br>
> -         (Lines[LineIndex].size() - TailOffset) +<br>
> +         encoding::getCodePointCount(Lines[LineIndex].substr(TailOffset),<br>
> +                                     Encoding) +<br>
>           // The last line gets a "*/" postfix.<br>
>           (LineIndex + 1 == Lines.size() ? 2 : 0);<br>
>  }<br>
> @@ -300,7 +283,7 @@ BreakableBlockComment::getSplit(unsigned<br>
>                                  unsigned ColumnLimit) const {<br>
>    return getCommentSplit(Lines[LineIndex].substr(TailOffset),<br>
>                           getContentStartColumn(LineIndex, TailOffset),<br>
> -                         ColumnLimit);<br>
> +                         ColumnLimit, Encoding);<br>
>  }<br>
><br>
>  void BreakableBlockComment::insertBreak(unsigned LineIndex, unsigned TailOffset,<br>
><br>
> Modified: cfe/trunk/lib/Format/BreakableToken.h<br>
> URL: <a href="http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/Format/BreakableToken.h?rev=183312&r1=183311&r2=183312&view=diff" target="_blank">http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/Format/BreakableToken.h?rev=183312&r1=183311&r2=183312&view=diff</a><br>





> ==============================================================================<br>
> --- cfe/trunk/lib/Format/BreakableToken.h (original)<br>
> +++ cfe/trunk/lib/Format/BreakableToken.h Wed Jun  5 09:09:10 2013<br>
> @@ -17,6 +17,7 @@<br>
>  #ifndef LLVM_CLANG_FORMAT_BREAKABLETOKEN_H<br>
>  #define LLVM_CLANG_FORMAT_BREAKABLETOKEN_H<br>
><br>
> +#include "Encoding.h"<br>
>  #include "TokenAnnotator.h"<br>
>  #include "WhitespaceManager.h"<br>
>  #include <utility><br>
> @@ -65,9 +66,11 @@ public:<br>
>                                         WhitespaceManager &Whitespaces) {}<br>
><br>
>  protected:<br>
> -  BreakableToken(const FormatToken &Tok) : Tok(Tok) {}<br>
> +  BreakableToken(const FormatToken &Tok, encoding::Encoding Encoding)<br>
> +      : Tok(Tok), Encoding(Encoding) {}<br>
><br>
>    const FormatToken &Tok;<br>
> +  encoding::Encoding Encoding;<br>
>  };<br>
><br>
>  /// \brief Base class for single line tokens that can be broken.<br>
> @@ -83,7 +86,8 @@ public:<br>
><br>
>  protected:<br>
>    BreakableSingleLineToken(const FormatToken &Tok, unsigned StartColumn,<br>
> -                           StringRef Prefix, StringRef Postfix);<br>
> +                           StringRef Prefix, StringRef Postfix,<br>
> +                           encoding::Encoding Encoding);<br>
><br>
>    // The column in which the token starts.<br>
>    unsigned StartColumn;<br>
> @@ -101,7 +105,8 @@ public:<br>
>    ///<br>
>    /// \p StartColumn specifies the column in which the token will start<br>
>    /// after formatting.<br>
> -  BreakableStringLiteral(const FormatToken &Tok, unsigned StartColumn);<br>
> +  BreakableStringLiteral(const FormatToken &Tok, unsigned StartColumn,<br>
> +                         encoding::Encoding Encoding);<br>
><br>
>    virtual Split getSplit(unsigned LineIndex, unsigned TailOffset,<br>
>                           unsigned ColumnLimit) const;<br>
> @@ -113,7 +118,8 @@ public:<br>
>    ///<br>
>    /// \p StartColumn specifies the column in which the comment will start<br>
>    /// after formatting.<br>
> -  BreakableLineComment(const FormatToken &Token, unsigned StartColumn);<br>
> +  BreakableLineComment(const FormatToken &Token, unsigned StartColumn,<br>
> +                       encoding::Encoding Encoding);<br>
><br>
>    virtual Split getSplit(unsigned LineIndex, unsigned TailOffset,<br>
>                           unsigned ColumnLimit) const;<br>
> @@ -129,7 +135,7 @@ public:<br>
>    /// If the comment starts a line after formatting, set \p FirstInLine to true.<br>
>    BreakableBlockComment(const FormatStyle &Style, const FormatToken &Token,<br>
>                          unsigned StartColumn, unsigned OriginaStartColumn,<br>
> -                        bool FirstInLine);<br>
> +                        bool FirstInLine, encoding::Encoding Encoding);<br>
><br>
>    virtual unsigned getLineCount() const;<br>
>    virtual unsigned getLineLengthAfterSplit(unsigned LineIndex,<br>
><br>
> Added: cfe/trunk/lib/Format/Encoding.h<br>
> URL: <a href="http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/Format/Encoding.h?rev=183312&view=auto" target="_blank">http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/Format/Encoding.h?rev=183312&view=auto</a><br>





> ==============================================================================<br>
> --- cfe/trunk/lib/Format/Encoding.h (added)<br>
> +++ cfe/trunk/lib/Format/Encoding.h Wed Jun  5 09:09:10 2013<br>
> @@ -0,0 +1,114 @@<br>
> +//===--- Encoding.h - Format C++ code -------------------------------------===//<br>
> +//<br>
> +//                     The LLVM Compiler Infrastructure<br>
> +//<br>
> +// This file is distributed under the University of Illinois Open Source<br>
> +// License. See LICENSE.TXT for details.<br>
> +//<br>
> +//===----------------------------------------------------------------------===//<br>
> +///<br>
> +/// \file<br>
> +/// \brief Contains functions for text encoding manipulation. Supports UTF-8,<br>
> +/// 8-bit encodings and escape sequences in C++ string literals.<br>
> +///<br>
> +//===----------------------------------------------------------------------===//<br>
> +<br>
> +#ifndef LLVM_CLANG_FORMAT_ENCODING_H<br>
> +#define LLVM_CLANG_FORMAT_ENCODING_H<br>
> +<br>
> +#include "clang/Basic/LLVM.h"<br>
> +#include "llvm/Support/ConvertUTF.h"<br>
> +<br>
> +namespace clang {<br>
> +namespace format {<br>
> +namespace encoding {<br>
> +<br>
> +enum Encoding {<br>
> +  Encoding_UTF8,<br>
> +  Encoding_Unknown // We treat all other encodings as 8-bit encodings.<br>
> +};<br>
> +<br>
> +/// \brief Detects encoding of the Text. If the Text can be decoded using UTF-8,<br>
> +/// it is considered UTF8, otherwise we treat it as some 8-bit encoding.<br>
> +inline Encoding detectEncoding(StringRef Text) {<br>
> +  const UTF8 *Ptr = reinterpret_cast<const UTF8 *>(Text.begin());<br>
> +  const UTF8 *BufEnd = reinterpret_cast<const UTF8 *>(Text.end());<br>
> +  if (::isLegalUTF8String(&Ptr, BufEnd))<br>
> +    return Encoding_UTF8;<br>
> +  return Encoding_Unknown;<br>
> +}<br>
> +<br>
> +inline unsigned getCodePointCountUTF8(StringRef Text) {<br>
> +  unsigned CodePoints = 0;<br>
> +  for (size_t i = 0, e = Text.size(); i < e; i += getNumBytesForUTF8(Text[i])) {<br>
> +    ++CodePoints;<br>
> +  }<br>
> +  return CodePoints;<br>
> +}<br>
> +<br>
> +/// \brief Gets the number of code points in the Text using the specified<br>
> +/// Encoding.<br>
> +inline unsigned getCodePointCount(StringRef Text, Encoding Encoding) {<br>
> +  switch (Encoding) {<br>
> +    case Encoding_UTF8:<br>
> +      return getCodePointCountUTF8(Text);<br>
> +    default:<br>
> +      return Text.size();<br>
> +  }<br>
> +}<br>
> +<br>
> +/// \brief Gets the number of bytes in a sequence representing a single<br>
> +/// codepoint and starting with FirstChar in the specified Encoding.<br>
> +inline unsigned getCodePointNumBytes(char FirstChar, Encoding Encoding) {<br>
> +  switch (Encoding) {<br>
> +    case Encoding_UTF8:<br>
> +      return getNumBytesForUTF8(FirstChar);<br>
> +    default:<br>
> +      return 1;<br>
> +  }<br>
> +}<br>
> +<br>
> +inline bool isOctDigit(char c) {<br>
> +  return '0' <= c && c <= '7';<br>
> +}<br>
> +<br>
> +inline bool isHexDigit(char c) {<br>
> +  return ('0' <= c && c <= '9') || ('a' <= c && c <= 'f') ||<br>
> +         ('A' <= c && c <= 'F');<br>
> +}<br>
> +<br>
> +/// \brief Gets the length of an escape sequence inside a C++ string literal.<br>
> +/// Text should span from the beginning of the escape sequence (starting with a<br>
> +/// backslash) to the end of the string literal.<br>
> +inline unsigned getEscapeSequenceLength(StringRef Text) {<br>
> +  assert(Text[0] == '\\');<br>
> +  if (Text.size() < 2)<br>
> +    return 1;<br>
> +<br>
> +  switch (Text[1]) {<br>
> +  case 'u':<br>
> +    return 6;<br>
> +  case 'U':<br>
> +    return 10;<br>
> +  case 'x': {<br>
> +    unsigned I = 2; // Point after '\x'.<br>
> +    while (I < Text.size() && isHexDigit(Text[I]))<br>
> +      ++I;<br>
> +    return I;<br>
> +  }<br>
> +  default:<br>
> +    if (isOctDigit(Text[1])) {<br>
> +      unsigned I = 1;<br>
> +      while (I < Text.size() && I < 4 && isOctDigit(Text[I]))<br>
> +        ++I;<br>
> +      return I;<br>
> +    }<br>
> +    return 2;<br>
> +  }<br>
> +}<br>
> +<br>
> +} // namespace encoding<br>
> +} // namespace format<br>
> +} // namespace clang<br>
> +<br>
> +#endif // LLVM_CLANG_FORMAT_ENCODING_H<br>
><br>
> Modified: cfe/trunk/lib/Format/Format.cpp<br>
> URL: <a href="http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/Format/Format.cpp?rev=183312&r1=183311&r2=183312&view=diff" target="_blank">http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/Format/Format.cpp?rev=183312&r1=183311&r2=183312&view=diff</a><br>





> ==============================================================================<br>
> --- cfe/trunk/lib/Format/Format.cpp (original)<br>
> +++ cfe/trunk/lib/Format/Format.cpp Wed Jun  5 09:09:10 2013<br>
> @@ -243,10 +243,11 @@ public:<br>
>    UnwrappedLineFormatter(const FormatStyle &Style, SourceManager &SourceMgr,<br>
>                           const AnnotatedLine &Line, unsigned FirstIndent,<br>
>                           const FormatToken *RootToken,<br>
> -                         WhitespaceManager &Whitespaces)<br>
> +                         WhitespaceManager &Whitespaces,<br>
> +                         encoding::Encoding Encoding)<br>
>        : Style(Style), SourceMgr(SourceMgr), Line(Line),<br>
>          FirstIndent(FirstIndent), RootToken(RootToken),<br>
> -        Whitespaces(Whitespaces), Count(0) {}<br>
> +        Whitespaces(Whitespaces), Count(0), Encoding(Encoding) {}<br>
><br>
>    /// \brief Formats an \c UnwrappedLine.<br>
>    void format(const AnnotatedLine *NextLine) {<br>
> @@ -484,7 +485,7 @@ private:<br>
>                                   State.NextToken->WhitespaceRange.getEnd()) -<br>
>                               SourceMgr.getSpellingColumnNumber(<br>
>                                   State.NextToken->WhitespaceRange.getBegin());<br>
> -      State.Column += WhitespaceLength + State.NextToken->TokenLength;<br>
> +      State.Column += WhitespaceLength + State.NextToken->CodePointCount;<br>
>        State.NextToken = State.NextToken->Next;<br>
>        return 0;<br>
>      }<br>
> @@ -520,11 +521,11 @@ private:<br>
>                    Line.StartsDefinition)) {<br>
>          State.Column = State.Stack.back().Indent;<br>
>        } else if (Current.Type == TT_ObjCSelectorName) {<br>
> -        if (State.Stack.back().ColonPos > Current.TokenLength) {<br>
> -          State.Column = State.Stack.back().ColonPos - Current.TokenLength;<br>
> +        if (State.Stack.back().ColonPos > Current.CodePointCount) {<br>
> +          State.Column = State.Stack.back().ColonPos - Current.CodePointCount;<br>
>          } else {<br>
>            State.Column = State.Stack.back().Indent;<br>
> -          State.Stack.back().ColonPos = State.Column + Current.TokenLength;<br>
> +          State.Stack.back().ColonPos = State.Column + Current.CodePointCount;<br>
>          }<br>
>        } else if (Current.Type == TT_StartOfName ||<br>
>                   Previous.isOneOf(tok::coloncolon, tok::equal) ||<br>
> @@ -560,7 +561,7 @@ private:<br>
>        State.Stack.back().LastSpace = State.Column;<br>
>        if (Current.isOneOf(tok::arrow, tok::period) &&<br>
>            Current.Type != TT_DesignatedInitializerPeriod)<br>
> -        State.Stack.back().LastSpace += Current.TokenLength;<br>
> +        State.Stack.back().LastSpace += Current.CodePointCount;<br>
>        State.StartOfLineLevel = State.ParenLevel;<br>
>        State.LowestCallLevel = State.ParenLevel;<br>
><br>
> @@ -595,8 +596,8 @@ private:<br>
>          State.Stack.back().VariablePos = State.Column;<br>
>          // Move over * and & if they are bound to the variable name.<br>
>          const FormatToken *Tok = &Previous;<br>
> -        while (Tok && State.Stack.back().VariablePos >= Tok->TokenLength) {<br>
> -          State.Stack.back().VariablePos -= Tok->TokenLength;<br>
> +        while (Tok && State.Stack.back().VariablePos >= Tok->CodePointCount) {<br>
> +          State.Stack.back().VariablePos -= Tok->CodePointCount;<br>
>            if (Tok->SpacesRequiredBefore != 0)<br>
>              break;<br>
>            Tok = Tok->Previous;<br>
> @@ -614,12 +615,12 @@ private:<br>
>        if (Current.Type == TT_ObjCSelectorName &&<br>
>            State.Stack.back().ColonPos == 0) {<br>
>          if (State.Stack.back().Indent + Current.LongestObjCSelectorName ><br>
> -            State.Column + Spaces + Current.TokenLength)<br>
> +            State.Column + Spaces + Current.CodePointCount)<br>
>            State.Stack.back().ColonPos =<br>
>                State.Stack.back().Indent + Current.LongestObjCSelectorName;<br>
>          else<br>
>            State.Stack.back().ColonPos =<br>
> -              State.Column + Spaces + Current.TokenLength;<br>
> +              State.Column + Spaces + Current.CodePointCount;<br>
>        }<br>
><br>
>        if (Previous.opensScope() && Previous.Type != TT_ObjCMethodExpr &&<br>
> @@ -671,7 +672,8 @@ private:<br>
>        State.LowestCallLevel = std::min(State.LowestCallLevel, State.ParenLevel);<br>
>        if (Line.Type == LT_BuilderTypeCall && State.ParenLevel == 0)<br>
>          State.Stack.back().StartOfFunctionCall =<br>
> -            Current.LastInChainOfCalls ? 0 : State.Column + Current.TokenLength;<br>
> +            Current.LastInChainOfCalls ? 0<br>
> +                                       : State.Column + Current.CodePointCount;<br>
>      }<br>
>      if (Current.Type == TT_CtorInitializerColon) {<br>
>        // Indent 2 from the column, so:<br>
> @@ -779,7 +781,7 @@ private:<br>
>        State.StartOfStringLiteral = 0;<br>
>      }<br>
><br>
> -    State.Column += Current.TokenLength;<br>
> +    State.Column += Current.CodePointCount;<br>
><br>
>      State.NextToken = State.NextToken->Next;<br>
><br>
> @@ -798,7 +800,7 @@ private:<br>
>                                  bool DryRun) {<br>
>      unsigned UnbreakableTailLength = Current.UnbreakableTailLength;<br>
>      llvm::OwningPtr<BreakableToken> Token;<br>
> -    unsigned StartColumn = State.Column - Current.TokenLength;<br>
> +    unsigned StartColumn = State.Column - Current.CodePointCount;<br>
>      unsigned OriginalStartColumn =<br>
>          SourceMgr.getSpellingColumnNumber(Current.getStartOfNonWhitespace()) -<br>
>          1;<br>
> @@ -811,15 +813,16 @@ private:<br>
>        if (!LiteralData || *LiteralData != '"')<br>
>          return 0;<br>
><br>
> -      Token.reset(new BreakableStringLiteral(Current, StartColumn));<br>
> +      Token.reset(new BreakableStringLiteral(Current, StartColumn, Encoding));<br>
>      } else if (Current.Type == TT_BlockComment) {<br>
>        BreakableBlockComment *BBC = new BreakableBlockComment(<br>
> -          Style, Current, StartColumn, OriginalStartColumn, !Current.Previous);<br>
> +          Style, Current, StartColumn, OriginalStartColumn, !Current.Previous,<br>
> +          Encoding);<br>
>        Token.reset(BBC);<br>
>      } else if (Current.Type == TT_LineComment &&<br>
>                 (Current.Previous == NULL ||<br>
>                  Current.Previous->Type != TT_ImplicitStringLiteral)) {<br>
> -      Token.reset(new BreakableLineComment(Current, StartColumn));<br>
> +      Token.reset(new BreakableLineComment(Current, StartColumn, Encoding));<br>
>      } else {<br>
>        return 0;<br>
>      }<br>
> @@ -837,27 +840,27 @@ private:<br>
>                                         Whitespaces);<br>
>        }<br>
>        unsigned TailOffset = 0;<br>
> -      unsigned RemainingTokenLength =<br>
> +      unsigned RemainingTokenColumns =<br>
>            Token->getLineLengthAfterSplit(LineIndex, TailOffset);<br>
> -      while (RemainingTokenLength > RemainingSpace) {<br>
> +      while (RemainingTokenColumns > RemainingSpace) {<br>
>          BreakableToken::Split Split =<br>
>              Token->getSplit(LineIndex, TailOffset, getColumnLimit());<br>
>          if (Split.first == StringRef::npos)<br>
>            break;<br>
>          assert(Split.first != 0);<br>
> -        unsigned NewRemainingTokenLength = Token->getLineLengthAfterSplit(<br>
> +        unsigned NewRemainingTokenColumns = Token->getLineLengthAfterSplit(<br>
>              LineIndex, TailOffset + Split.first + Split.second);<br>
> -        assert(NewRemainingTokenLength < RemainingTokenLength);<br>
> +        assert(NewRemainingTokenColumns < RemainingTokenColumns);<br>
>          if (!DryRun) {<br>
>            Token->insertBreak(LineIndex, TailOffset, Split, Line.InPPDirective,<br>
>                               Whitespaces);<br>
>          }<br>
>          TailOffset += Split.first + Split.second;<br>
> -        RemainingTokenLength = NewRemainingTokenLength;<br>
> +        RemainingTokenColumns = NewRemainingTokenColumns;<br>
>          Penalty += Style.PenaltyExcessCharacter;<br>
>          BreakInserted = true;<br>
>        }<br>
> -      PositionAfterLastLineInToken = RemainingTokenLength;<br>
> +      PositionAfterLastLineInToken = RemainingTokenColumns;<br>
>      }<br>
><br>
>      if (BreakInserted) {<br>
> @@ -1080,13 +1083,16 @@ private:<br>
>    // Increasing count of \c StateNode items we have created. This is used<br>
>    // to create a deterministic order independent of the container.<br>
>    unsigned Count;<br>
> +  encoding::Encoding Encoding;<br>
>  };<br>
><br>
>  class FormatTokenLexer {<br>
>  public:<br>
> -  FormatTokenLexer(Lexer &Lex, SourceManager &SourceMgr)<br>
> +  FormatTokenLexer(Lexer &Lex, SourceManager &SourceMgr,<br>
> +                   encoding::Encoding Encoding)<br>
>        : FormatTok(NULL), GreaterStashed(false), TrailingWhitespace(0), Lex(Lex),<br>
> -        SourceMgr(SourceMgr), IdentTable(Lex.getLangOpts()) {<br>
> +        SourceMgr(SourceMgr), IdentTable(Lex.getLangOpts()),<br>
> +        Encoding(Encoding) {<br>
>      Lex.SetKeepWhitespaceMode(true);<br>
>    }<br>
><br>
> @@ -1111,7 +1117,8 @@ private:<br>
>            FormatTok->Tok.getLocation().getLocWithOffset(1);<br>
>        FormatTok->WhitespaceRange =<br>
>            SourceRange(GreaterLocation, GreaterLocation);<br>
> -      FormatTok->TokenLength = 1;<br>
> +      FormatTok->ByteCount = 1;<br>
> +      FormatTok->CodePointCount = 1;<br>
>        GreaterStashed = false;<br>
>        return FormatTok;<br>
>      }<br>
> @@ -1146,12 +1153,12 @@ private:<br>
>      }<br>
><br>
>      // Now FormatTok is the next non-whitespace token.<br>
> -    FormatTok->TokenLength = Text.size();<br>
> +    FormatTok->ByteCount = Text.size();<br>
><br>
>      TrailingWhitespace = 0;<br>
>      if (FormatTok->Tok.is(tok::comment)) {<br>
>        TrailingWhitespace = Text.size() - Text.rtrim().size();<br>
> -      FormatTok->TokenLength -= TrailingWhitespace;<br>
> +      FormatTok->ByteCount -= TrailingWhitespace;<br>
>      }<br>
><br>
>      // In case the token starts with escaped newlines, we want to<br>
> @@ -1164,7 +1171,7 @@ private:<br>
>      while (i + 1 < Text.size() && Text[i] == '\\' && Text[i + 1] == '\n') {<br>
>        // FIXME: ++FormatTok->NewlinesBefore is missing...<br>
>        WhitespaceLength += 2;<br>
> -      FormatTok->TokenLength -= 2;<br>
> +      FormatTok->ByteCount -= 2;<br>
>        i += 2;<br>
>      }<br>
><br>
> @@ -1176,15 +1183,19 @@ private:<br>
><br>
>      if (FormatTok->Tok.is(tok::greatergreater)) {<br>
>        FormatTok->Tok.setKind(tok::greater);<br>
> -      FormatTok->TokenLength = 1;<br>
> +      FormatTok->ByteCount = 1;<br>
>        GreaterStashed = true;<br>
>      }<br>
><br>
> +    unsigned EncodingExtraBytes =<br>
> +        Text.size() - encoding::getCodePointCount(Text, Encoding);<br>
> +    FormatTok->CodePointCount = FormatTok->ByteCount - EncodingExtraBytes;<br>
> +<br>
>      FormatTok->WhitespaceRange = SourceRange(<br>
>          WhitespaceStart, WhitespaceStart.getLocWithOffset(WhitespaceLength));<br>
>      FormatTok->TokenText = StringRef(<br>
>          SourceMgr.getCharacterData(FormatTok->getStartOfNonWhitespace()),<br>
> -        FormatTok->TokenLength);<br>
> +        FormatTok->ByteCount);<br>
>      return FormatTok;<br>
>    }<br>
><br>
> @@ -1194,6 +1205,7 @@ private:<br>
>    Lexer &Lex;<br>
>    SourceManager &SourceMgr;<br>
>    IdentifierTable IdentTable;<br>
> +  encoding::Encoding Encoding;<br>
>    llvm::SpecificBumpPtrAllocator<FormatToken> Allocator;<br>
>    SmallVector<FormatToken *, 16> Tokens;<br>
><br>
> @@ -1209,17 +1221,22 @@ public:<br>
>    Formatter(const FormatStyle &Style, Lexer &Lex, SourceManager &SourceMgr,<br>
>              const std::vector<CharSourceRange> &Ranges)<br>
>        : Style(Style), Lex(Lex), SourceMgr(SourceMgr),<br>
> -        Whitespaces(SourceMgr, Style), Ranges(Ranges) {}<br>
> +        Whitespaces(SourceMgr, Style), Ranges(Ranges),<br>
> +        Encoding(encoding::detectEncoding(Lex.getBuffer())) {<br>
> +    DEBUG(llvm::dbgs()<br>
> +          << "File encoding: "<br>
> +          << (Encoding == encoding::Encoding_UTF8 ? "UTF8" : "unknown")<br>
> +          << "\n");<br>
> +  }<br>
><br>
>    virtual ~Formatter() {}<br>
><br>
>    tooling::Replacements format() {<br>
> -    FormatTokenLexer Tokens(Lex, SourceMgr);<br>
> +    FormatTokenLexer Tokens(Lex, SourceMgr, Encoding);<br>
><br>
>      UnwrappedLineParser Parser(Style, Tokens.lex(), *this);<br>
>      bool StructuralError = Parser.parse();<br>
> -    TokenAnnotator Annotator(Style, SourceMgr, Lex,<br>
> -                             Tokens.getIdentTable().get("in"));<br>
> +    TokenAnnotator Annotator(Style, Tokens.getIdentTable().get("in"));<br>
>      for (unsigned i = 0, e = AnnotatedLines.size(); i != e; ++i) {<br>
>        Annotator.annotate(AnnotatedLines[i]);<br>
>      }<br>
> @@ -1290,7 +1307,7 @@ public:<br>
>                1;<br>
>          }<br>
>          UnwrappedLineFormatter Formatter(Style, SourceMgr, TheLine, Indent,<br>
> -                                         TheLine.First, Whitespaces);<br>
> +                                         TheLine.First, Whitespaces, Encoding);<br>
>          Formatter.format(I + 1 != E ? &*(I + 1) : NULL);<br>
>          IndentForLevel[TheLine.Level] = LevelIndent;<br>
>          PreviousLineWasTouched = true;<br>
> @@ -1556,7 +1573,7 @@ private:<br>
>      CharSourceRange LineRange = CharSourceRange::getCharRange(<br>
>          First->WhitespaceRange.getBegin().getLocWithOffset(<br>
>              First->LastNewlineOffset),<br>
> -        Last->Tok.getLocation().getLocWithOffset(Last->TokenLength - 1));<br>
> +        Last->Tok.getLocation().getLocWithOffset(Last->ByteCount - 1));<br>
>      return touchesRanges(LineRange);<br>
>    }<br>
><br>
> @@ -1616,6 +1633,8 @@ private:<br>
>    WhitespaceManager Whitespaces;<br>
>    std::vector<CharSourceRange> Ranges;<br>
>    std::vector<AnnotatedLine> AnnotatedLines;<br>
> +<br>
> +  encoding::Encoding Encoding;<br>
>  };<br>
><br>
>  tooling::Replacements reformat(const FormatStyle &Style, Lexer &Lex,<br>
><br>
> Modified: cfe/trunk/lib/Format/FormatToken.h<br>
> URL: <a href="http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/Format/FormatToken.h?rev=183312&r1=183311&r2=183312&view=diff" target="_blank">http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/Format/FormatToken.h?rev=183312&r1=183311&r2=183312&view=diff</a><br>





> ==============================================================================<br>
> --- cfe/trunk/lib/Format/FormatToken.h (original)<br>
> +++ cfe/trunk/lib/Format/FormatToken.h Wed Jun  5 09:09:10 2013<br>
> @@ -61,11 +61,12 @@ enum TokenType {<br>
>  struct FormatToken {<br>
>    FormatToken()<br>
>        : NewlinesBefore(0), HasUnescapedNewline(false), LastNewlineOffset(0),<br>
> -        TokenLength(0), IsFirst(false), MustBreakBefore(false),<br>
> -        Type(TT_Unknown), SpacesRequiredBefore(0), CanBreakBefore(false),<br>
> -        ClosesTemplateDeclaration(false), ParameterCount(0), TotalLength(0),<br>
> -        UnbreakableTailLength(0), BindingStrength(0), SplitPenalty(0),<br>
> -        LongestObjCSelectorName(0), FakeRParens(0), LastInChainOfCalls(false),<br>
> +        ByteCount(0), CodePointCount(0), IsFirst(false),<br>
> +        MustBreakBefore(false), Type(TT_Unknown), SpacesRequiredBefore(0),<br>
> +        CanBreakBefore(false), ClosesTemplateDeclaration(false),<br>
> +        ParameterCount(0), TotalLength(0), UnbreakableTailLength(0),<br>
> +        BindingStrength(0), SplitPenalty(0), LongestObjCSelectorName(0),<br>
> +        FakeRParens(0), LastInChainOfCalls(false),<br>
>          PartOfMultiVariableDeclStmt(false), MatchingParen(NULL), Previous(NULL),<br>
>          Next(NULL) {}<br>
><br>
> @@ -89,10 +90,14 @@ struct FormatToken {<br>
>    /// whitespace (relative to \c WhiteSpaceStart). 0 if there is no '\n'.<br>
>    unsigned LastNewlineOffset;<br>
><br>
> -  /// \brief The length of the non-whitespace parts of the token. This is<br>
> -  /// necessary because we need to handle escaped newlines that are stored<br>
> +  /// \brief The number of bytes of the non-whitespace parts of the token. This<br>
> +  /// is necessary because we need to handle escaped newlines that are stored<br>
>    /// with the token.<br>
> -  unsigned TokenLength;<br>
> +  unsigned ByteCount;<br>
> +<br>
> +  /// \brief The length of the non-whitespace parts of the token in CodePoints.<br>
> +  /// We need this to correctly measure number of columns a token spans.<br>
> +  unsigned CodePointCount;<br>
><br>
>    /// \brief Indicates that this is the first token.<br>
>    bool IsFirst;<br>
><br>
> Modified: cfe/trunk/lib/Format/TokenAnnotator.cpp<br>
> URL: <a href="http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/Format/TokenAnnotator.cpp?rev=183312&r1=183311&r2=183312&view=diff" target="_blank">http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/Format/TokenAnnotator.cpp?rev=183312&r1=183311&r2=183312&view=diff</a><br>





> ==============================================================================<br>
> --- cfe/trunk/lib/Format/TokenAnnotator.cpp (original)<br>
> +++ cfe/trunk/lib/Format/TokenAnnotator.cpp Wed Jun  5 09:09:10 2013<br>
> @@ -15,7 +15,6 @@<br>
><br>
>  #include "TokenAnnotator.h"<br>
>  #include "clang/Basic/SourceManager.h"<br>
> -#include "clang/Lex/Lexer.h"<br>
>  #include "llvm/Support/Debug.h"<br>
><br>
>  namespace clang {<br>
> @@ -28,10 +27,9 @@ namespace format {<br>
>  /// into template parameter lists.<br>
>  class AnnotatingParser {<br>
>  public:<br>
> -  AnnotatingParser(SourceManager &SourceMgr, Lexer &Lex, AnnotatedLine &Line,<br>
> -                   IdentifierInfo &Ident_in)<br>
> -      : SourceMgr(SourceMgr), Lex(Lex), Line(Line), CurrentToken(Line.First),<br>
> -        KeywordVirtualFound(false), NameFound(false), Ident_in(Ident_in) {<br>
> +  AnnotatingParser(AnnotatedLine &Line, IdentifierInfo &Ident_in)<br>
> +      : Line(Line), CurrentToken(Line.First), KeywordVirtualFound(false),<br>
> +        NameFound(false), Ident_in(Ident_in) {<br>
>      Contexts.push_back(Context(tok::unknown, 1, /*IsExpression=*/ false));<br>
>    }<br>
><br>
> @@ -295,9 +293,11 @@ private:<br>
>                   Line.First->Type == TT_ObjCMethodSpecifier) {<br>
>          Tok->Type = TT_ObjCMethodExpr;<br>
>          Tok->Previous->Type = TT_ObjCSelectorName;<br>
> -        if (Tok->Previous->TokenLength ><br>
> -            Contexts.back().LongestObjCSelectorName)<br>
> -          Contexts.back().LongestObjCSelectorName = Tok->Previous->TokenLength;<br>
> +        if (Tok->Previous->CodePointCount ><br>
> +            Contexts.back().LongestObjCSelectorName) {<br>
> +          Contexts.back().LongestObjCSelectorName =<br>
> +              Tok->Previous->CodePointCount;<br>
> +        }<br>
>          if (Contexts.back().FirstObjCSelectorName == NULL)<br>
>            Contexts.back().FirstObjCSelectorName = Tok->Previous;<br>
>        } else if (Contexts.back().ColonIsForRangeExpr) {<br>
> @@ -602,9 +602,7 @@ private:<br>
>        } else if (Current.isBinaryOperator()) {<br>
>          Current.Type = TT_BinaryOperator;<br>
>        } else if (Current.is(tok::comment)) {<br>
> -        std::string Data(<br>
> -            Lexer::getSpelling(Current.Tok, SourceMgr, Lex.getLangOpts()));<br>
> -        if (StringRef(Data).startswith("//"))<br>
> +        if (Current.TokenText.startswith("//"))<br>
>            Current.Type = TT_LineComment;<br>
>          else<br>
>            Current.Type = TT_BlockComment;<br>
> @@ -748,23 +746,19 @@ private:<br>
>      case tok::kw_wchar_t:<br>
>      case tok::kw_bool:<br>
>      case tok::kw___underlying_type:<br>
> -      return true;<br>
>      case tok::annot_typename:<br>
>      case tok::kw_char16_t:<br>
>      case tok::kw_char32_t:<br>
>      case tok::kw_typeof:<br>
>      case tok::kw_decltype:<br>
> -      return Lex.getLangOpts().CPlusPlus;<br>
> +      return true;<br>
>      default:<br>
> -      break;<br>
> +      return false;<br>
>      }<br>
> -    return false;<br>
>    }<br>
><br>
>    SmallVector<Context, 8> Contexts;<br>
><br>
> -  SourceManager &SourceMgr;<br>
> -  Lexer &Lex;<br>
>    AnnotatedLine &Line;<br>
>    FormatToken *CurrentToken;<br>
>    bool KeywordVirtualFound;<br>
> @@ -866,7 +860,7 @@ private:<br>
>  };<br>
><br>
>  void TokenAnnotator::annotate(AnnotatedLine &Line) {<br>
> -  AnnotatingParser Parser(SourceMgr, Lex, Line, Ident_in);<br>
> +  AnnotatingParser Parser(Line, Ident_in);<br>
>    Line.Type = Parser.parseLine();<br>
>    if (Line.Type == LT_Invalid)<br>
>      return;<br>
> @@ -886,7 +880,7 @@ void TokenAnnotator::annotate(AnnotatedL<br>
>  }<br>
><br>
>  void TokenAnnotator::calculateFormattingInformation(AnnotatedLine &Line) {<br>
> -  Line.First->TotalLength = Line.First->TokenLength;<br>
> +  Line.First->TotalLength = Line.First->CodePointCount;<br>
>    if (!Line.First->Next)<br>
>      return;<br>
>    FormatToken *Current = Line.First->Next;<br>
> @@ -920,7 +914,7 @@ void TokenAnnotator::calculateFormatting<br>
>        Current->TotalLength = Current->Previous->TotalLength + Style.ColumnLimit;<br>
>      else<br>
>        Current->TotalLength =<br>
> -          Current->Previous->TotalLength + Current->TokenLength +<br>
> +          Current->Previous->TotalLength + Current->CodePointCount +<br>
>            Current->SpacesRequiredBefore;<br>
>      // FIXME: Only calculate this if CanBreakBefore is true once static<br>
>      // initializers etc. are sorted out.<br>
> @@ -947,7 +941,7 @@ void TokenAnnotator::calculateUnbreakabl<br>
>        UnbreakableTailLength = 0;<br>
>      } else {<br>
>        UnbreakableTailLength +=<br>
> -          Current->TokenLength + Current->SpacesRequiredBefore;<br>
> +          Current->CodePointCount + Current->SpacesRequiredBefore;<br>
>      }<br>
>      Current = Current->Previous;<br>
>    }<br>
> @@ -1015,8 +1009,7 @@ unsigned TokenAnnotator::splitPenalty(co<br>
><br>
>    if (Right.is(tok::lessless)) {<br>
>      if (Left.is(tok::string_literal)) {<br>
> -      StringRef Content =<br>
> -          StringRef(Left.Tok.getLiteralData(), Left.TokenLength);<br>
> +      StringRef Content = Left.TokenText;<br>
>        Content = Content.drop_back(1).drop_front(1).trim();<br>
>        if (Content.size() > 1 &&<br>
>            (Content.back() == ':' || Content.back() == '='))<br>
><br>
> Modified: cfe/trunk/lib/Format/TokenAnnotator.h<br>
> URL: <a href="http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/Format/TokenAnnotator.h?rev=183312&r1=183311&r2=183312&view=diff" target="_blank">http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/Format/TokenAnnotator.h?rev=183312&r1=183311&r2=183312&view=diff</a><br>





> ==============================================================================<br>
> --- cfe/trunk/lib/Format/TokenAnnotator.h (original)<br>
> +++ cfe/trunk/lib/Format/TokenAnnotator.h Wed Jun  5 09:09:10 2013<br>
> @@ -21,7 +21,6 @@<br>
>  #include <string><br>
><br>
>  namespace clang {<br>
> -class Lexer;<br>
>  class SourceManager;<br>
><br>
>  namespace format {<br>
> @@ -71,10 +70,8 @@ public:<br>
>  /// \c UnwrappedLine.<br>
>  class TokenAnnotator {<br>
>  public:<br>
> -  TokenAnnotator(const FormatStyle &Style, SourceManager &SourceMgr, Lexer &Lex,<br>
> -                 IdentifierInfo &Ident_in)<br>
> -      : Style(Style), SourceMgr(SourceMgr), Lex(Lex), Ident_in(Ident_in) {<br>
> -  }<br>
> +  TokenAnnotator(const FormatStyle &Style, IdentifierInfo &Ident_in)<br>
> +      : Style(Style), Ident_in(Ident_in) {}<br>
><br>
>    void annotate(AnnotatedLine &Line);<br>
>    void calculateFormattingInformation(AnnotatedLine &Line);<br>
> @@ -95,8 +92,6 @@ private:<br>
>    void calculateUnbreakableTailLengths(AnnotatedLine &Line);<br>
><br>
>    const FormatStyle &Style;<br>
> -  SourceManager &SourceMgr;<br>
> -  Lexer &Lex;<br>
><br>
>    // Contextual keywords:<br>
>    IdentifierInfo &Ident_in;<br>
><br>
> Modified: cfe/trunk/unittests/Format/FormatTest.cpp<br>
> URL: <a href="http://llvm.org/viewvc/llvm-project/cfe/trunk/unittests/Format/FormatTest.cpp?rev=183312&r1=183311&r2=183312&view=diff" target="_blank">http://llvm.org/viewvc/llvm-project/cfe/trunk/unittests/Format/FormatTest.cpp?rev=183312&r1=183311&r2=183312&view=diff</a><br>





> ==============================================================================<br>
> --- cfe/trunk/unittests/Format/FormatTest.cpp (original)<br>
> +++ cfe/trunk/unittests/Format/FormatTest.cpp Wed Jun  5 09:09:10 2013<br>
> @@ -4873,5 +4873,80 @@ TEST_F(FormatTest, ConfigurationRoundTri<br>
>    EXPECT_EQ(Style, ParsedStyle);<br>
>  }<br>
><br>
> +TEST_F(FormatTest, WorksFor8bitEncodings) {<br>
> +  EXPECT_EQ("\"\xce\xe4\xed\xe0\xe6\xe4\xfb \xe2 \"\n"<br>
> +            "\"\xf1\xf2\xf3\xe4\xb8\xed\xf3\xfe \"\n"<br>
> +            "\"\xe7\xe8\xec\xed\xfe\xfe \"\n"<br>
> +            "\"\xef\xee\xf0\xf3...\"",<br>
> +            format("\"\xce\xe4\xed\xe0\xe6\xe4\xfb \xe2 "<br>
> +                   "\xf1\xf2\xf3\xe4\xb8\xed\xf3\xfe \xe7\xe8\xec\xed\xfe\xfe "<br>
> +                   "\xef\xee\xf0\xf3...\"",<br>
> +                   getLLVMStyleWithColumns(12)));<br>
> +}<br>
> +<br>
> +TEST_F(FormatTest, CountsUTF8CharactersProperly) {<br>
> +  verifyFormat("\"Однажды в студёную зимнюю пору...\"",<br>
> +               getLLVMStyleWithColumns(35));<br>
> +  verifyFormat("\"一 二 三 四 五 六 七 八 九 十\"",<br>
> +               getLLVMStyleWithColumns(21));<br>
> +  verifyFormat("// Однажды в студёную зимнюю пору...",<br>
> +               getLLVMStyleWithColumns(36));<br>
> +  verifyFormat("// 一 二 三 四 五 六 七 八 九 十",<br>
> +               getLLVMStyleWithColumns(22));<br>
> +  verifyFormat("/* Однажды в студёную зимнюю пору... */",<br>
> +               getLLVMStyleWithColumns(39));<br>
> +  verifyFormat("/* 一 二 三 四 五 六 七 八 九 十 */",<br>
> +               getLLVMStyleWithColumns(25));<br>
> +}<br>
> +<br>
> +TEST_F(FormatTest, SplitsUTF8Strings) {<br>
> +  EXPECT_EQ(<br>
> +      "\"Однажды, в \"\n"<br>
> +      "\"студёную \"\n"<br>
> +      "\"зимнюю \"\n"<br>
> +      "\"пору,\"",<br>
> +      format("\"Однажды, в студёную зимнюю пору,\"",<br>
> +             getLLVMStyleWithColumns(13)));<br>
> +  EXPECT_EQ("\"一 二 三 四 \"\n"<br>
> +            "\"五 六 七 八 \"\n"<br>
> +            "\"九 十\"",<br>
> +            format("\"一 二 三 四 五 六 七 八 九 十\"",<br>
> +                   getLLVMStyleWithColumns(10)));<br>
> +}<br>
> +<br>
> +TEST_F(FormatTest, SplitsUTF8LineComments) {<br>
> +  EXPECT_EQ("// Я из лесу\n"<br>
> +            "// вышел; был\n"<br>
> +            "// сильный\n"<br>
> +            "// мороз.",<br>
> +            format("// Я из лесу вышел; был сильный мороз.",<br>
> +                   getLLVMStyleWithColumns(13)));<br>
> +  EXPECT_EQ("// 一二三\n"<br>
> +            "// 四五六七\n"<br>
> +            "// 八\n"<br>
> +            "// 九 十",<br>
> +            format("// 一二三 四五六七 八  九 十", getLLVMStyleWithColumns(6)));<br>
> +}<br>
> +<br>
> +TEST_F(FormatTest, SplitsUTF8BlockComments) {<br>
> +  EXPECT_EQ("/* Гляжу,\n"<br>
> +            " * поднимается\n"<br>
> +            " * медленно в\n"<br>
> +            " * гору\n"<br>
> +            " * Лошадка,\n"<br>
> +            " * везущая\n"<br>
> +            " * хворосту\n"<br>
> +            " * воз. */",<br>
> +            format("/* Гляжу, поднимается медленно в гору\n"<br>
> +                   " * Лошадка, везущая хворосту воз. */",<br>
> +                   getLLVMStyleWithColumns(13)));<br>
> +  EXPECT_EQ("/* 一二三\n"<br>
> +            " * 四五六七\n"<br>
> +            " * 八\n"<br>
> +            " * 九 十\n"<br>
> +            " */",<br>
> +            format("/* 一二三 四五六七 八  九 十 */", getLLVMStyleWithColumns(6)));<br>
> +}<br>
> +<br>
>  } // end namespace tooling<br>
>  } // end namespace clang<br>
><br>
><br>
> _______________________________________________<br>
> cfe-commits mailing list<br>
> <a href="mailto:cfe-commits@cs.uiuc.edu" target="_blank">cfe-commits@cs.uiuc.edu</a><br>
> <a href="http://lists.cs.uiuc.edu/mailman/listinfo/cfe-commits" target="_blank">http://lists.cs.uiuc.edu/mailman/listinfo/cfe-commits</a><br>
</div></div></blockquote></div></div></div><br>
</div></div>
<br></blockquote></div></div></div><br></div></div>
</blockquote></div></div><br>
</div></div>
<br></blockquote></div><br></div></div>
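<div>The quoted patch separates a token's <code>ByteCount</code> from its <code>CodePointCount</code>, so that column arithmetic no longer treats every byte of a multi-byte UTF-8 sequence as one column. A minimal sketch of that distinction follows. The function name <code>countCodePointsUTF8</code> is hypothetical and is not the patch's <code>encoding::getCodePointCount</code>; it also assumes the input is already valid UTF-8.</div>

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of the patch): counts code points in a string
// that is assumed to be valid UTF-8. Every byte that is not a continuation
// byte (bit pattern 10xxxxxx) starts a new code point.
std::size_t countCodePointsUTF8(const std::string &Text) {
  std::size_t Count = 0;
  for (unsigned char Byte : Text) {
    if ((Byte & 0xC0) != 0x80)
      ++Count;
  }
  return Count;
}
```

<div>For a Cyrillic letter such as "в" (bytes <code>0xD0 0xB2</code>), <code>std::string::size()</code> reports two bytes while this helper reports one code point, which is the gap between the byte count and the code-point count of a token. Code points are still only an approximation of display width, since some characters occupy more than one terminal column.</div>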