<div dir="ltr"><div>On Fri, Jun 7, 2013 at 6:52 AM, Alexander Kornienko <span dir="ltr"><<a href="mailto:alexfh@google.com" target="_blank">alexfh@google.com</a>></span> wrote:<br></div><div class="gmail_extra"><div class="gmail_quote">
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="im">On Fri, Jun 7, 2013 at 6:46 AM, Nico Weber <span dir="ltr"><<a href="mailto:thakis@chromium.org" target="_blank">thakis@chromium.org</a>></span> wrote:<br>
</div><div class="gmail_extra"><div class="gmail_quote"><div class="im">
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div dir="ltr"><div>On Thu, Jun 6, 2013 at 4:49 PM, Alexander Kornienko <span dir="ltr"><<a href="mailto:alexfh@google.com" target="_blank">alexfh@google.com</a>></span> wrote:<br>
</div><div class="gmail_extra"><div class="gmail_quote"><div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div dir="ltr"><div>On Thu, Jun 6, 2013 at 1:11 AM, NAKAMURA Takumi <span dir="ltr"><<a href="mailto:geek4civic@gmail.com" target="_blank">geek4civic@gmail.com</a>></span> wrote:<br>
</div><div class="gmail_extra"><div class="gmail_quote"><div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">I wonder whether the source file can contain UTF-8 characters.<br>
</blockquote><div><br></div></div><div>It's implementation-defined behavior. Apparently GCC and Clang handle this correctly.</div><div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">
In fact, without a BOM, MS cl.exe detects the charset from the system<br>
codepage (932) rather than the current codepage (65001).<br></blockquote><div> </div></div><div>It seems that adding a UTF-8 BOM is the only way to force MSVC to treat a source file as UTF-8. But this is not supported by GCC and Clang, AFAIK.</div>
</div></div></div></blockquote><div><br></div></div><div>clang's Lexer::InitLexer() skips BOMs.</div></div></div></div></blockquote><div><br></div></div><div>Sounds interesting. And <a href="http://stackoverflow.com/questions/7899795/is-it-possible-to-get-gcc-to-compile-utf-8-with-bom-source-files" target="_blank">here</a> they say that GCC also supports this. I've checked with Clang trunk and GCC 4.6.3, and it works. So are there any reasons not to just add a UTF-8 BOM?</div>
</div></div></div></blockquote><div><br></div><div style>The Unicode Standard 6.0 Core Spec says that use of a BOM is neither required nor recommended for UTF-8 (p.30). Treating the BOM as magic bytes indicating that the file is UTF-8 encoded seems too Microsoft-specific. So I guess adding a BOM unconditionally may not be a good idea.</div>
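<div style><br></div><div style>For reference, a minimal sketch of what "skipping a BOM" amounts to for a tool that reads source files. The helper names utf8BomLength and stripUtf8Bom are hypothetical and this is not the code in clang's Lexer; it only assumes that the UTF-8 BOM is the three bytes EF BB BF at the very start of the buffer.</div>

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of clang or clang-format): returns the number
// of bytes taken by a leading UTF-8 BOM (EF BB BF), or 0 if there is none.
inline std::size_t utf8BomLength(const std::string &Buffer) {
  static const char BOM[] = "\xEF\xBB\xBF";
  // compare() looks at no more than the first three bytes of Buffer, so
  // buffers shorter than the BOM simply compare unequal.
  return Buffer.compare(0, 3, BOM) == 0 ? 3 : 0;
}

// Hypothetical helper: returns a copy of Buffer without a leading BOM.
// Bytes after the BOM (or the whole buffer, if there is no BOM) are unchanged.
inline std::string stripUtf8Bom(const std::string &Buffer) {
  return Buffer.substr(utf8BomLength(Buffer));
}
```

<div style>A tool that accepts files both with and without a BOM could call stripUtf8Bom() on the buffer before any further processing, so the three BOM bytes are not counted as content in the first line.</div>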
<div style><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><div class="h5"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">
<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">
<div><div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">
<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><div>
<div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">
Could you get rid of raw utf8 characters and encode them in literals?<br>
FYI, I can see Cyrillic and CJK :)<br></blockquote></div><div><div><br></div><div>There's a plan to make some of our tests file-based instead of unit tests. I think the UTF-8 tests are the first candidates for this. As UTF-8 support is not the most important thing for Windows builds of clang-format, I'd leave the new tests just #ifdefed out for now. BTW, thanks for doing this.</div>
</div><div><div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">
<br>
...Takumi<br>
<br>
2013/6/5 Alexander Kornienko <<a href="mailto:alexfh@google.com" target="_blank">alexfh@google.com</a>>:<br>
<div><div>> Author: alexfh<br>
> Date: Wed Jun 5 09:09:10 2013<br>
> New Revision: 183312<br>
><br>
> URL: <a href="http://llvm.org/viewvc/llvm-project?rev=183312&view=rev" target="_blank">http://llvm.org/viewvc/llvm-project?rev=183312&view=rev</a><br>
> Log:<br>
> UTF-8 support for clang-format.<br>
><br>
> Summary:<br>
> Detect if the file is valid UTF-8, and if this is the case, count code<br>
> points instead of just using number of bytes in all (hopefully) places, where<br>
> number of columns is needed. In particular, use the new<br>
> FormatToken.CodePointCount instead of TokenLength where appropriate.<br>
> Changed BreakableToken implementations to respect utf-8 character boundaries<br>
> when in utf-8 mode.<br>
><br>
> Reviewers: klimek, djasper<br>
><br>
> Reviewed By: djasper<br>
><br>
> CC: cfe-commits, rsmith, gribozavr<br>
><br>
> Differential Revision: <a href="http://llvm-reviews.chandlerc.com/D918" target="_blank">http://llvm-reviews.chandlerc.com/D918</a><br>
><br>
> Added:<br>
> cfe/trunk/lib/Format/Encoding.h<br>
> Modified:<br>
> cfe/trunk/lib/Format/BreakableToken.cpp<br>
> cfe/trunk/lib/Format/BreakableToken.h<br>
> cfe/trunk/lib/Format/Format.cpp<br>
> cfe/trunk/lib/Format/FormatToken.h<br>
> cfe/trunk/lib/Format/TokenAnnotator.cpp<br>
> cfe/trunk/lib/Format/TokenAnnotator.h<br>
> cfe/trunk/unittests/Format/FormatTest.cpp<br>
><br>
> Modified: cfe/trunk/lib/Format/BreakableToken.cpp<br>
> URL: <a href="http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/Format/BreakableToken.cpp?rev=183312&r1=183311&r2=183312&view=diff" target="_blank">http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/Format/BreakableToken.cpp?rev=183312&r1=183311&r2=183312&view=diff</a><br>
> ==============================================================================<br>
> --- cfe/trunk/lib/Format/BreakableToken.cpp (original)<br>
> +++ cfe/trunk/lib/Format/BreakableToken.cpp Wed Jun 5 09:09:10 2013<br>
> @@ -25,66 +25,22 @@ namespace clang {<br>
> namespace format {<br>
> namespace {<br>
><br>
> -// FIXME: Move helper string functions to where it makes sense.<br>
> -<br>
> -unsigned getOctalLength(StringRef Text) {<br>
> - unsigned I = 1;<br>
> - while (I < Text.size() && I < 4 && (Text[I] >= '0' && Text[I] <= '7')) {<br>
> - ++I;<br>
> - }<br>
> - return I;<br>
> -}<br>
> -<br>
> -unsigned getHexLength(StringRef Text) {<br>
> - unsigned I = 2; // Point after '\x'.<br>
> - while (I < Text.size() && ((Text[I] >= '0' && Text[I] <= '9') ||<br>
> - (Text[I] >= 'a' && Text[I] <= 'f') ||<br>
> - (Text[I] >= 'A' && Text[I] <= 'F'))) {<br>
> - ++I;<br>
> - }<br>
> - return I;<br>
> -}<br>
> -<br>
> -unsigned getEscapeSequenceLength(StringRef Text) {<br>
> - assert(Text[0] == '\\');<br>
> - if (Text.size() < 2)<br>
> - return 1;<br>
> -<br>
> - switch (Text[1]) {<br>
> - case 'u':<br>
> - return 6;<br>
> - case 'U':<br>
> - return 10;<br>
> - case 'x':<br>
> - return getHexLength(Text);<br>
> - default:<br>
> - if (Text[1] >= '0' && Text[1] <= '7')<br>
> - return getOctalLength(Text);<br>
> - return 2;<br>
> - }<br>
> -}<br>
> -<br>
> -StringRef::size_type getStartOfCharacter(StringRef Text,<br>
> - StringRef::size_type Offset) {<br>
> - StringRef::size_type NextEscape = Text.find('\\');<br>
> - while (NextEscape != StringRef::npos && NextEscape < Offset) {<br>
> - StringRef::size_type SequenceLength =<br>
> - getEscapeSequenceLength(Text.substr(NextEscape));<br>
> - if (Offset < NextEscape + SequenceLength)<br>
> - return NextEscape;<br>
> - NextEscape = Text.find('\\', NextEscape + SequenceLength);<br>
> - }<br>
> - return Offset;<br>
> -}<br>
> -<br>
> BreakableToken::Split getCommentSplit(StringRef Text,<br>
> unsigned ContentStartColumn,<br>
> - unsigned ColumnLimit) {<br>
> + unsigned ColumnLimit,<br>
> + encoding::Encoding Encoding) {<br>
> if (ColumnLimit <= ContentStartColumn + 1)<br>
> return BreakableToken::Split(StringRef::npos, 0);<br>
><br>
> unsigned MaxSplit = ColumnLimit - ContentStartColumn + 1;<br>
> - StringRef::size_type SpaceOffset = Text.rfind(' ', MaxSplit);<br>
> + unsigned MaxSplitBytes = 0;<br>
> +<br>
> + for (unsigned NumChars = 0;<br>
> + NumChars < MaxSplit && MaxSplitBytes < Text.size(); ++NumChars)<br>
> + MaxSplitBytes +=<br>
> + encoding::getCodePointNumBytes(Text[MaxSplitBytes], Encoding);<br>
> +<br>
> + StringRef::size_type SpaceOffset = Text.rfind(' ', MaxSplitBytes);<br>
> if (SpaceOffset == StringRef::npos ||<br>
> // Don't break at leading whitespace.<br>
> Text.find_last_not_of(' ', SpaceOffset) == StringRef::npos) {<br>
> @@ -95,7 +51,7 @@ BreakableToken::Split getCommentSplit(St<br>
> // If the comment is only whitespace, we cannot split.<br>
> return BreakableToken::Split(StringRef::npos, 0);<br>
> SpaceOffset =<br>
> - Text.find(' ', std::max<unsigned>(MaxSplit, FirstNonWhitespace));<br>
> + Text.find(' ', std::max<unsigned>(MaxSplitBytes, FirstNonWhitespace));<br>
> }<br>
> if (SpaceOffset != StringRef::npos && SpaceOffset != 0) {<br>
> StringRef BeforeCut = Text.substr(0, SpaceOffset).rtrim();<br>
> @@ -108,25 +64,48 @@ BreakableToken::Split getCommentSplit(St<br>
><br>
> BreakableToken::Split getStringSplit(StringRef Text,<br>
> unsigned ContentStartColumn,<br>
> - unsigned ColumnLimit) {<br>
> -<br>
> - if (ColumnLimit <= ContentStartColumn)<br>
> - return BreakableToken::Split(StringRef::npos, 0);<br>
> - unsigned MaxSplit = ColumnLimit - ContentStartColumn;<br>
> + unsigned ColumnLimit,<br>
> + encoding::Encoding Encoding) {<br>
> // FIXME: Reduce unit test case.<br>
> if (Text.empty())<br>
> return BreakableToken::Split(StringRef::npos, 0);<br>
> - MaxSplit = std::min<unsigned>(MaxSplit, Text.size() - 1);<br>
> - StringRef::size_type SpaceOffset = Text.rfind(' ', MaxSplit);<br>
> - if (SpaceOffset != StringRef::npos && SpaceOffset != 0)<br>
> + if (ColumnLimit <= ContentStartColumn)<br>
> + return BreakableToken::Split(StringRef::npos, 0);<br>
> + unsigned MaxSplit =<br>
> + std::min<unsigned>(ColumnLimit - ContentStartColumn,<br>
> + encoding::getCodePointCount(Text, Encoding) - 1);<br>
> + StringRef::size_type SpaceOffset = 0;<br>
> + StringRef::size_type SlashOffset = 0;<br>
> + StringRef::size_type SplitPoint = 0;<br>
> + for (unsigned Chars = 0;;) {<br>
> + unsigned Advance;<br>
> + if (Text[0] == '\\') {<br>
> + Advance = encoding::getEscapeSequenceLength(Text);<br>
> + Chars += Advance;<br>
> + } else {<br>
> + Advance = encoding::getCodePointNumBytes(Text[0], Encoding);<br>
> + Chars += 1;<br>
> + }<br>
> +<br>
> + if (Chars > MaxSplit)<br>
> + break;<br>
> +<br>
> + if (Text[0] == ' ')<br>
> + SpaceOffset = SplitPoint;<br>
> + if (Text[0] == '/')<br>
> + SlashOffset = SplitPoint;<br>
> +<br>
> + SplitPoint += Advance;<br>
> + Text = Text.substr(Advance);<br>
> + }<br>
> +<br>
> + if (SpaceOffset != 0)<br>
> return BreakableToken::Split(SpaceOffset + 1, 0);<br>
> - StringRef::size_type SlashOffset = Text.rfind('/', MaxSplit);<br>
> - if (SlashOffset != StringRef::npos && SlashOffset != 0)<br>
> + if (SlashOffset != 0)<br>
> return BreakableToken::Split(SlashOffset + 1, 0);<br>
> - StringRef::size_type SplitPoint = getStartOfCharacter(Text, MaxSplit);<br>
> - if (SplitPoint == StringRef::npos || SplitPoint == 0)<br>
> - return BreakableToken::Split(StringRef::npos, 0);<br>
> - return BreakableToken::Split(SplitPoint, 0);<br>
> + if (SplitPoint != 0)<br>
> + return BreakableToken::Split(SplitPoint, 0);<br>
> + return BreakableToken::Split(StringRef::npos, 0);<br>
> }<br>
><br>
> } // namespace<br>
> @@ -136,8 +115,8 @@ unsigned BreakableSingleLineToken::getLi<br>
> unsigned<br>
> BreakableSingleLineToken::getLineLengthAfterSplit(unsigned LineIndex,<br>
> unsigned TailOffset) const {<br>
> - return StartColumn + Prefix.size() + Postfix.size() + Line.size() -<br>
> - TailOffset;<br>
> + return StartColumn + Prefix.size() + Postfix.size() +<br>
> + encoding::getCodePointCount(Line.substr(TailOffset), Encoding);<br>
> }<br>
><br>
> void BreakableSingleLineToken::insertBreak(unsigned LineIndex,<br>
> @@ -152,8 +131,9 @@ void BreakableSingleLineToken::insertBre<br>
> BreakableSingleLineToken::BreakableSingleLineToken(const FormatToken &Tok,<br>
> unsigned StartColumn,<br>
> StringRef Prefix,<br>
> - StringRef Postfix)<br>
> - : BreakableToken(Tok), StartColumn(StartColumn), Prefix(Prefix),<br>
> + StringRef Postfix,<br>
> + encoding::Encoding Encoding)<br>
> + : BreakableToken(Tok, Encoding), StartColumn(StartColumn), Prefix(Prefix),<br>
> Postfix(Postfix) {<br>
> assert(Tok.TokenText.startswith(Prefix) && Tok.TokenText.endswith(Postfix));<br>
> Line = Tok.TokenText.substr(<br>
> @@ -161,13 +141,15 @@ BreakableSingleLineToken::BreakableSingl<br>
> }<br>
><br>
> BreakableStringLiteral::BreakableStringLiteral(const FormatToken &Tok,<br>
> - unsigned StartColumn)<br>
> - : BreakableSingleLineToken(Tok, StartColumn, "\"", "\"") {}<br>
> + unsigned StartColumn,<br>
> + encoding::Encoding Encoding)<br>
> + : BreakableSingleLineToken(Tok, StartColumn, "\"", "\"", Encoding) {}<br>
><br>
> BreakableToken::Split<br>
> BreakableStringLiteral::getSplit(unsigned LineIndex, unsigned TailOffset,<br>
> unsigned ColumnLimit) const {<br>
> - return getStringSplit(Line.substr(TailOffset), StartColumn + 2, ColumnLimit);<br>
> + return getStringSplit(Line.substr(TailOffset), StartColumn + 2, ColumnLimit,<br>
> + Encoding);<br>
> }<br>
><br>
> static StringRef getLineCommentPrefix(StringRef Comment) {<br>
> @@ -179,23 +161,23 @@ static StringRef getLineCommentPrefix(St<br>
> }<br>
><br>
> BreakableLineComment::BreakableLineComment(const FormatToken &Token,<br>
> - unsigned StartColumn)<br>
> + unsigned StartColumn,<br>
> + encoding::Encoding Encoding)<br>
> : BreakableSingleLineToken(Token, StartColumn,<br>
> - getLineCommentPrefix(Token.TokenText), "") {}<br>
> + getLineCommentPrefix(Token.TokenText), "",<br>
> + Encoding) {}<br>
><br>
> BreakableToken::Split<br>
> BreakableLineComment::getSplit(unsigned LineIndex, unsigned TailOffset,<br>
> unsigned ColumnLimit) const {<br>
> return getCommentSplit(Line.substr(TailOffset), StartColumn + Prefix.size(),<br>
> - ColumnLimit);<br>
> + ColumnLimit, Encoding);<br>
> }<br>
><br>
> -BreakableBlockComment::BreakableBlockComment(const FormatStyle &Style,<br>
> - const FormatToken &Token,<br>
> - unsigned StartColumn,<br>
> - unsigned OriginalStartColumn,<br>
> - bool FirstInLine)<br>
> - : BreakableToken(Token) {<br>
> +BreakableBlockComment::BreakableBlockComment(<br>
> + const FormatStyle &Style, const FormatToken &Token, unsigned StartColumn,<br>
> + unsigned OriginalStartColumn, bool FirstInLine, encoding::Encoding Encoding)<br>
> + : BreakableToken(Token, Encoding) {<br>
> StringRef TokenText(Token.TokenText);<br>
> assert(TokenText.startswith("/*") && TokenText.endswith("*/"));<br>
> TokenText.substr(2, TokenText.size() - 4).split(Lines, "\n");<br>
> @@ -290,7 +272,8 @@ unsigned<br>
> BreakableBlockComment::getLineLengthAfterSplit(unsigned LineIndex,<br>
> unsigned TailOffset) const {<br>
> return getContentStartColumn(LineIndex, TailOffset) +<br>
> - (Lines[LineIndex].size() - TailOffset) +<br>
> + encoding::getCodePointCount(Lines[LineIndex].substr(TailOffset),<br>
> + Encoding) +<br>
> // The last line gets a "*/" postfix.<br>
> (LineIndex + 1 == Lines.size() ? 2 : 0);<br>
> }<br>
> @@ -300,7 +283,7 @@ BreakableBlockComment::getSplit(unsigned<br>
> unsigned ColumnLimit) const {<br>
> return getCommentSplit(Lines[LineIndex].substr(TailOffset),<br>
> getContentStartColumn(LineIndex, TailOffset),<br>
> - ColumnLimit);<br>
> + ColumnLimit, Encoding);<br>
> }<br>
><br>
> void BreakableBlockComment::insertBreak(unsigned LineIndex, unsigned TailOffset,<br>
><br>
> Modified: cfe/trunk/lib/Format/BreakableToken.h<br>
> URL: <a href="http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/Format/BreakableToken.h?rev=183312&r1=183311&r2=183312&view=diff" target="_blank">http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/Format/BreakableToken.h?rev=183312&r1=183311&r2=183312&view=diff</a><br>
> ==============================================================================<br>
> --- cfe/trunk/lib/Format/BreakableToken.h (original)<br>
> +++ cfe/trunk/lib/Format/BreakableToken.h Wed Jun 5 09:09:10 2013<br>
> @@ -17,6 +17,7 @@<br>
> #ifndef LLVM_CLANG_FORMAT_BREAKABLETOKEN_H<br>
> #define LLVM_CLANG_FORMAT_BREAKABLETOKEN_H<br>
><br>
> +#include "Encoding.h"<br>
> #include "TokenAnnotator.h"<br>
> #include "WhitespaceManager.h"<br>
> #include <utility><br>
> @@ -65,9 +66,11 @@ public:<br>
> WhitespaceManager &Whitespaces) {}<br>
><br>
> protected:<br>
> - BreakableToken(const FormatToken &Tok) : Tok(Tok) {}<br>
> + BreakableToken(const FormatToken &Tok, encoding::Encoding Encoding)<br>
> + : Tok(Tok), Encoding(Encoding) {}<br>
><br>
> const FormatToken &Tok;<br>
> + encoding::Encoding Encoding;<br>
> };<br>
><br>
> /// \brief Base class for single line tokens that can be broken.<br>
> @@ -83,7 +86,8 @@ public:<br>
><br>
> protected:<br>
> BreakableSingleLineToken(const FormatToken &Tok, unsigned StartColumn,<br>
> - StringRef Prefix, StringRef Postfix);<br>
> + StringRef Prefix, StringRef Postfix,<br>
> + encoding::Encoding Encoding);<br>
><br>
> // The column in which the token starts.<br>
> unsigned StartColumn;<br>
> @@ -101,7 +105,8 @@ public:<br>
> ///<br>
> /// \p StartColumn specifies the column in which the token will start<br>
> /// after formatting.<br>
> - BreakableStringLiteral(const FormatToken &Tok, unsigned StartColumn);<br>
> + BreakableStringLiteral(const FormatToken &Tok, unsigned StartColumn,<br>
> + encoding::Encoding Encoding);<br>
><br>
> virtual Split getSplit(unsigned LineIndex, unsigned TailOffset,<br>
> unsigned ColumnLimit) const;<br>
> @@ -113,7 +118,8 @@ public:<br>
> ///<br>
> /// \p StartColumn specifies the column in which the comment will start<br>
> /// after formatting.<br>
> - BreakableLineComment(const FormatToken &Token, unsigned StartColumn);<br>
> + BreakableLineComment(const FormatToken &Token, unsigned StartColumn,<br>
> + encoding::Encoding Encoding);<br>
><br>
> virtual Split getSplit(unsigned LineIndex, unsigned TailOffset,<br>
> unsigned ColumnLimit) const;<br>
> @@ -129,7 +135,7 @@ public:<br>
> /// If the comment starts a line after formatting, set \p FirstInLine to true.<br>
> BreakableBlockComment(const FormatStyle &Style, const FormatToken &Token,<br>
> unsigned StartColumn, unsigned OriginaStartColumn,<br>
> - bool FirstInLine);<br>
> + bool FirstInLine, encoding::Encoding Encoding);<br>
><br>
> virtual unsigned getLineCount() const;<br>
> virtual unsigned getLineLengthAfterSplit(unsigned LineIndex,<br>
><br>
> Added: cfe/trunk/lib/Format/Encoding.h<br>
> URL: <a href="http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/Format/Encoding.h?rev=183312&view=auto" target="_blank">http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/Format/Encoding.h?rev=183312&view=auto</a><br>
> ==============================================================================<br>
> --- cfe/trunk/lib/Format/Encoding.h (added)<br>
> +++ cfe/trunk/lib/Format/Encoding.h Wed Jun 5 09:09:10 2013<br>
> @@ -0,0 +1,114 @@<br>
> +//===--- Encoding.h - Format C++ code -------------------------------------===//<br>
> +//<br>
> +// The LLVM Compiler Infrastructure<br>
> +//<br>
> +// This file is distributed under the University of Illinois Open Source<br>
> +// License. See LICENSE.TXT for details.<br>
> +//<br>
> +//===----------------------------------------------------------------------===//<br>
> +///<br>
> +/// \file<br>
> +/// \brief Contains functions for text encoding manipulation. Supports UTF-8,<br>
> +/// 8-bit encodings and escape sequences in C++ string literals.<br>
> +///<br>
> +//===----------------------------------------------------------------------===//<br>
> +<br>
> +#ifndef LLVM_CLANG_FORMAT_ENCODING_H<br>
> +#define LLVM_CLANG_FORMAT_ENCODING_H<br>
> +<br>
> +#include "clang/Basic/LLVM.h"<br>
> +#include "llvm/Support/ConvertUTF.h"<br>
> +<br>
> +namespace clang {<br>
> +namespace format {<br>
> +namespace encoding {<br>
> +<br>
> +enum Encoding {<br>
> + Encoding_UTF8,<br>
> + Encoding_Unknown // We treat all other encodings as 8-bit encodings.<br>
> +};<br>
> +<br>
> +/// \brief Detects encoding of the Text. If the Text can be decoded using UTF-8,<br>
> +/// it is considered UTF8, otherwise we treat it as some 8-bit encoding.<br>
> +inline Encoding detectEncoding(StringRef Text) {<br>
> + const UTF8 *Ptr = reinterpret_cast<const UTF8 *>(Text.begin());<br>
> + const UTF8 *BufEnd = reinterpret_cast<const UTF8 *>(Text.end());<br>
> + if (::isLegalUTF8String(&Ptr, BufEnd))<br>
> + return Encoding_UTF8;<br>
> + return Encoding_Unknown;<br>
> +}<br>
> +<br>
> +inline unsigned getCodePointCountUTF8(StringRef Text) {<br>
> + unsigned CodePoints = 0;<br>
> + for (size_t i = 0, e = Text.size(); i < e; i += getNumBytesForUTF8(Text[i])) {<br>
> + ++CodePoints;<br>
> + }<br>
> + return CodePoints;<br>
> +}<br>
> +<br>
> +/// \brief Gets the number of code points in the Text using the specified<br>
> +/// Encoding.<br>
> +inline unsigned getCodePointCount(StringRef Text, Encoding Encoding) {<br>
> + switch (Encoding) {<br>
> + case Encoding_UTF8:<br>
> + return getCodePointCountUTF8(Text);<br>
> + default:<br>
> + return Text.size();<br>
> + }<br>
> +}<br>
> +<br>
> +/// \brief Gets the number of bytes in a sequence representing a single<br>
> +/// codepoint and starting with FirstChar in the specified Encoding.<br>
> +inline unsigned getCodePointNumBytes(char FirstChar, Encoding Encoding) {<br>
> + switch (Encoding) {<br>
> + case Encoding_UTF8:<br>
> + return getNumBytesForUTF8(FirstChar);<br>
> + default:<br>
> + return 1;<br>
> + }<br>
> +}<br>
> +<br>
> +inline bool isOctDigit(char c) {<br>
> + return '0' <= c && c <= '7';<br>
> +}<br>
> +<br>
> +inline bool isHexDigit(char c) {<br>
> + return ('0' <= c && c <= '9') || ('a' <= c && c <= 'f') ||<br>
> + ('A' <= c && c <= 'F');<br>
> +}<br>
> +<br>
> +/// \brief Gets the length of an escape sequence inside a C++ string literal.<br>
> +/// Text should span from the beginning of the escape sequence (starting with a<br>
> +/// backslash) to the end of the string literal.<br>
> +inline unsigned getEscapeSequenceLength(StringRef Text) {<br>
> + assert(Text[0] == '\\');<br>
> + if (Text.size() < 2)<br>
> + return 1;<br>
> +<br>
> + switch (Text[1]) {<br>
> + case 'u':<br>
> + return 6;<br>
> + case 'U':<br>
> + return 10;<br>
> + case 'x': {<br>
> + unsigned I = 2; // Point after '\x'.<br>
> + while (I < Text.size() && isHexDigit(Text[I]))<br>
> + ++I;<br>
> + return I;<br>
> + }<br>
> + default:<br>
> + if (isOctDigit(Text[1])) {<br>
> + unsigned I = 1;<br>
> + while (I < Text.size() && I < 4 && isOctDigit(Text[I]))<br>
> + ++I;<br>
> + return I;<br>
> + }<br>
> + return 2;<br>
> + }<br>
> +}<br>
> +<br>
> +} // namespace encoding<br>
> +} // namespace format<br>
> +} // namespace clang<br>
> +<br>
> +#endif // LLVM_CLANG_FORMAT_ENCODING_H<br>
><br>
> Modified: cfe/trunk/lib/Format/Format.cpp<br>
> URL: <a href="http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/Format/Format.cpp?rev=183312&r1=183311&r2=183312&view=diff" target="_blank">http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/Format/Format.cpp?rev=183312&r1=183311&r2=183312&view=diff</a><br>
> ==============================================================================<br>
> --- cfe/trunk/lib/Format/Format.cpp (original)<br>
> +++ cfe/trunk/lib/Format/Format.cpp Wed Jun 5 09:09:10 2013<br>
> @@ -243,10 +243,11 @@ public:<br>
> UnwrappedLineFormatter(const FormatStyle &Style, SourceManager &SourceMgr,<br>
> const AnnotatedLine &Line, unsigned FirstIndent,<br>
> const FormatToken *RootToken,<br>
> - WhitespaceManager &Whitespaces)<br>
> + WhitespaceManager &Whitespaces,<br>
> + encoding::Encoding Encoding)<br>
> : Style(Style), SourceMgr(SourceMgr), Line(Line),<br>
> FirstIndent(FirstIndent), RootToken(RootToken),<br>
> - Whitespaces(Whitespaces), Count(0) {}<br>
> + Whitespaces(Whitespaces), Count(0), Encoding(Encoding) {}<br>
><br>
> /// \brief Formats an \c UnwrappedLine.<br>
> void format(const AnnotatedLine *NextLine) {<br>
> @@ -484,7 +485,7 @@ private:<br>
> State.NextToken->WhitespaceRange.getEnd()) -<br>
> SourceMgr.getSpellingColumnNumber(<br>
> State.NextToken->WhitespaceRange.getBegin());<br>
> - State.Column += WhitespaceLength + State.NextToken->TokenLength;<br>
> + State.Column += WhitespaceLength + State.NextToken->CodePointCount;<br>
> State.NextToken = State.NextToken->Next;<br>
> return 0;<br>
> }<br>
> @@ -520,11 +521,11 @@ private:<br>
> Line.StartsDefinition)) {<br>
> State.Column = State.Stack.back().Indent;<br>
> } else if (Current.Type == TT_ObjCSelectorName) {<br>
> - if (State.Stack.back().ColonPos > Current.TokenLength) {<br>
> - State.Column = State.Stack.back().ColonPos - Current.TokenLength;<br>
> + if (State.Stack.back().ColonPos > Current.CodePointCount) {<br>
> + State.Column = State.Stack.back().ColonPos - Current.CodePointCount;<br>
> } else {<br>
> State.Column = State.Stack.back().Indent;<br>
> - State.Stack.back().ColonPos = State.Column + Current.TokenLength;<br>
> + State.Stack.back().ColonPos = State.Column + Current.CodePointCount;<br>
> }<br>
> } else if (Current.Type == TT_StartOfName ||<br>
> Previous.isOneOf(tok::coloncolon, tok::equal) ||<br>
> @@ -560,7 +561,7 @@ private:<br>
> State.Stack.back().LastSpace = State.Column;<br>
> if (Current.isOneOf(tok::arrow, tok::period) &&<br>
> Current.Type != TT_DesignatedInitializerPeriod)<br>
> - State.Stack.back().LastSpace += Current.TokenLength;<br>
> + State.Stack.back().LastSpace += Current.CodePointCount;<br>
> State.StartOfLineLevel = State.ParenLevel;<br>
> State.LowestCallLevel = State.ParenLevel;<br>
><br>
> @@ -595,8 +596,8 @@ private:<br>
> State.Stack.back().VariablePos = State.Column;<br>
> // Move over * and & if they are bound to the variable name.<br>
> const FormatToken *Tok = &Previous;<br>
> - while (Tok && State.Stack.back().VariablePos >= Tok->TokenLength) {<br>
> - State.Stack.back().VariablePos -= Tok->TokenLength;<br>
> + while (Tok && State.Stack.back().VariablePos >= Tok->CodePointCount) {<br>
> + State.Stack.back().VariablePos -= Tok->CodePointCount;<br>
> if (Tok->SpacesRequiredBefore != 0)<br>
> break;<br>
> Tok = Tok->Previous;<br>
> @@ -614,12 +615,12 @@ private:<br>
> if (Current.Type == TT_ObjCSelectorName &&<br>
> State.Stack.back().ColonPos == 0) {<br>
> if (State.Stack.back().Indent + Current.LongestObjCSelectorName ><br>
> - State.Column + Spaces + Current.TokenLength)<br>
> + State.Column + Spaces + Current.CodePointCount)<br>
> State.Stack.back().ColonPos =<br>
> State.Stack.back().Indent + Current.LongestObjCSelectorName;<br>
> else<br>
> State.Stack.back().ColonPos =<br>
> - State.Column + Spaces + Current.TokenLength;<br>
> + State.Column + Spaces + Current.CodePointCount;<br>
> }<br>
><br>
> if (Previous.opensScope() && Previous.Type != TT_ObjCMethodExpr &&<br>
> @@ -671,7 +672,8 @@ private:<br>
> State.LowestCallLevel = std::min(State.LowestCallLevel, State.ParenLevel);<br>
> if (Line.Type == LT_BuilderTypeCall && State.ParenLevel == 0)<br>
> State.Stack.back().StartOfFunctionCall =<br>
> - Current.LastInChainOfCalls ? 0 : State.Column + Current.TokenLength;<br>
> + Current.LastInChainOfCalls ? 0<br>
> + : State.Column + Current.CodePointCount;<br>
> }<br>
> if (Current.Type == TT_CtorInitializerColon) {<br>
> // Indent 2 from the column, so:<br>
> @@ -779,7 +781,7 @@ private:<br>
> State.StartOfStringLiteral = 0;<br>
> }<br>
><br>
> - State.Column += Current.TokenLength;<br>
> + State.Column += Current.CodePointCount;<br>
><br>
> State.NextToken = State.NextToken->Next;<br>
><br>
> @@ -798,7 +800,7 @@ private:<br>
> bool DryRun) {<br>
> unsigned UnbreakableTailLength = Current.UnbreakableTailLength;<br>
> llvm::OwningPtr<BreakableToken> Token;<br>
> - unsigned StartColumn = State.Column - Current.TokenLength;<br>
> + unsigned StartColumn = State.Column - Current.CodePointCount;<br>
> unsigned OriginalStartColumn =<br>
> SourceMgr.getSpellingColumnNumber(Current.getStartOfNonWhitespace()) -<br>
> 1;<br>
> @@ -811,15 +813,16 @@ private:<br>
> if (!LiteralData || *LiteralData != '"')<br>
> return 0;<br>
><br>
> - Token.reset(new BreakableStringLiteral(Current, StartColumn));<br>
> + Token.reset(new BreakableStringLiteral(Current, StartColumn, Encoding));<br>
> } else if (Current.Type == TT_BlockComment) {<br>
> BreakableBlockComment *BBC = new BreakableBlockComment(<br>
> - Style, Current, StartColumn, OriginalStartColumn, !Current.Previous);<br>
> + Style, Current, StartColumn, OriginalStartColumn, !Current.Previous,<br>
> + Encoding);<br>
> Token.reset(BBC);<br>
> } else if (Current.Type == TT_LineComment &&<br>
> (Current.Previous == NULL ||<br>
> Current.Previous->Type != TT_ImplicitStringLiteral)) {<br>
> - Token.reset(new BreakableLineComment(Current, StartColumn));<br>
> + Token.reset(new BreakableLineComment(Current, StartColumn, Encoding));<br>
> } else {<br>
> return 0;<br>
> }<br>
> @@ -837,27 +840,27 @@ private:<br>
> Whitespaces);<br>
> }<br>
> unsigned TailOffset = 0;<br>
> - unsigned RemainingTokenLength =<br>
> + unsigned RemainingTokenColumns =<br>
> Token->getLineLengthAfterSplit(LineIndex, TailOffset);<br>
> - while (RemainingTokenLength > RemainingSpace) {<br>
> + while (RemainingTokenColumns > RemainingSpace) {<br>
> BreakableToken::Split Split =<br>
> Token->getSplit(LineIndex, TailOffset, getColumnLimit());<br>
> if (Split.first == StringRef::npos)<br>
> break;<br>
> assert(Split.first != 0);<br>
> - unsigned NewRemainingTokenLength = Token->getLineLengthAfterSplit(<br>
> + unsigned NewRemainingTokenColumns = Token->getLineLengthAfterSplit(<br>
> LineIndex, TailOffset + Split.first + Split.second);<br>
> - assert(NewRemainingTokenLength < RemainingTokenLength);<br>
> + assert(NewRemainingTokenColumns < RemainingTokenColumns);<br>
> if (!DryRun) {<br>
> Token->insertBreak(LineIndex, TailOffset, Split, Line.InPPDirective,<br>
> Whitespaces);<br>
> }<br>
> TailOffset += Split.first + Split.second;<br>
> - RemainingTokenLength = NewRemainingTokenLength;<br>
> + RemainingTokenColumns = NewRemainingTokenColumns;<br>
> Penalty += Style.PenaltyExcessCharacter;<br>
> BreakInserted = true;<br>
> }<br>
> - PositionAfterLastLineInToken = RemainingTokenLength;<br>
> + PositionAfterLastLineInToken = RemainingTokenColumns;<br>
> }<br>
><br>
> if (BreakInserted) {<br>
> @@ -1080,13 +1083,16 @@ private:<br>
> // Increasing count of \c StateNode items we have created. This is used<br>
> // to create a deterministic order independent of the container.<br>
> unsigned Count;<br>
> + encoding::Encoding Encoding;<br>
> };<br>
><br>
> class FormatTokenLexer {<br>
> public:<br>
> - FormatTokenLexer(Lexer &Lex, SourceManager &SourceMgr)<br>
> + FormatTokenLexer(Lexer &Lex, SourceManager &SourceMgr,<br>
> + encoding::Encoding Encoding)<br>
> : FormatTok(NULL), GreaterStashed(false), TrailingWhitespace(0), Lex(Lex),<br>
> - SourceMgr(SourceMgr), IdentTable(Lex.getLangOpts()) {<br>
> + SourceMgr(SourceMgr), IdentTable(Lex.getLangOpts()),<br>
> + Encoding(Encoding) {<br>
> Lex.SetKeepWhitespaceMode(true);<br>
> }<br>
><br>
> @@ -1111,7 +1117,8 @@ private:<br>
> FormatTok->Tok.getLocation().getLocWithOffset(1);<br>
> FormatTok->WhitespaceRange =<br>
> SourceRange(GreaterLocation, GreaterLocation);<br>
> - FormatTok->TokenLength = 1;<br>
> + FormatTok->ByteCount = 1;<br>
> + FormatTok->CodePointCount = 1;<br>
> GreaterStashed = false;<br>
> return FormatTok;<br>
> }<br>
> @@ -1146,12 +1153,12 @@ private:<br>
> }<br>
><br>
> // Now FormatTok is the next non-whitespace token.<br>
> - FormatTok->TokenLength = Text.size();<br>
> + FormatTok->ByteCount = Text.size();<br>
><br>
> TrailingWhitespace = 0;<br>
> if (FormatTok->Tok.is(tok::comment)) {<br>
> TrailingWhitespace = Text.size() - Text.rtrim().size();<br>
> - FormatTok->TokenLength -= TrailingWhitespace;<br>
> + FormatTok->ByteCount -= TrailingWhitespace;<br>
> }<br>
><br>
> // In case the token starts with escaped newlines, we want to<br>
> @@ -1164,7 +1171,7 @@ private:<br>
> while (i + 1 < Text.size() && Text[i] == '\\' && Text[i + 1] == '\n') {<br>
> // FIXME: ++FormatTok->NewlinesBefore is missing...<br>
> WhitespaceLength += 2;<br>
> - FormatTok->TokenLength -= 2;<br>
> + FormatTok->ByteCount -= 2;<br>
> i += 2;<br>
> }<br>
><br>
> @@ -1176,15 +1183,19 @@ private:<br>
><br>
> if (FormatTok->Tok.is(tok::greatergreater)) {<br>
> FormatTok->Tok.setKind(tok::greater);<br>
> - FormatTok->TokenLength = 1;<br>
> + FormatTok->ByteCount = 1;<br>
> GreaterStashed = true;<br>
> }<br>
><br>
> + unsigned EncodingExtraBytes =<br>
> + Text.size() - encoding::getCodePointCount(Text, Encoding);<br>
> + FormatTok->CodePointCount = FormatTok->ByteCount - EncodingExtraBytes;<br>
> +<br>
> FormatTok->WhitespaceRange = SourceRange(<br>
> WhitespaceStart, WhitespaceStart.getLocWithOffset(WhitespaceLength));<br>
> FormatTok->TokenText = StringRef(<br>
> SourceMgr.getCharacterData(FormatTok->getStartOfNonWhitespace()),<br>
> - FormatTok->TokenLength);<br>
> + FormatTok->ByteCount);<br>
> return FormatTok;<br>
> }<br>
><br>
> @@ -1194,6 +1205,7 @@ private:<br>
> Lexer &Lex;<br>
> SourceManager &SourceMgr;<br>
> IdentifierTable IdentTable;<br>
> + encoding::Encoding Encoding;<br>
> llvm::SpecificBumpPtrAllocator<FormatToken> Allocator;<br>
> SmallVector<FormatToken *, 16> Tokens;<br>
><br>
> @@ -1209,17 +1221,22 @@ public:<br>
> Formatter(const FormatStyle &Style, Lexer &Lex, SourceManager &SourceMgr,<br>
> const std::vector<CharSourceRange> &Ranges)<br>
> : Style(Style), Lex(Lex), SourceMgr(SourceMgr),<br>
> - Whitespaces(SourceMgr, Style), Ranges(Ranges) {}<br>
> + Whitespaces(SourceMgr, Style), Ranges(Ranges),<br>
> + Encoding(encoding::detectEncoding(Lex.getBuffer())) {<br>
> + DEBUG(llvm::dbgs()<br>
> + << "File encoding: "<br>
> + << (Encoding == encoding::Encoding_UTF8 ? "UTF8" : "unknown")<br>
> + << "\n");<br>
> + }<br>
><br>
> virtual ~Formatter() {}<br>
><br>
> tooling::Replacements format() {<br>
> - FormatTokenLexer Tokens(Lex, SourceMgr);<br>
> + FormatTokenLexer Tokens(Lex, SourceMgr, Encoding);<br>
><br>
> UnwrappedLineParser Parser(Style, Tokens.lex(), *this);<br>
> bool StructuralError = Parser.parse();<br>
> - TokenAnnotator Annotator(Style, SourceMgr, Lex,<br>
> - Tokens.getIdentTable().get("in"));<br>
> + TokenAnnotator Annotator(Style, Tokens.getIdentTable().get("in"));<br>
> for (unsigned i = 0, e = AnnotatedLines.size(); i != e; ++i) {<br>
> Annotator.annotate(AnnotatedLines[i]);<br>
> }<br>
> @@ -1290,7 +1307,7 @@ public:<br>
> 1;<br>
> }<br>
> UnwrappedLineFormatter Formatter(Style, SourceMgr, TheLine, Indent,<br>
> - TheLine.First, Whitespaces);<br>
> + TheLine.First, Whitespaces, Encoding);<br>
> Formatter.format(I + 1 != E ? &*(I + 1) : NULL);<br>
> IndentForLevel[TheLine.Level] = LevelIndent;<br>
> PreviousLineWasTouched = true;<br>
> @@ -1556,7 +1573,7 @@ private:<br>
> CharSourceRange LineRange = CharSourceRange::getCharRange(<br>
> First->WhitespaceRange.getBegin().getLocWithOffset(<br>
> First->LastNewlineOffset),<br>
> - Last->Tok.getLocation().getLocWithOffset(Last->TokenLength - 1));<br>
> + Last->Tok.getLocation().getLocWithOffset(Last->ByteCount - 1));<br>
> return touchesRanges(LineRange);<br>
> }<br>
><br>
> @@ -1616,6 +1633,8 @@ private:<br>
> WhitespaceManager Whitespaces;<br>
> std::vector<CharSourceRange> Ranges;<br>
> std::vector<AnnotatedLine> AnnotatedLines;<br>
> +<br>
> + encoding::Encoding Encoding;<br>
> };<br>
><br>
> tooling::Replacements reformat(const FormatStyle &Style, Lexer &Lex,<br>
><br>
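A note on the detectEncoding() call in the Formatter constructor above: the debug output only distinguishes "UTF8" from "unknown", so the check is presumably a validity test over the whole buffer. The following is a minimal sketch of such a test, not the code from this commit. The name looksLikeUTF8 is hypothetical, and the sketch checks only the lead/continuation byte structure (overlong 3- and 4-byte forms and surrogates are not rejected).<br>

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (NOT the patch's encoding::detectEncoding): returns true
// if every sequence in Text has a well-formed UTF-8 lead/continuation layout.
// Assumption/simplification: overlong 3- and 4-byte forms and UTF-16
// surrogates are not rejected here.
bool looksLikeUTF8(const std::string &Text) {
  std::size_t I = 0, E = Text.size();
  while (I < E) {
    unsigned char C = static_cast<unsigned char>(Text[I]);
    std::size_t Extra; // number of continuation bytes expected after C
    if (C < 0x80)
      Extra = 0; // ASCII
    else if (C >= 0xC2 && C <= 0xDF)
      Extra = 1;
    else if (C >= 0xE0 && C <= 0xEF)
      Extra = 2;
    else if (C >= 0xF0 && C <= 0xF4)
      Extra = 3;
    else
      return false; // stray continuation byte or invalid lead byte
    if (Extra > E - I - 1)
      return false; // sequence is truncated at the end of the buffer
    for (std::size_t J = 1; J <= Extra; ++J) {
      unsigned char N = static_cast<unsigned char>(Text[I + J]);
      if ((N & 0xC0) != 0x80)
        return false; // expected a 10xxxxxx continuation byte
    }
    I += Extra + 1;
  }
  return true;
}
```

Under this sketch, CP1251 bytes such as the ones in the WorksFor8bitEncodings test fail the check (0xCE is a two-byte lead, but 0xE4 is not a continuation byte), so a formatter could fall back to treating each byte as one column.<br>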
> Modified: cfe/trunk/lib/Format/FormatToken.h<br>
> URL: <a href="http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/Format/FormatToken.h?rev=183312&r1=183311&r2=183312&view=diff" target="_blank">http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/Format/FormatToken.h?rev=183312&r1=183311&r2=183312&view=diff</a><br>
> ==============================================================================<br>
> --- cfe/trunk/lib/Format/FormatToken.h (original)<br>
> +++ cfe/trunk/lib/Format/FormatToken.h Wed Jun 5 09:09:10 2013<br>
> @@ -61,11 +61,12 @@ enum TokenType {<br>
> struct FormatToken {<br>
> FormatToken()<br>
> : NewlinesBefore(0), HasUnescapedNewline(false), LastNewlineOffset(0),<br>
> - TokenLength(0), IsFirst(false), MustBreakBefore(false),<br>
> - Type(TT_Unknown), SpacesRequiredBefore(0), CanBreakBefore(false),<br>
> - ClosesTemplateDeclaration(false), ParameterCount(0), TotalLength(0),<br>
> - UnbreakableTailLength(0), BindingStrength(0), SplitPenalty(0),<br>
> - LongestObjCSelectorName(0), FakeRParens(0), LastInChainOfCalls(false),<br>
> + ByteCount(0), CodePointCount(0), IsFirst(false),<br>
> + MustBreakBefore(false), Type(TT_Unknown), SpacesRequiredBefore(0),<br>
> + CanBreakBefore(false), ClosesTemplateDeclaration(false),<br>
> + ParameterCount(0), TotalLength(0), UnbreakableTailLength(0),<br>
> + BindingStrength(0), SplitPenalty(0), LongestObjCSelectorName(0),<br>
> + FakeRParens(0), LastInChainOfCalls(false),<br>
> PartOfMultiVariableDeclStmt(false), MatchingParen(NULL), Previous(NULL),<br>
> Next(NULL) {}<br>
><br>
> @@ -89,10 +90,14 @@ struct FormatToken {<br>
> /// whitespace (relative to \c WhiteSpaceStart). 0 if there is no '\n'.<br>
> unsigned LastNewlineOffset;<br>
><br>
> - /// \brief The length of the non-whitespace parts of the token. This is<br>
> - /// necessary because we need to handle escaped newlines that are stored<br>
> + /// \brief The number of bytes of the non-whitespace parts of the token. This<br>
> + /// is necessary because we need to handle escaped newlines that are stored<br>
> /// with the token.<br>
> - unsigned TokenLength;<br>
> + unsigned ByteCount;<br>
> +<br>
> + /// \brief The length of the non-whitespace parts of the token in CodePoints.<br>
> + /// We need this to correctly measure number of columns a token spans.<br>
> + unsigned CodePointCount;<br>
><br>
> /// \brief Indicates that this is the first token.<br>
> bool IsFirst;<br>
><br>
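To illustrate the ByteCount / CodePointCount split documented above: in valid UTF-8, every byte that is not a continuation byte (10xxxxxx) starts a code point, so the two counts differ by the number of continuation bytes. This is a rough sketch under that assumption. countUTF8CodePoints is a hypothetical stand-in, not the encoding::getCodePointCount from the patch, and it treats one code point as one column, as the doc comment does.<br>

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for a code point counter; assumes Text is valid UTF-8.
// Every byte that is not a continuation byte (10xxxxxx) begins a code point.
std::size_t countUTF8CodePoints(const std::string &Text) {
  std::size_t Count = 0;
  for (unsigned char C : Text)
    if ((C & 0xC0) != 0x80)
      ++Count;
  return Count;
}
```

For the token "\"\xd0\xbf\xd0\xbe\xd1\x80\xd1\x83...\"" (the quoted word "пору..." in UTF-8), size() is 13 bytes while the sketch counts 9 code points, so the "extra bytes" term subtracted from ByteCount would be 4.<br>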
> Modified: cfe/trunk/lib/Format/TokenAnnotator.cpp<br>
> URL: <a href="http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/Format/TokenAnnotator.cpp?rev=183312&r1=183311&r2=183312&view=diff" target="_blank">http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/Format/TokenAnnotator.cpp?rev=183312&r1=183311&r2=183312&view=diff</a><br>
> ==============================================================================<br>
> --- cfe/trunk/lib/Format/TokenAnnotator.cpp (original)<br>
> +++ cfe/trunk/lib/Format/TokenAnnotator.cpp Wed Jun 5 09:09:10 2013<br>
> @@ -15,7 +15,6 @@<br>
><br>
> #include "TokenAnnotator.h"<br>
> #include "clang/Basic/SourceManager.h"<br>
> -#include "clang/Lex/Lexer.h"<br>
> #include "llvm/Support/Debug.h"<br>
><br>
> namespace clang {<br>
> @@ -28,10 +27,9 @@ namespace format {<br>
> /// into template parameter lists.<br>
> class AnnotatingParser {<br>
> public:<br>
> - AnnotatingParser(SourceManager &SourceMgr, Lexer &Lex, AnnotatedLine &Line,<br>
> - IdentifierInfo &Ident_in)<br>
> - : SourceMgr(SourceMgr), Lex(Lex), Line(Line), CurrentToken(Line.First),<br>
> - KeywordVirtualFound(false), NameFound(false), Ident_in(Ident_in) {<br>
> + AnnotatingParser(AnnotatedLine &Line, IdentifierInfo &Ident_in)<br>
> + : Line(Line), CurrentToken(Line.First), KeywordVirtualFound(false),<br>
> + NameFound(false), Ident_in(Ident_in) {<br>
> Contexts.push_back(Context(tok::unknown, 1, /*IsExpression=*/ false));<br>
> }<br>
><br>
> @@ -295,9 +293,11 @@ private:<br>
> Line.First->Type == TT_ObjCMethodSpecifier) {<br>
> Tok->Type = TT_ObjCMethodExpr;<br>
> Tok->Previous->Type = TT_ObjCSelectorName;<br>
> - if (Tok->Previous->TokenLength ><br>
> - Contexts.back().LongestObjCSelectorName)<br>
> - Contexts.back().LongestObjCSelectorName = Tok->Previous->TokenLength;<br>
> + if (Tok->Previous->CodePointCount ><br>
> + Contexts.back().LongestObjCSelectorName) {<br>
> + Contexts.back().LongestObjCSelectorName =<br>
> + Tok->Previous->CodePointCount;<br>
> + }<br>
> if (Contexts.back().FirstObjCSelectorName == NULL)<br>
> Contexts.back().FirstObjCSelectorName = Tok->Previous;<br>
> } else if (Contexts.back().ColonIsForRangeExpr) {<br>
> @@ -602,9 +602,7 @@ private:<br>
> } else if (Current.isBinaryOperator()) {<br>
> Current.Type = TT_BinaryOperator;<br>
> } else if (Current.is(tok::comment)) {<br>
> - std::string Data(<br>
> - Lexer::getSpelling(Current.Tok, SourceMgr, Lex.getLangOpts()));<br>
> - if (StringRef(Data).startswith("//"))<br>
> + if (Current.TokenText.startswith("//"))<br>
> Current.Type = TT_LineComment;<br>
> else<br>
> Current.Type = TT_BlockComment;<br>
> @@ -748,23 +746,19 @@ private:<br>
> case tok::kw_wchar_t:<br>
> case tok::kw_bool:<br>
> case tok::kw___underlying_type:<br>
> - return true;<br>
> case tok::annot_typename:<br>
> case tok::kw_char16_t:<br>
> case tok::kw_char32_t:<br>
> case tok::kw_typeof:<br>
> case tok::kw_decltype:<br>
> - return Lex.getLangOpts().CPlusPlus;<br>
> + return true;<br>
> default:<br>
> - break;<br>
> + return false;<br>
> }<br>
> - return false;<br>
> }<br>
><br>
> SmallVector<Context, 8> Contexts;<br>
><br>
> - SourceManager &SourceMgr;<br>
> - Lexer &Lex;<br>
> AnnotatedLine &Line;<br>
> FormatToken *CurrentToken;<br>
> bool KeywordVirtualFound;<br>
> @@ -866,7 +860,7 @@ private:<br>
> };<br>
><br>
> void TokenAnnotator::annotate(AnnotatedLine &Line) {<br>
> - AnnotatingParser Parser(SourceMgr, Lex, Line, Ident_in);<br>
> + AnnotatingParser Parser(Line, Ident_in);<br>
> Line.Type = Parser.parseLine();<br>
> if (Line.Type == LT_Invalid)<br>
> return;<br>
> @@ -886,7 +880,7 @@ void TokenAnnotator::annotate(AnnotatedL<br>
> }<br>
><br>
> void TokenAnnotator::calculateFormattingInformation(AnnotatedLine &Line) {<br>
> - Line.First->TotalLength = Line.First->TokenLength;<br>
> + Line.First->TotalLength = Line.First->CodePointCount;<br>
> if (!Line.First->Next)<br>
> return;<br>
> FormatToken *Current = Line.First->Next;<br>
> @@ -920,7 +914,7 @@ void TokenAnnotator::calculateFormatting<br>
> Current->TotalLength = Current->Previous->TotalLength + Style.ColumnLimit;<br>
> else<br>
> Current->TotalLength =<br>
> - Current->Previous->TotalLength + Current->TokenLength +<br>
> + Current->Previous->TotalLength + Current->CodePointCount +<br>
> Current->SpacesRequiredBefore;<br>
> // FIXME: Only calculate this if CanBreakBefore is true once static<br>
> // initializers etc. are sorted out.<br>
> @@ -947,7 +941,7 @@ void TokenAnnotator::calculateUnbreakabl<br>
> UnbreakableTailLength = 0;<br>
> } else {<br>
> UnbreakableTailLength +=<br>
> - Current->TokenLength + Current->SpacesRequiredBefore;<br>
> + Current->CodePointCount + Current->SpacesRequiredBefore;<br>
> }<br>
> Current = Current->Previous;<br>
> }<br>
> @@ -1015,8 +1009,7 @@ unsigned TokenAnnotator::splitPenalty(co<br>
><br>
> if (Right.is(tok::lessless)) {<br>
> if (Left.is(tok::string_literal)) {<br>
> - StringRef Content =<br>
> - StringRef(Left.Tok.getLiteralData(), Left.TokenLength);<br>
> + StringRef Content = Left.TokenText;<br>
> Content = Content.drop_back(1).drop_front(1).trim();<br>
> if (Content.size() > 1 &&<br>
> (Content.back() == ':' || Content.back() == '='))<br>
><br>
> Modified: cfe/trunk/lib/Format/TokenAnnotator.h<br>
> URL: <a href="http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/Format/TokenAnnotator.h?rev=183312&r1=183311&r2=183312&view=diff" target="_blank">http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/Format/TokenAnnotator.h?rev=183312&r1=183311&r2=183312&view=diff</a><br>
> ==============================================================================<br>
> --- cfe/trunk/lib/Format/TokenAnnotator.h (original)<br>
> +++ cfe/trunk/lib/Format/TokenAnnotator.h Wed Jun 5 09:09:10 2013<br>
> @@ -21,7 +21,6 @@<br>
> #include <string><br>
><br>
> namespace clang {<br>
> -class Lexer;<br>
> class SourceManager;<br>
><br>
> namespace format {<br>
> @@ -71,10 +70,8 @@ public:<br>
> /// \c UnwrappedLine.<br>
> class TokenAnnotator {<br>
> public:<br>
> - TokenAnnotator(const FormatStyle &Style, SourceManager &SourceMgr, Lexer &Lex,<br>
> - IdentifierInfo &Ident_in)<br>
> - : Style(Style), SourceMgr(SourceMgr), Lex(Lex), Ident_in(Ident_in) {<br>
> - }<br>
> + TokenAnnotator(const FormatStyle &Style, IdentifierInfo &Ident_in)<br>
> + : Style(Style), Ident_in(Ident_in) {}<br>
><br>
> void annotate(AnnotatedLine &Line);<br>
> void calculateFormattingInformation(AnnotatedLine &Line);<br>
> @@ -95,8 +92,6 @@ private:<br>
> void calculateUnbreakableTailLengths(AnnotatedLine &Line);<br>
><br>
> const FormatStyle &Style;<br>
> - SourceManager &SourceMgr;<br>
> - Lexer &Lex;<br>
><br>
> // Contextual keywords:<br>
> IdentifierInfo &Ident_in;<br>
><br>
> Modified: cfe/trunk/unittests/Format/FormatTest.cpp<br>
> URL: <a href="http://llvm.org/viewvc/llvm-project/cfe/trunk/unittests/Format/FormatTest.cpp?rev=183312&r1=183311&r2=183312&view=diff" target="_blank">http://llvm.org/viewvc/llvm-project/cfe/trunk/unittests/Format/FormatTest.cpp?rev=183312&r1=183311&r2=183312&view=diff</a><br>
> ==============================================================================<br>
> --- cfe/trunk/unittests/Format/FormatTest.cpp (original)<br>
> +++ cfe/trunk/unittests/Format/FormatTest.cpp Wed Jun 5 09:09:10 2013<br>
> @@ -4873,5 +4873,80 @@ TEST_F(FormatTest, ConfigurationRoundTri<br>
> EXPECT_EQ(Style, ParsedStyle);<br>
> }<br>
><br>
> +TEST_F(FormatTest, WorksFor8bitEncodings) {<br>
> + EXPECT_EQ("\"\xce\xe4\xed\xe0\xe6\xe4\xfb \xe2 \"\n"<br>
> + "\"\xf1\xf2\xf3\xe4\xb8\xed\xf3\xfe \"\n"<br>
> + "\"\xe7\xe8\xec\xed\xfe\xfe \"\n"<br>
> + "\"\xef\xee\xf0\xf3...\"",<br>
> + format("\"\xce\xe4\xed\xe0\xe6\xe4\xfb \xe2 "<br>
> + "\xf1\xf2\xf3\xe4\xb8\xed\xf3\xfe \xe7\xe8\xec\xed\xfe\xfe "<br>
> + "\xef\xee\xf0\xf3...\"",<br>
> + getLLVMStyleWithColumns(12)));<br>
> +}<br>
> +<br>
> +TEST_F(FormatTest, CountsUTF8CharactersProperly) {<br>
> + verifyFormat("\"Однажды в студёную зимнюю пору...\"",<br>
> + getLLVMStyleWithColumns(35));<br>
> + verifyFormat("\"一 二 三 四 五 六 七 八 九 十\"",<br>
> + getLLVMStyleWithColumns(21));<br>
> + verifyFormat("// Однажды в студёную зимнюю пору...",<br>
> + getLLVMStyleWithColumns(36));<br>
> + verifyFormat("// 一 二 三 四 五 六 七 八 九 十",<br>
> + getLLVMStyleWithColumns(22));<br>
> + verifyFormat("/* Однажды в студёную зимнюю пору... */",<br>
> + getLLVMStyleWithColumns(39));<br>
> + verifyFormat("/* 一 二 三 四 五 六 七 八 九 十 */",<br>
> + getLLVMStyleWithColumns(25));<br>
> +}<br>
> +<br>
> +TEST_F(FormatTest, SplitsUTF8Strings) {<br>
> + EXPECT_EQ(<br>
> + "\"Однажды, в \"\n"<br>
> + "\"студёную \"\n"<br>
> + "\"зимнюю \"\n"<br>
> + "\"пору,\"",<br>
> + format("\"Однажды, в студёную зимнюю пору,\"",<br>
> + getLLVMStyleWithColumns(13)));<br>
> + EXPECT_EQ("\"一 二 三 四 \"\n"<br>
> + "\"五 六 七 八 \"\n"<br>
> + "\"九 十\"",<br>
> + format("\"一 二 三 四 五 六 七 八 九 十\"",<br>
> + getLLVMStyleWithColumns(10)));<br>
> +}<br>
> +<br>
> +TEST_F(FormatTest, SplitsUTF8LineComments) {<br>
> + EXPECT_EQ("// Я из лесу\n"<br>
> + "// вышел; был\n"<br>
> + "// сильный\n"<br>
> + "// мороз.",<br>
> + format("// Я из лесу вышел; был сильный мороз.",<br>
> + getLLVMStyleWithColumns(13)));<br>
> + EXPECT_EQ("// 一二三\n"<br>
> + "// 四五六七\n"<br>
> + "// 八\n"<br>
> + "// 九 十",<br>
> + format("// 一二三 四五六七 八 九 十", getLLVMStyleWithColumns(6)));<br>
> +}<br>
> +<br>
> +TEST_F(FormatTest, SplitsUTF8BlockComments) {<br>
> + EXPECT_EQ("/* Гляжу,\n"<br>
> + " * поднимается\n"<br>
> + " * медленно в\n"<br>
> + " * гору\n"<br>
> + " * Лошадка,\n"<br>
> + " * везущая\n"<br>
> + " * хворосту\n"<br>
> + " * воз. */",<br>
> + format("/* Гляжу, поднимается медленно в гору\n"<br>
> + " * Лошадка, везущая хворосту воз. */",<br>
> + getLLVMStyleWithColumns(13)));<br>
> + EXPECT_EQ("/* 一二三\n"<br>
> + " * 四五六七\n"<br>
> + " * 八\n"<br>
> + " * 九 十\n"<br>
> + " */",<br>
> + format("/* 一二三 四五六七 八 九 十 */", getLLVMStyleWithColumns(6)));<br>
> +}<br>
> +<br>
> } // end namespace tooling<br>
> } // end namespace clang<br>
><br>
><br>
> _______________________________________________<br>
> cfe-commits mailing list<br>
> <a href="mailto:cfe-commits@cs.uiuc.edu" target="_blank">cfe-commits@cs.uiuc.edu</a><br>
> <a href="http://lists.cs.uiuc.edu/mailman/listinfo/cfe-commits" target="_blank">http://lists.cs.uiuc.edu/mailman/listinfo/cfe-commits</a><br>
</div></div></blockquote></div></div></div><br>
</div></div>
<br></blockquote></div></div></div><br></div></div>
</blockquote></div></div><br>
</div></div>
<br></blockquote></div><br></div></div>