r173850 - [Doc parsing] Patch to parse Doxygen-supported HTML character
Dmitri Gribenko
gribozavr at gmail.com
Tue Jan 29 16:01:44 PST 2013
Hi Fariborz,
On Wed, Jan 30, 2013 at 1:42 AM, Fariborz Jahanian <fjahanian at apple.com> wrote:
> Author: fjahanian
> Date: Tue Jan 29 17:42:26 2013
> New Revision: 173850
>
> URL: http://llvm.org/viewvc/llvm-project?rev=173850&view=rev
> Log:
> [Doc parsing] Patch to parse Doxygen-supported HTML character
> references to their UTIF-8 encoding. Reviewed offline by Doug.
> // rdar://12392215
>
> Added:
> cfe/trunk/test/Index/special-html-characters.m
> Modified:
> cfe/trunk/include/clang/AST/CommentLexer.h
> cfe/trunk/lib/AST/CommentLexer.cpp
>
> Modified: cfe/trunk/include/clang/AST/CommentLexer.h
> URL: http://llvm.org/viewvc/llvm-project/cfe/trunk/include/clang/AST/CommentLexer.h?rev=173850&r1=173849&r2=173850&view=diff
> ==============================================================================
> --- cfe/trunk/include/clang/AST/CommentLexer.h (original)
> +++ cfe/trunk/include/clang/AST/CommentLexer.h Tue Jan 29 17:42:26 2013
> @@ -282,11 +282,18 @@ private:
> /// it stands for (e.g., "<").
> StringRef resolveHTMLNamedCharacterReference(StringRef Name) const;
>
> + /// Given a Doxygen-supported named character reference (e.g., "&trade;"),
> + /// it returns its UTF8 encoding.
> + StringRef HTMLDoxygenCharacterReference(StringRef Name) const;
> +
> /// Given a Unicode codepoint as base-10 integer, return the character.
> StringRef resolveHTMLDecimalCharacterReference(StringRef Name) const;
>
> /// Given a Unicode codepoint as base-16 integer, return the character.
> StringRef resolveHTMLHexCharacterReference(StringRef Name) const;
> +
> + /// Helper routine to do part of the work for resolveHTMLHexCharacterReference.
> + StringRef helperResolveHTMLHexCharacterReference(unsigned CodePoint) const;
>
> void formTokenWithChars(Token &Result, const char *TokEnd,
> tok::TokenKind Kind) {
>
> Modified: cfe/trunk/lib/AST/CommentLexer.cpp
> URL: http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/AST/CommentLexer.cpp?rev=173850&r1=173849&r2=173850&view=diff
> ==============================================================================
> --- cfe/trunk/lib/AST/CommentLexer.cpp (original)
> +++ cfe/trunk/lib/AST/CommentLexer.cpp Tue Jan 29 17:42:26 2013
> @@ -34,6 +34,31 @@ bool isHTMLHexCharacterReferenceCharacte
>
> } // unnamed namespace
>
> +static unsigned getCodePoint(StringRef Name) {
> + unsigned CodePoint = 0;
> + for (unsigned i = 0, e = Name.size(); i != e; ++i) {
> + CodePoint *= 16;
> + const char C = Name[i];
> + assert(isHTMLHexCharacterReferenceCharacter(C));
> + CodePoint += llvm::hexDigitValue(C);
> + }
> + return CodePoint;
> +}
> +
> +StringRef Lexer::helperResolveHTMLHexCharacterReference(unsigned CodePoint) const {
> + char *Resolved = Allocator.Allocate<char>(UNI_MAX_UTF8_BYTES_PER_CODE_POINT);
> + char *ResolvedPtr = Resolved;
> + if (ConvertCodePointToUTF8(CodePoint, ResolvedPtr))
> + return StringRef(Resolved, ResolvedPtr - Resolved);
> + else
> + return StringRef();
> +}
> +
> +StringRef Lexer::resolveHTMLHexCharacterReference(StringRef Name) const {
> + unsigned CodePoint = getCodePoint(Name);
> + return helperResolveHTMLHexCharacterReference(CodePoint);
> +}
> +
> StringRef Lexer::resolveHTMLNamedCharacterReference(StringRef Name) const {
> return llvm::StringSwitch<StringRef>(Name)
> .Case("amp", "&")
> @@ -41,8 +66,154 @@ StringRef Lexer::resolveHTMLNamedCharact
> .Case("gt", ">")
> .Case("quot", "\"")
> .Case("apos", "\'")
> + .Case("minus", "-")
> + .Case("sim", "~")
Sorry, but this is wrong: sim is U+223C, minus is U+2212.
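To make the difference concrete, here is a minimal standalone sketch (not the code from the patch; encodeUTF8 and checkMinusAndSim are hypothetical names, and the encoder is a plain textbook one rather than ConvertCodePointToUTF8). It shows that U+2212 and U+223C each encode to three UTF-8 bytes, so neither resolves to the ASCII '-' or '~' used in the patch:

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8.
// Returns an empty string for surrogates and for values above U+10FFFF.
std::string encodeUTF8(std::uint32_t CP) {
  std::string Out;
  if (CP < 0x80) {
    Out += static_cast<char>(CP);
  } else if (CP < 0x800) {
    Out += static_cast<char>(0xC0 | (CP >> 6));
    Out += static_cast<char>(0x80 | (CP & 0x3F));
  } else if (CP < 0x10000) {
    if (CP >= 0xD800 && CP <= 0xDFFF)
      return Out; // surrogate halves are not valid code points
    Out += static_cast<char>(0xE0 | (CP >> 12));
    Out += static_cast<char>(0x80 | ((CP >> 6) & 0x3F));
    Out += static_cast<char>(0x80 | (CP & 0x3F));
  } else if (CP <= 0x10FFFF) {
    Out += static_cast<char>(0xF0 | (CP >> 18));
    Out += static_cast<char>(0x80 | ((CP >> 12) & 0x3F));
    Out += static_cast<char>(0x80 | ((CP >> 6) & 0x3F));
    Out += static_cast<char>(0x80 | (CP & 0x3F));
  }
  return Out;
}

// The named references and the ASCII look-alikes are different characters.
void checkMinusAndSim() {
  assert(encodeUTF8(0x2212) == "\xE2\x88\x92"); // &minus;
  assert(encodeUTF8(0x223C) == "\xE2\x88\xBC"); // &sim;
  assert(encodeUTF8(0x002D) == "-");            // HYPHEN-MINUS
  assert(encodeUTF8(0x007E) == "~");            // TILDE
}
```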
> .Default("");
> }
> +
> +StringRef Lexer::HTMLDoxygenCharacterReference(StringRef Name) const {
> + return llvm::StringSwitch<StringRef>(Name)
> + .Case("copy", helperResolveHTMLHexCharacterReference(0x000A9))
> + .Case("trade", helperResolveHTMLHexCharacterReference(0x02122))
> + .Case("reg", helperResolveHTMLHexCharacterReference(0x000AE))
...
Is this based on the subset described in
http://www.stack.nl/~dimitri/doxygen/manual/htmlcmds.html ?
I think we can do better than this:
(1) linear search is not great;
(2) allocation is not great either.
This needs some tablegen magic -- will try to hack up something tomorrow.
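
As a rough illustration of points (1) and (2), and only as a sketch: the table below is written by hand rather than generated by tablegen, covers just a few names, and uses hypothetical identifiers (NamedCharRef, NamedCharRefs, lookupNamedCharRef) with std::string_view standing in for StringRef. The idea is a sorted static table with pre-encoded UTF-8 literals, so lookup is a binary search and nothing is allocated:

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <string_view>

// Hypothetical table entry: reference name and its UTF-8 encoding.
struct NamedCharRef {
  std::string_view Name;
  std::string_view UTF8;
};

// Must stay sorted by Name, because the lookup below is a binary search.
// The encodings are string literals, so no allocator is involved.
static constexpr NamedCharRef NamedCharRefs[] = {
    {"copy", "\xC2\xA9"},      // U+00A9
    {"minus", "\xE2\x88\x92"}, // U+2212
    {"reg", "\xC2\xAE"},       // U+00AE
    {"sim", "\xE2\x88\xBC"},   // U+223C
    {"trade", "\xE2\x84\xA2"}, // U+2122
};

// Returns the UTF-8 encoding for Name, or an empty view if it is unknown.
std::string_view lookupNamedCharRef(std::string_view Name) {
  const NamedCharRef *End = std::end(NamedCharRefs);
  const NamedCharRef *It = std::lower_bound(
      std::begin(NamedCharRefs), End, Name,
      [](const NamedCharRef &Entry, std::string_view Key) {
        return Entry.Name < Key;
      });
  if (It != End && It->Name == Name)
    return It->UTF8;
  return {};
}
```

With this shape, lookupNamedCharRef("trade") returns the three bytes E2 84 A2 and an unknown name returns an empty view, matching the Default("") behaviour of the StringSwitch version.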
Dmitri
--
main(i,j){for(i=2;;i++){for(j=2;j<i;j++){if(!(i%j)){j=0;break;}}if
(j){printf("%d\n",i);}}} /*Dmitri Gribenko <gribozavr at gmail.com>*/