r173850 - [Doc parsing] Patch to parse Doxygen-supported HTML character

Tue Jan 29 16:01:44 PST 2013

Hi Fariborz,

On Wed, Jan 30, 2013 at 1:42 AM, Fariborz Jahanian <fjahanian at apple.com> wrote:
> Author: fjahanian
> Date: Tue Jan 29 17:42:26 2013
> New Revision: 173850
>
> URL: http://llvm.org/viewvc/llvm-project?rev=173850&view=rev
> Log:
> [Doc parsing] Patch to parse Doxygen-supported HTML character
> references to their UTF-8 encoding. Reviewed offline by Doug.
> // rdar://12392215
>
> Added:
>     cfe/trunk/test/Index/special-html-characters.m
> Modified:
>     cfe/trunk/include/clang/AST/CommentLexer.h
>     cfe/trunk/lib/AST/CommentLexer.cpp
>
> Modified: cfe/trunk/include/clang/AST/CommentLexer.h
> URL: http://llvm.org/viewvc/llvm-project/cfe/trunk/include/clang/AST/CommentLexer.h?rev=173850&r1=173849&r2=173850&view=diff
> ==============================================================================
> --- cfe/trunk/include/clang/AST/CommentLexer.h (original)
> +++ cfe/trunk/include/clang/AST/CommentLexer.h Tue Jan 29 17:42:26 2013
> @@ -282,11 +282,18 @@ private:
>    /// it stands for (e.g., "<").
>    StringRef resolveHTMLNamedCharacterReference(StringRef Name) const;
>
> +  /// Given a Doxygen-supported named character reference (e.g., "&trade;"),
> +  /// it returns its UTF8 encoding.
> +  StringRef HTMLDoxygenCharacterReference(StringRef Name) const;
> +
>    /// Given a Unicode codepoint as base-10 integer, return the character.
>    StringRef resolveHTMLDecimalCharacterReference(StringRef Name) const;
>
>    /// Given a Unicode codepoint as base-16 integer, return the character.
>    StringRef resolveHTMLHexCharacterReference(StringRef Name) const;
> +
> +  /// Helper routine to do part of the work for resolveHTMLHexCharacterReference.
> +  StringRef helperResolveHTMLHexCharacterReference(unsigned CodePoint) const;
>
>    void formTokenWithChars(Token &Result, const char *TokEnd,
>                            tok::TokenKind Kind) {
>
> Modified: cfe/trunk/lib/AST/CommentLexer.cpp
> URL: http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/AST/CommentLexer.cpp?rev=173850&r1=173849&r2=173850&view=diff
> ==============================================================================
> --- cfe/trunk/lib/AST/CommentLexer.cpp (original)
> +++ cfe/trunk/lib/AST/CommentLexer.cpp Tue Jan 29 17:42:26 2013
> @@ -34,6 +34,31 @@ bool isHTMLHexCharacterReferenceCharacte
>
>  } // unnamed namespace
>
> +static unsigned getCodePoint(StringRef Name) {
> +  unsigned CodePoint = 0;
> +  for (unsigned i = 0, e = Name.size(); i != e; ++i) {
> +    CodePoint *= 16;
> +    const char C = Name[i];
> +    assert(isHTMLHexCharacterReferenceCharacter(C));
> +    CodePoint += llvm::hexDigitValue(C);
> +  }
> +  return CodePoint;
> +}
> +
> +StringRef Lexer::helperResolveHTMLHexCharacterReference(unsigned CodePoint) const {
> +  char *Resolved = Allocator.Allocate<char>(UNI_MAX_UTF8_BYTES_PER_CODE_POINT);
> +  char *ResolvedPtr = Resolved;
> +  if (ConvertCodePointToUTF8(CodePoint, ResolvedPtr))
> +    return StringRef(Resolved, ResolvedPtr - Resolved);
> +  else
> +    return StringRef();
> +}
> +
> +StringRef Lexer::resolveHTMLHexCharacterReference(StringRef Name) const {
> +  unsigned CodePoint = getCodePoint(Name);
> +  return helperResolveHTMLHexCharacterReference(CodePoint);
> +}
> +
>  StringRef Lexer::resolveHTMLNamedCharacterReference(StringRef Name) const {
>    return llvm::StringSwitch<StringRef>(Name)
>        .Case("amp", "&")
> @@ -41,8 +66,154 @@ StringRef Lexer::resolveHTMLNamedCharact
>        .Case("gt", ">")
>        .Case("quot", "\"")
>        .Case("apos", "\'")
> +      .Case("minus", "-")
> +      .Case("sim", "~")

Sorry, but this is wrong: &sim; is U+223C (TILDE OPERATOR) and &minus; is
U+2212 (MINUS SIGN), not ASCII "~" and "-".
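
To make the difference concrete, here is a minimal standalone sketch of what
those two references should resolve to. It assumes nothing from Clang:
encodeUTF8 is a hypothetical stand-in for the ConvertCodePointToUTF8 call
used in the patch, written out by hand so the example compiles on its own.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper (not part of Clang): encode one Unicode code point as
// UTF-8. Returns an empty string for surrogates and values above U+10FFFF.
std::string encodeUTF8(std::uint32_t CodePoint) {
  std::string Out;
  if (CodePoint < 0x80) {
    // One byte: plain ASCII.
    Out += static_cast<char>(CodePoint);
  } else if (CodePoint < 0x800) {
    // Two bytes: 110xxxxx 10xxxxxx.
    Out += static_cast<char>(0xC0 | (CodePoint >> 6));
    Out += static_cast<char>(0x80 | (CodePoint & 0x3F));
  } else if (CodePoint < 0x10000) {
    if (CodePoint >= 0xD800 && CodePoint <= 0xDFFF)
      return Out; // Surrogates cannot be encoded.
    // Three bytes: 1110xxxx 10xxxxxx 10xxxxxx.
    Out += static_cast<char>(0xE0 | (CodePoint >> 12));
    Out += static_cast<char>(0x80 | ((CodePoint >> 6) & 0x3F));
    Out += static_cast<char>(0x80 | (CodePoint & 0x3F));
  } else if (CodePoint <= 0x10FFFF) {
    // Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx.
    Out += static_cast<char>(0xF0 | (CodePoint >> 18));
    Out += static_cast<char>(0x80 | ((CodePoint >> 12) & 0x3F));
    Out += static_cast<char>(0x80 | ((CodePoint >> 6) & 0x3F));
    Out += static_cast<char>(0x80 | (CodePoint & 0x3F));
  }
  return Out;
}
```

With this helper, encodeUTF8(0x223C) yields the three bytes E2 88 BC and
encodeUTF8(0x2212) yields E2 88 92, neither of which is the single ASCII
byte the patch returns.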

>        .Default("");
>  }
> +
> +StringRef Lexer::HTMLDoxygenCharacterReference(StringRef Name) const {
> +  return llvm::StringSwitch<StringRef>(Name)
> +  .Case("copy", helperResolveHTMLHexCharacterReference(0x000A9))
> +  .Case("trade",        helperResolveHTMLHexCharacterReference(0x02122))
> +  .Case("reg",  helperResolveHTMLHexCharacterReference(0x000AE))

...

Is this based on the subset described in
http://www.stack.nl/~dimitri/doxygen/manual/htmlcmds.html ?

I think we can do better than this:

(1) StringSwitch is a linear search over all the cases, which is not great;
(2) allocating the UTF-8 string on every lookup is not great either.

This needs some tablegen magic -- will try to hack up something tomorrow.
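
As a rough illustration of the direction, and not the tablegen-generated
code itself, the two concerns above could be addressed with a sorted static
table of pre-encoded UTF-8 strings searched by binary search. Everything
below is a hypothetical sketch: NamedCharRef, NamedCharRefs and
lookupNamedCharRef are made-up names, and the table holds only a handful of
the entities mentioned in this thread.

```cpp
#include <algorithm>
#include <cstring>
#include <iterator>

// Hypothetical table entry: entity name and its UTF-8 encoding, both stored
// as string literals so a lookup never allocates.
struct NamedCharRef {
  const char *Name;
  const char *UTF8;
};

// Must stay sorted by Name; a generator would emit it in sorted order.
static const NamedCharRef NamedCharRefs[] = {
    {"amp", "&"},
    {"apos", "'"},
    {"copy", "\xC2\xA9"},      // U+00A9
    {"gt", ">"},
    {"lt", "<"},
    {"minus", "\xE2\x88\x92"}, // U+2212
    {"quot", "\""},
    {"reg", "\xC2\xAE"},       // U+00AE
    {"sim", "\xE2\x88\xBC"},   // U+223C
    {"trade", "\xE2\x84\xA2"}, // U+2122
};

// Binary search over the table: O(log n) string comparisons, no allocation.
// Returns "" for an unknown name, mirroring the .Default("") in the patch.
const char *lookupNamedCharRef(const char *Name) {
  const NamedCharRef *End = std::end(NamedCharRefs);
  const NamedCharRef *It = std::lower_bound(
      std::begin(NamedCharRefs), End, Name,
      [](const NamedCharRef &Entry, const char *Key) {
        return std::strcmp(Entry.Name, Key) < 0;
      });
  if (It != End && std::strcmp(It->Name, Name) == 0)
    return It->UTF8;
  return "";
}
```

For example, lookupNamedCharRef("trade") returns a pointer to the static
bytes E2 84 A2, and an unknown name returns the empty string.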

Dmitri

-- 
main(i,j){for(i=2;;i++){for(j=2;j<i;j++){if(!(i%j)){j=0;break;}}if
(j){printf("%d\n",i);}}} /*Dmitri Gribenko <gribozavr at gmail.com>*/