r173850 - [Doc parsing] Patch to parse Doxygen-supported HTML character

Tue Jan 29 16:06:08 PST 2013

On Jan 29, 2013, at 4:01 PM, Dmitri Gribenko <gribozavr at gmail.com> wrote:

> Hi Fariborz,
> 
> On Wed, Jan 30, 2013 at 1:42 AM, Fariborz Jahanian <fjahanian at apple.com> wrote:
>> Author: fjahanian
>> Date: Tue Jan 29 17:42:26 2013
>> New Revision: 173850
>> 
>> URL: http://llvm.org/viewvc/llvm-project?rev=173850&view=rev
>> Log:
>> [Doc parsing] Patch to parse Doxygen-supported HTML character
>> references to their UTF-8 encoding. Reviewed offline by Doug.
>> // rdar://12392215
>> 
>> Added:
>>    cfe/trunk/test/Index/special-html-characters.m
>> Modified:
>>    cfe/trunk/include/clang/AST/CommentLexer.h
>>    cfe/trunk/lib/AST/CommentLexer.cpp
>> 
>> Modified: cfe/trunk/include/clang/AST/CommentLexer.h
>> URL: http://llvm.org/viewvc/llvm-project/cfe/trunk/include/clang/AST/CommentLexer.h?rev=173850&r1=173849&r2=173850&view=diff
>> ==============================================================================
>> --- cfe/trunk/include/clang/AST/CommentLexer.h (original)
>> +++ cfe/trunk/include/clang/AST/CommentLexer.h Tue Jan 29 17:42:26 2013
>> @@ -282,11 +282,18 @@ private:
>>   /// it stands for (e.g., "<").
>>   StringRef resolveHTMLNamedCharacterReference(StringRef Name) const;
>> 
>> +  /// Given a Doxygen-supported named character reference (e.g., "&trade;"),
>> +  /// it returns its UTF8 encoding.
>> +  StringRef HTMLDoxygenCharacterReference(StringRef Name) const;
>> +
>>   /// Given a Unicode codepoint as base-10 integer, return the character.
>>   StringRef resolveHTMLDecimalCharacterReference(StringRef Name) const;
>> 
>>   /// Given a Unicode codepoint as base-16 integer, return the character.
>>   StringRef resolveHTMLHexCharacterReference(StringRef Name) const;
>> +
>> +  /// Helper routine to do part of the work for resolveHTMLHexCharacterReference.
>> +  StringRef helperResolveHTMLHexCharacterReference(unsigned CodePoint) const;
>> 
>>   void formTokenWithChars(Token &Result, const char *TokEnd,
>>                           tok::TokenKind Kind) {
>> 
>> Modified: cfe/trunk/lib/AST/CommentLexer.cpp
>> URL: http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/AST/CommentLexer.cpp?rev=173850&r1=173849&r2=173850&view=diff
>> ==============================================================================
>> --- cfe/trunk/lib/AST/CommentLexer.cpp (original)
>> +++ cfe/trunk/lib/AST/CommentLexer.cpp Tue Jan 29 17:42:26 2013
>> @@ -34,6 +34,31 @@ bool isHTMLHexCharacterReferenceCharacte
>> 
>> } // unnamed namespace
>> 
>> +static unsigned getCodePoint(StringRef Name) {
>> +  unsigned CodePoint = 0;
>> +  for (unsigned i = 0, e = Name.size(); i != e; ++i) {
>> +    CodePoint *= 16;
>> +    const char C = Name[i];
>> +    assert(isHTMLHexCharacterReferenceCharacter(C));
>> +    CodePoint += llvm::hexDigitValue(C);
>> +  }
>> +  return CodePoint;
>> +}
>> +
>> +StringRef Lexer::helperResolveHTMLHexCharacterReference(unsigned CodePoint) const {
>> +  char *Resolved = Allocator.Allocate<char>(UNI_MAX_UTF8_BYTES_PER_CODE_POINT);
>> +  char *ResolvedPtr = Resolved;
>> +  if (ConvertCodePointToUTF8(CodePoint, ResolvedPtr))
>> +    return StringRef(Resolved, ResolvedPtr - Resolved);
>> +  else
>> +    return StringRef();
>> +}
>> +
>> +StringRef Lexer::resolveHTMLHexCharacterReference(StringRef Name) const {
>> +  unsigned CodePoint = getCodePoint(Name);
>> +  return helperResolveHTMLHexCharacterReference(CodePoint);
>> +}
>> +
>> StringRef Lexer::resolveHTMLNamedCharacterReference(StringRef Name) const {
>>   return llvm::StringSwitch<StringRef>(Name)
>>       .Case("amp", "&")
>> @@ -41,8 +66,154 @@ StringRef Lexer::resolveHTMLNamedCharact
>>       .Case("gt", ">")
>>       .Case("quot", "\"")
>>       .Case("apos", "\'")
>> +      .Case("minus", "-")
>> +      .Case("sim", "~")
> 
> Sorry, but this is wrong: sim is U+223C, minus is U+2212.

Old code of mine. Not needed here. Will remove shortly.
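
To make the objection concrete: U+2212 and U+223C are not ASCII '-' and '~', and each needs three bytes in UTF-8. The following is a minimal standalone sketch, not Clang code; encodeUTF8 is a hypothetical helper written for illustration and stands in for what ConvertCodePointToUTF8 does in the patch.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper (not part of Clang): encode one Unicode code point
// as UTF-8. Returns an empty string for surrogates and for values above
// U+10FFFF, which have no valid UTF-8 encoding.
std::string encodeUTF8(std::uint32_t CodePoint) {
  std::string Out;
  if (CodePoint <= 0x7F) {
    // One byte: 0xxxxxxx
    Out += static_cast<char>(CodePoint);
  } else if (CodePoint <= 0x7FF) {
    // Two bytes: 110xxxxx 10xxxxxx
    Out += static_cast<char>(0xC0 | (CodePoint >> 6));
    Out += static_cast<char>(0x80 | (CodePoint & 0x3F));
  } else if (CodePoint <= 0xFFFF) {
    if (CodePoint >= 0xD800 && CodePoint <= 0xDFFF)
      return Out; // surrogate halves are not encodable
    // Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
    Out += static_cast<char>(0xE0 | (CodePoint >> 12));
    Out += static_cast<char>(0x80 | ((CodePoint >> 6) & 0x3F));
    Out += static_cast<char>(0x80 | (CodePoint & 0x3F));
  } else if (CodePoint <= 0x10FFFF) {
    // Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    Out += static_cast<char>(0xF0 | (CodePoint >> 18));
    Out += static_cast<char>(0x80 | ((CodePoint >> 12) & 0x3F));
    Out += static_cast<char>(0x80 | ((CodePoint >> 6) & 0x3F));
    Out += static_cast<char>(0x80 | (CodePoint & 0x3F));
  }
  return Out;
}
```

With this helper, encodeUTF8(0x2212) yields the bytes E2 88 92 and encodeUTF8(0x223C) yields E2 88 BC, so neither result compares equal to "-" or "~".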

> 
>>       .Default("");
>> }
>> +
>> +StringRef Lexer::HTMLDoxygenCharacterReference(StringRef Name) const {
>> +  return llvm::StringSwitch<StringRef>(Name)
>> +  .Case("copy", helperResolveHTMLHexCharacterReference(0x000A9))
>> +  .Case("trade",        helperResolveHTMLHexCharacterReference(0x02122))
>> +  .Case("reg",  helperResolveHTMLHexCharacterReference(0x000AE))
> 
> ...
> 
> Is this based on the subset described in
> http://www.stack.nl/~dimitri/doxygen/manual/htmlcmds.html ?
Yes.

> 
> I think we can do better than this:
> 
> (1) linear search is not great;
> (2) allocation is not great either.
> 
> This needs some tablegen magic -- will try to hack up something tomorrow.

Great. Thanks.
- Fariborz
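
For illustration, points (1) and (2) could be addressed roughly as follows. This is a hand-written standalone C++17 sketch, not the tablegen-generated solution proposed above and not Clang code. The names NamedCharRef, kNamedCharRefs, and lookupNamedCharRef are hypothetical, and the table holds only a few entries. Each entry stores its UTF-8 bytes as a string literal, so a lookup allocates nothing, and the table is sorted by name so the lookup can use binary search.

```cpp
#include <algorithm>
#include <iterator>
#include <string_view>

// Hypothetical table entry: a character reference name and its
// precomputed UTF-8 encoding.
struct NamedCharRef {
  std::string_view Name;
  std::string_view UTF8;
};

// Illustrative subset only. Entries MUST stay sorted by Name, because
// the lookup below relies on binary search.
constexpr NamedCharRef kNamedCharRefs[] = {
    {"copy", "\xC2\xA9"},      // U+00A9
    {"minus", "\xE2\x88\x92"}, // U+2212
    {"reg", "\xC2\xAE"},       // U+00AE
    {"sim", "\xE2\x88\xBC"},   // U+223C
    {"trade", "\xE2\x84\xA2"}, // U+2122
};

// Returns the UTF-8 encoding for Name, or an empty view if Name is not
// in the table. The returned view points into static storage.
std::string_view lookupNamedCharRef(std::string_view Name) {
  const NamedCharRef *End = std::end(kNamedCharRefs);
  const NamedCharRef *It = std::lower_bound(
      std::begin(kNamedCharRefs), End, Name,
      [](const NamedCharRef &Entry, std::string_view Key) {
        return Entry.Name < Key;
      });
  if (It != End && It->Name == Name)
    return It->UTF8;
  return {};
}
```

A generated table would remove the need to keep the entries sorted by hand, which is the main weakness of this version.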

> 
> Dmitri
> 
> -- 
> main(i,j){for(i=2;;i++){for(j=2;j<i;j++){if(!(i%j)){j=0;break;}}if
> (j){printf("%d\n",i);}}} /*Dmitri Gribenko <gribozavr at gmail.com>*/