r173850 - [Doc parsing] Patch to parse Doxygen-supported HTML character
jahanian
fjahanian at apple.com
Tue Jan 29 16:06:08 PST 2013
On Jan 29, 2013, at 4:01 PM, Dmitri Gribenko <gribozavr at gmail.com> wrote:
> Hi Fariborz,
>
> On Wed, Jan 30, 2013 at 1:42 AM, Fariborz Jahanian <fjahanian at apple.com> wrote:
>> Author: fjahanian
>> Date: Tue Jan 29 17:42:26 2013
>> New Revision: 173850
>>
>> URL: http://llvm.org/viewvc/llvm-project?rev=173850&view=rev
>> Log:
>> [Doc parsing] Patch to parse Doxygen-supported HTML character
>> references to their UTF-8 encoding. Reviewed offline by Doug.
>> // rdar://12392215
>>
>> Added:
>> cfe/trunk/test/Index/special-html-characters.m
>> Modified:
>> cfe/trunk/include/clang/AST/CommentLexer.h
>> cfe/trunk/lib/AST/CommentLexer.cpp
>>
>> Modified: cfe/trunk/include/clang/AST/CommentLexer.h
>> URL: http://llvm.org/viewvc/llvm-project/cfe/trunk/include/clang/AST/CommentLexer.h?rev=173850&r1=173849&r2=173850&view=diff
>> ==============================================================================
>> --- cfe/trunk/include/clang/AST/CommentLexer.h (original)
>> +++ cfe/trunk/include/clang/AST/CommentLexer.h Tue Jan 29 17:42:26 2013
>> @@ -282,11 +282,18 @@ private:
>> /// it stands for (e.g., "<").
>> StringRef resolveHTMLNamedCharacterReference(StringRef Name) const;
>>
>> + /// Given a Doxygen-supported named character reference (e.g., "&trade;"),
>> + /// it returns its UTF8 encoding.
>> + StringRef HTMLDoxygenCharacterReference(StringRef Name) const;
>> +
>> /// Given a Unicode codepoint as base-10 integer, return the character.
>> StringRef resolveHTMLDecimalCharacterReference(StringRef Name) const;
>>
>> /// Given a Unicode codepoint as base-16 integer, return the character.
>> StringRef resolveHTMLHexCharacterReference(StringRef Name) const;
>> +
>> + /// Helper routine to do part of the work for resolveHTMLHexCharacterReference.
>> + StringRef helperResolveHTMLHexCharacterReference(unsigned CodePoint) const;
>>
>> void formTokenWithChars(Token &Result, const char *TokEnd,
>> tok::TokenKind Kind) {
>>
>> Modified: cfe/trunk/lib/AST/CommentLexer.cpp
>> URL: http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/AST/CommentLexer.cpp?rev=173850&r1=173849&r2=173850&view=diff
>> ==============================================================================
>> --- cfe/trunk/lib/AST/CommentLexer.cpp (original)
>> +++ cfe/trunk/lib/AST/CommentLexer.cpp Tue Jan 29 17:42:26 2013
>> @@ -34,6 +34,31 @@ bool isHTMLHexCharacterReferenceCharacte
>>
>> } // unnamed namespace
>>
>> +static unsigned getCodePoint(StringRef Name) {
>> + unsigned CodePoint = 0;
>> + for (unsigned i = 0, e = Name.size(); i != e; ++i) {
>> + CodePoint *= 16;
>> + const char C = Name[i];
>> + assert(isHTMLHexCharacterReferenceCharacter(C));
>> + CodePoint += llvm::hexDigitValue(C);
>> + }
>> + return CodePoint;
>> +}
>> +
>> +StringRef Lexer::helperResolveHTMLHexCharacterReference(unsigned CodePoint) const {
>> + char *Resolved = Allocator.Allocate<char>(UNI_MAX_UTF8_BYTES_PER_CODE_POINT);
>> + char *ResolvedPtr = Resolved;
>> + if (ConvertCodePointToUTF8(CodePoint, ResolvedPtr))
>> + return StringRef(Resolved, ResolvedPtr - Resolved);
>> + else
>> + return StringRef();
>> +}
>> +
>> +StringRef Lexer::resolveHTMLHexCharacterReference(StringRef Name) const {
>> + unsigned CodePoint = getCodePoint(Name);
>> + return helperResolveHTMLHexCharacterReference(CodePoint);
>> +}
>> +
>> StringRef Lexer::resolveHTMLNamedCharacterReference(StringRef Name) const {
>> return llvm::StringSwitch<StringRef>(Name)
>> .Case("amp", "&")
>> @@ -41,8 +66,154 @@ StringRef Lexer::resolveHTMLNamedCharact
>> .Case("gt", ">")
>> .Case("quot", "\"")
>> .Case("apos", "\'")
>> + .Case("minus", "-")
>> + .Case("sim", "~")
>
> Sorry, but this is wrong: sim is U+223C, minus is U+2212.
Old code of mine. Not needed here. Will remove shortly.
>
>> .Default("");
>> }
>> +
>> +StringRef Lexer::HTMLDoxygenCharacterReference(StringRef Name) const {
>> + return llvm::StringSwitch<StringRef>(Name)
>> + .Case("copy", helperResolveHTMLHexCharacterReference(0x000A9))
>> + .Case("trade", helperResolveHTMLHexCharacterReference(0x02122))
>> + .Case("reg", helperResolveHTMLHexCharacterReference(0x000AE))
>
> ...
>
> Is this based on the subset described in
> http://www.stack.nl/~dimitri/doxygen/manual/htmlcmds.html ?
Yes.
>
> I think we can do better than this:
>
> (1) linear search is not great;
> (2) allocation is not great either.
>
> This needs some tablegen magic -- will try to hack up something tomorrow.
Great. Thanks.
- Fariborz
>
> Dmitri
>
> --
> main(i,j){for(i=2;;i++){for(j=2;j<i;j++){if(!(i%j)){j=0;break;}}if
> (j){printf("%d\n",i);}}} /*Dmitri Gribenko <gribozavr at gmail.com>*/
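
For illustration, the two concerns raised above (linear search and allocation) can both be avoided with a static table of pre-encoded UTF-8 literals searched by binary search. In the StringSwitch version, the cases are compared one after another, and every .Case argument is evaluated before the match is known, so the allocating helper runs for each listed entity on every call. The following is only a minimal sketch in plain standard C++, not the tablegen-based approach mentioned in the thread and not what was committed. The names EntityEntry, EntityTable and lookupEntity are hypothetical, and the table holds just a handful of entities, including "minus" and "sim" with the code points given in the review.

```cpp
#include <algorithm>
#include <cstring>
#include <iterator>

// Hypothetical table entry: an entity name and its UTF-8 encoding,
// stored as a string literal so no allocation is needed at lookup time.
struct EntityEntry {
  const char *Name;
  const char *UTF8;
};

// Illustrative subset only. The table must stay sorted by Name
// (byte-wise order), because the lookup below uses binary search.
static const EntityEntry EntityTable[] = {
    {"amp", "&"},
    {"apos", "'"},
    {"copy", "\xC2\xA9"},      // U+00A9
    {"gt", ">"},
    {"lt", "<"},
    {"minus", "\xE2\x88\x92"}, // U+2212
    {"quot", "\""},
    {"reg", "\xC2\xAE"},       // U+00AE
    {"sim", "\xE2\x88\xBC"},   // U+223C
    {"trade", "\xE2\x84\xA2"}, // U+2122
};

// Returns the UTF-8 encoding for a known entity name, or an empty
// string for an unknown name (mirroring the .Default("") case).
const char *lookupEntity(const char *Name) {
  const EntityEntry *Begin = std::begin(EntityTable);
  const EntityEntry *End = std::end(EntityTable);
  const EntityEntry *It = std::lower_bound(
      Begin, End, Name, [](const EntityEntry &E, const char *Key) {
        return std::strcmp(E.Name, Key) < 0;
      });
  if (It != End && std::strcmp(It->Name, Name) == 0)
    return It->UTF8;
  return "";
}
```

With this layout, lookupEntity("trade") returns a pointer to a three-byte literal, and an unknown name such as "nosuch" yields the empty string. A generated table would have the same shape; only the way the entries are produced would differ.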