[cfe-dev] Adding more HTML-related facilities in Doxygen comment parsing

Mon Apr 28 11:27:06 PDT 2014

What applications does this HTML5 validation enable?  I've tried to skim
this thread to find the big picture, but I can't find it.

Why does Clang need to validate the HTML, rather than simply associating
comments with Decls and handing them over to a client who knows the details
of Doxygen and HTML?

On Mon, Apr 28, 2014 at 2:14 AM, Dmitri Gribenko <gribozavr at gmail.com> wrote:

> Hello,
>
> In one of the threads on cfe-commits I was asked by Tobias to provide
> a rationale for adding more HTML-related validation facilities in
> Clang's comment parsing.
>
> HTML is an indivisible part of Doxygen syntax.  It is impossible to
> parse Doxygen correctly by merely parsing HTML tags; semantic
> analysis of those tags is required as well.
>
> For example, paragraph splitting is more complex than just finding an
> empty line.
>
> /// <b>Aaa
> ///
> /// Bbb
> void f();
>
> /// <table>
> /// <tr><td>Aaa</tr></td>
> ///
> /// </table>
> void g();
>
> Doxygen interprets these as follows (I am not saying that the rules
> make sense, it is just how Doxygen behaves):
>
> 1. for f():
>
> <p><b>Aaa</b></p>
> <p><b>Bbb </b></p>
>
> The unterminated <b> tag ends up spanning multiple paragraphs.
>
> 2. for g():
>
> <table class="doxtable">
> <tr>
> <td><p class="starttd">Aaa</p>
> <p class="endtd"></p>
> </td></tr>
> </table>
>
> An empty line between table tags made Doxygen add a second paragraph
> to a table cell that had its content clearly specified.
>
> Judging just from these two simple examples, it is clear that parsing
> embedded HTML in Doxygen so that the output actually takes the HTML
> markup into account requires *semantic* analysis of HTML tags and
> transformation of the HTML AST.
>
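
The f() case above can be sketched in a few lines of C++. This is only an
illustration of why paragraph splitting has to track open tags: the helper
name splitParagraphs and the short list of inline tags are assumptions made
for this sketch, not Clang's or Doxygen's actual code, and attributes, the
table rules and Doxygen's exact whitespace are deliberately ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of Clang or Doxygen): splits comment lines
// into paragraphs at blank lines and carries unterminated inline tags
// (only <b>, <i> and <em> here) over into the following paragraphs.
std::vector<std::string> splitParagraphs(const std::vector<std::string> &lines) {
  std::vector<std::string> open;   // inline tags currently open, outermost first
  std::vector<std::string> result; // finished paragraphs
  std::string body;                // paragraph being built
  bool inParagraph = false;

  auto isInline = [](const std::string &t) {
    return t == "b" || t == "i" || t == "em";
  };

  // Close every tag that is still open, then close the paragraph itself.
  auto closeParagraph = [&] {
    if (!inParagraph)
      return;
    std::string p = "<p>" + body;
    for (auto it = open.rbegin(); it != open.rend(); ++it)
      p += "</" + *it + ">";
    result.push_back(p + "</p>");
    body.clear();
    inParagraph = false;
  };

  for (const std::string &line : lines) {
    if (line.find_first_not_of(" \t") == std::string::npos) {
      closeParagraph(); // a blank line ends the current paragraph
      continue;
    }
    if (!inParagraph) {
      inParagraph = true;
      for (const std::string &t : open) // reopen tags left open earlier
        body += "<" + t + ">";
    } else {
      body += " ";
    }
    body += line;

    // Update the stack of open inline tags from the tags on this line.
    std::size_t pos = 0;
    while ((pos = line.find('<', pos)) != std::string::npos) {
      std::size_t end = line.find('>', pos);
      if (end == std::string::npos)
        break;
      std::string name = line.substr(pos + 1, end - pos - 1);
      if (!name.empty() && name[0] == '/') {
        name.erase(0, 1);
        if (!open.empty() && open.back() == name)
          open.pop_back();
      } else if (isInline(name)) {
        open.push_back(name);
      }
      pos = end + 1;
    }
  }
  closeParagraph();
  return result;
}
```

Calling splitParagraphs({"<b>Aaa", "", "Bbb"}) yields two paragraphs, each
wrapped in its own <b>...</b> pair, which is the shape of the Doxygen output
quoted for f(). Splitting on blank lines alone would leave the first <b>
unclosed and the second paragraph unbolded.
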
> It is a non-trivial amount of work to implement this, and I did look
> for HTML libraries that could help us in doing so.  libtidy [1] is a
> nice one, except that I got the impression that it is "stabilized to
> the point of becoming unmaintained" -- there are no releases, code is
> available through cvs only, and it was not updated for HTML5.  There
> is an experimental HTML5 fork of it [2], which was not updated for
> more than 2 years, and probably does not correspond to the current
> HTML5 draft.
>
> But even if libtidy did completely support HTML5, its interface is not
> suitable for the fine-grained parsing and AST manipulation that we need.
> The interface accepts only complete HTML docs for parsing, while Clang
> deals with fragments.  Constructing the HTML AST through libtidy just
> to figure out what the tag name is is not going to deliver good
> performance either.
>
> Apart from libtidy, I did not find other *lightweight* libraries (not
> HTML rendering engines) that provide the low-level manipulation that
> we need.
>
> But parsing and doing semantic analysis correctly is only half of the
> story.  Sanitizing the output is important; otherwise Clang clients
> cannot use the HTML parts of comments and have to redo the parsing
> work, now with the intent of sanitizing the output.  I think it is
> reasonable to state that almost all clients want the output as
> well-formed HTML and sanitized of JavaScript.  It rarely (if ever)
> makes sense to put executable JavaScript into comments anyway.
>
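
As a rough sketch of what "sanitized of JavaScript" could mean for a single
start tag, the C++ below uses an allowlist of tag names and drops
event-handler attributes and javascript: URLs. Everything here is an
assumption made for illustration: the names isAllowedTag, isAllowedAttr and
sanitizeStartTag, the contents of the allowlist, and the premise that the tag
name and attributes were already tokenized by a comment parser. It is not
Clang's API, and a production sanitizer needs a much more careful policy.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <set>
#include <string>
#include <utility>
#include <vector>

using Attr = std::pair<std::string, std::string>; // (name, value)

// ASCII lower-casing, since HTML tag and attribute names are case-insensitive.
static std::string lower(std::string s) {
  std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
    return static_cast<char>(std::tolower(c));
  });
  return s;
}

// Illustrative allowlist only; note that <script> is not in it.
bool isAllowedTag(const std::string &name) {
  static const std::set<std::string> allowed = {"a",  "b",     "i",  "em",
                                                "p",  "table", "tr", "td"};
  return allowed.count(lower(name)) != 0;
}

// Simplified policy: reject on* event handlers and javascript: URLs.
bool isAllowedAttr(const Attr &a) {
  const std::string name = lower(a.first);
  const std::string value = lower(a.second);
  if (name.rfind("on", 0) == 0) // onclick, onload, ...
    return false;
  if (value.rfind("javascript:", 0) == 0)
    return false;
  return true;
}

// Escape characters that could terminate the attribute value or the tag.
static std::string escapeAttr(const std::string &v) {
  std::string out;
  for (char c : v) {
    switch (c) {
    case '&': out += "&amp;"; break;
    case '"': out += "&quot;"; break;
    case '<': out += "&lt;"; break;
    case '>': out += "&gt;"; break;
    default: out += c;
    }
  }
  return out;
}

// Returns the sanitized start tag, or an empty string if the tag is dropped.
std::string sanitizeStartTag(const std::string &name,
                             const std::vector<Attr> &attrs) {
  if (!isAllowedTag(name))
    return "";
  std::string out = "<" + lower(name);
  for (const Attr &a : attrs)
    if (isAllowedAttr(a))
      out += " " + lower(a.first) + "=\"" + escapeAttr(a.second) + "\"";
  return out + ">";
}
```

With this policy a <script> start tag is dropped entirely, while an <a> tag
keeps its href but loses any onclick attribute. The prefix checks are
intentionally naive (for example, they do not strip leading whitespace from
URL values), which is part of why doing this once inside the parser is
preferable to every client reimplementing it.
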
> I hope this addresses everyone's concerns.
>
> Dmitri
>
> [1] http://tidy.sourceforge.net/
> [2] https://github.com/w3c/tidy-html5
>
> --
> main(i,j){for(i=2;;i++){for(j=2;j<i;j++){if(!(i%j)){j=0;break;}}if
> (j){printf("%d\n",i);}}} /*Dmitri Gribenko <gribozavr at gmail.com>*/