[cfe-dev] Adding more HTML-related facilities in Doxygen comment parsing

Mon Apr 28 05:08:27 PDT 2014

On 28/04/2014 11:14, Dmitri Gribenko wrote:
> Hello,
>
> In one of the threads on cfe-commits I was asked by Tobias to provide
> a rationale for adding more HTML-related validation facilities in
> Clang's comment parsing.

Thanks Dmitri. This is very informative.

> HTML is an indivisible part of Doxygen syntax.  It is impossible to
> parse Doxygen correctly by merely parsing HTML tags; it also requires
> semantic analysis of them.
>
> For example, paragraph splitting is more complex than just finding an
> empty line.
>
> /// <b>Aaa
> ///
> /// Bbb
> void f();
>
> /// <table>
> /// <tr><td>Aaa</tr></td>
> ///
> /// </table>
> void g();
>
> Somehow (I am not saying that the rules make sense, it is just what it
> is), Doxygen interprets these as follows:
>
> 1. for f():
>
> <p><b>Aaa</b></p>
> <p><b>Bbb </b></p>
>
> An unterminated <b> tag started to span multiple paragraphs.
>
> 2. for g():
>
> <table class="doxtable">
> <tr>
> <td><p class="starttd">Aaa</p>
> <p class="endtd"></p>
> </td></tr>
> </table>
>
> An empty line between table tags made Doxygen add a second paragraph
> to a table cell that had its content clearly specified.
>
> Judging just from these two simple examples, it is clear that parsing
> embedded HTML in Doxygen so that the output actually takes the HTML
> markup into account requires *semantic* analysis of HTML tags and
> transformation of the HTML AST.
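
To make the f() case concrete, here is a minimal sketch of paragraph
splitting that carries unterminated tags across an empty line. It is
not Doxygen's or Clang's algorithm, only an approximation of the
behaviour quoted above. The names splitParagraphs and trackTags are
hypothetical, and the sketch assumes attribute-free tags and input
lines with the leading "///" already stripped.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: updates the stack of open tags with the tags found in
// `text`. Handles only attribute-free tags such as <b> and </b>.
static void trackTags(const std::string &text, std::vector<std::string> &open) {
  std::size_t pos = 0;
  while ((pos = text.find('<', pos)) != std::string::npos) {
    std::size_t end = text.find('>', pos);
    if (end == std::string::npos)
      break;
    std::string name = text.substr(pos + 1, end - pos - 1);
    if (!name.empty() && name[0] == '/') {
      // Closing tag: pop only when it matches the innermost open tag.
      if (!open.empty() && open.back() == name.substr(1))
        open.pop_back();
    } else if (!name.empty()) {
      open.push_back(name);
    }
    pos = end + 1;
  }
}

// Splits comment lines into paragraphs at empty lines. Tags that are still
// open at a paragraph boundary are closed there and reopened at the start of
// the next paragraph.
std::vector<std::string> splitParagraphs(const std::vector<std::string> &lines) {
  std::vector<std::string> paragraphs;
  std::vector<std::string> open; // tags open at the current position
  std::string body;
  bool inParagraph = false;

  auto flush = [&]() {
    if (!inParagraph)
      return;
    std::string closing;
    for (auto it = open.rbegin(); it != open.rend(); ++it)
      closing += "</" + *it + ">";
    paragraphs.push_back("<p>" + body + closing + "</p>");
    body.clear();
    inParagraph = false;
  };

  for (const std::string &line : lines) {
    if (line.empty()) {
      flush();
      continue;
    }
    if (!inParagraph) {
      // Reopen every tag carried over from the previous paragraph.
      for (const std::string &tag : open)
        body += "<" + tag + ">";
      inParagraph = true;
    } else {
      body += ' ';
    }
    body += line;
    trackTags(line, open);
  }
  flush();
  return paragraphs;
}
```

Calling splitParagraphs({"<b>Aaa", "", "Bbb"}) yields two paragraphs,
each wrapped in its own <b>...</b> pair, which is the shape of the
output for f(). The g() case would additionally need per-tag rules
(for example, what an empty line means inside <td>), which is where
the semantic analysis comes in.
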
>
> It is a non-trivial amount of work to implement this, and I did look
> for HTML libraries that could help us in doing so.  libtidy [1] is a
> nice one, except that I got the impression that it is "stabilized to
> the point of becoming unmaintained" -- there are no releases, code is
> available through cvs only, and it was not updated for HTML5.  There
> is an experimental HTML5 fork of it [2], which was not updated for
> more than 2 years, and probably does not correspond to the current
> HTML5 draft.
>
> But even if libtidy did completely support HTML5, its interface is not
> suitable for fine-grained parsing and AST manipulation that we need.
> The interface accepts only complete HTML docs for parsing, while Clang
> deals with fragments.  Constructing the HTML AST through libtidy just
> to figure out what a tag's name is will not deliver good performance
> either.
>
> Apart from libtidy, I did not find other *lightweight* libraries (not
> HTML rendering engines) that provide low-level manipulation that we
> need.
>
> But parsing and doing semantic analysis correctly is only half of the
> story.  Sanitizing the output is important; otherwise Clang clients
> cannot use the HTML parts of comments, and have to redo the parsing
> work, now with the intent of sanitizing the output.  I think it is
> reasonable to state that almost all clients want the output as
> well-formed HTML and sanitized of JavaScript.  It rarely (if ever)
> makes sense to put executable JavaScript into comments anyway.
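
For illustration only, sanitizing along these lines could start from
an allow-list check like the sketch below. Everything in it is an
assumption rather than anything Clang implements: the function names
isAllowedTag and isSafeAttribute, the particular list of tags, and the
rule of rejecting "on*" attributes and "javascript:" values. A real
sanitizer would have to handle many more cases (character entities,
embedded whitespace, other URL schemes).

```cpp
#include <algorithm>
#include <cctype>
#include <set>
#include <string>

// Hypothetical helper: lower-cases ASCII so comparisons are case-insensitive.
static std::string toLower(std::string s) {
  std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
    return static_cast<char>(std::tolower(c));
  });
  return s;
}

// Returns true if the tag is on the (assumed, deliberately short) allow-list.
// Anything not listed, including <script>, is rejected.
bool isAllowedTag(const std::string &name) {
  static const std::set<std::string> allowed = {
      "a",  "b",  "br", "code", "em", "i",  "li",
      "ol", "p",  "table", "td", "th", "tr", "ul"};
  return allowed.count(toLower(name)) != 0;
}

// Returns false for attributes that can carry script: event handlers such as
// onclick, and values that start with the "javascript:" scheme.
bool isSafeAttribute(const std::string &name, const std::string &value) {
  const std::string n = toLower(name);
  if (n.compare(0, 2, "on") == 0)
    return false;

  std::string v = toLower(value);
  v.erase(0, v.find_first_not_of(" \t\r\n")); // drop leading whitespace
  return v.compare(0, 11, "javascript:") != 0;
}
```

With this, <b> passes and <script> does not, and an attribute such as
onclick="..." or href="javascript:..." is dropped. Rejecting by
default means an incomplete implementation errs toward removing
markup rather than letting script through.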

Out of interest: what is required to sanitize HTML? Do we need a full
HTML5 parser, including all the quirks? With JavaScript support? How
large do you expect this to become? While the support is incomplete,
can we provide any guarantees about the absence of JavaScript?

Thanks again, Dmitri!

Tobias