[cfe-dev] Adding more HTML-related facilities in Doxygen comment parsing

Mon Apr 28 02:14:37 PDT 2014

Hello,

In one of the threads on cfe-commits I was asked by Tobias to provide
a rationale for adding more HTML-related validation facilities in
Clang's comment parsing.

HTML is an integral part of Doxygen syntax.  It is impossible to parse
Doxygen comments correctly by merely parsing the HTML tags; one also
has to do semantic analysis on them.

For example, paragraph splitting is more complex than just finding an
empty line.

/// <b>Aaa
///
/// Bbb
void f();

/// <table>
/// <tr><td>Aaa</tr></td>
///
/// </table>
void g();

Somehow (I am not saying that the rules make sense, it is just how it
is), Doxygen interprets these comments as follows:

1. for f():

<p><b>Aaa</b></p>
<p><b>Bbb </b></p>

The unterminated <b> tag ends up spanning both paragraphs.

2. for g():

<table class="doxtable">
<tr>
<td><p class="starttd">Aaa</p>
<p class="endtd"></p>
</td></tr>
</table>

An empty line between table tags made Doxygen add a second paragraph
to a table cell that had its content clearly specified.

Judging just from these two simple examples, it is clear that parsing
embedded HTML in Doxygen so that the output actually takes the HTML
markup into account requires *semantic* analysis of HTML tags and
transformation of the HTML AST.
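
To make the f() case concrete, here is a minimal sketch of paragraph
splitting that carries unterminated tags across an empty line.  It is
not how Doxygen or Clang implement this; splitParagraphs and
updateOpenTags are hypothetical names, the input is assumed to be
comment lines with the "///" prefix already stripped, and only simple
tags without attributes are handled.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: updates the stack of currently open tags for one line.
// Assumes simple tags without attributes, e.g. <b> and </b>.
static void updateOpenTags(const std::string &line,
                           std::vector<std::string> &open) {
  std::size_t pos = 0;
  while ((pos = line.find('<', pos)) != std::string::npos) {
    std::size_t end = line.find('>', pos);
    if (end == std::string::npos)
      break;
    std::string name = line.substr(pos + 1, end - pos - 1);
    if (!name.empty() && name[0] == '/') {
      name.erase(0, 1);
      // Only a closing tag that matches the innermost open tag closes it.
      if (!open.empty() && open.back() == name)
        open.pop_back();
    } else if (!name.empty()) {
      open.push_back(name);
    }
    pos = end + 1;
  }
}

// Hypothetical helper: splits lines into paragraphs at empty lines.  Tags
// still open at a paragraph break are closed there and re-opened at the
// start of the next paragraph.
std::vector<std::string> splitParagraphs(const std::vector<std::string> &lines) {
  std::vector<std::string> result;
  std::vector<std::string> open;    // tags open at the current position
  std::vector<std::string> carried; // tags open when the paragraph started
  std::string body;

  auto flush = [&]() {
    if (body.empty())
      return;
    std::string para = "<p>";
    for (const std::string &tag : carried)
      para += "<" + tag + ">";
    para += body;
    for (auto it = open.rbegin(); it != open.rend(); ++it)
      para += "</" + *it + ">";
    para += "</p>";
    result.push_back(para);
    body.clear();
  };

  for (const std::string &line : lines) {
    if (line.find_first_not_of(" \t") == std::string::npos) {
      flush(); // an empty line ends the current paragraph
      continue;
    }
    if (body.empty())
      carried = open;
    else
      body += " ";
    body += line;
    updateOpenTags(line, open);
  }
  flush();
  return result;
}
```

Given the lines "<b>Aaa", "" and "Bbb", this produces two paragraphs
that are both wrapped in <b>, which is the shape of the Doxygen output
for f() (apart from whitespace details).  The point is that the
splitter has to know which tags are open, so it cannot be a purely
textual pass.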

It is a non-trivial amount of work to implement this, and I did look
for HTML libraries that could help us in doing so.  libtidy [1] is a
nice one, except that I got the impression that it is "stabilized to
the point of becoming unmaintained" -- there are no releases, code is
available through CVS only, and it was not updated for HTML5.  There
is an experimental HTML5 fork of it [2], which has not been updated
for more than two years, and probably does not correspond to the
current HTML5 draft.

But even if libtidy did completely support HTML5, its interface is not
suitable for the fine-grained parsing and AST manipulation that we
need.  The interface accepts only complete HTML documents for parsing,
while Clang deals with fragments.  Constructing the HTML AST through
libtidy just to figure out what a tag's name is would not deliver good
performance either.
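
For comparison, the kind of low-level operation I have in mind looks
roughly like the sketch below: given a fragment and a position, read
the tag name directly, without building any tree.  lexTagName is a
hypothetical function written for this email, not an existing Clang or
libtidy API, and it ignores attributes, comments and other HTML
details.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: if text[pos] starts an opening or closing HTML tag,
// returns the lowercased tag name; otherwise returns an empty string.
std::string lexTagName(const std::string &text, std::size_t pos) {
  if (pos >= text.size() || text[pos] != '<')
    return "";
  ++pos;
  if (pos < text.size() && text[pos] == '/')
    ++pos; // closing tag, e.g. </td>

  std::string name;
  while (pos < text.size() &&
         std::isalnum(static_cast<unsigned char>(text[pos]))) {
    name += static_cast<char>(
        std::tolower(static_cast<unsigned char>(text[pos])));
    ++pos;
  }
  // A tag name has to start with a letter; "a < b" is not a tag.
  if (name.empty() || !std::isalpha(static_cast<unsigned char>(name[0])))
    return "";
  return name;
}
```

This works on an arbitrary fragment such as "Aaa</tr>" and touches only
the characters of the tag name itself.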

Apart from libtidy, I did not find other *lightweight* libraries (not
HTML rendering engines) that provide the low-level manipulation that
we need.

But parsing and doing semantic analysis correctly is only half of the
story.  Sanitizing the output is important; otherwise Clang clients
cannot use the HTML parts of comments, and have to redo the parsing
work, now with the intent of sanitizing the output.  I think it is
reasonable to state that almost all clients want the output to be
well-formed HTML that is free of JavaScript.  It rarely (if ever)
makes sense to put executable JavaScript into comments anyway.
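
As an illustration of what sanitizing could mean at the tag level,
here is a small sketch based on an allowlist.  The function names and
the particular list of tags are my assumptions for this example, not a
description of what Clang does; names are assumed to be lowercased
already, and URL attributes such as href="javascript:..." would need a
separate check that is not shown.

```cpp
#include <set>
#include <string>

// Hypothetical allowlist: only tags named here are passed through.
// The exact set is an assumption made for this example.
bool isAllowedTag(const std::string &name) {
  static const std::set<std::string> allowed = {
      "a",  "b",  "br", "code", "em", "i",  "li",    "ol",
      "p",  "pre", "strong", "table", "td", "th", "tr", "ul"};
  return allowed.count(name) != 0;
}

// Hypothetical check: rejects event handler attributes such as onclick or
// onload, which are one way to embed executable JavaScript in markup.
bool isAllowedAttribute(const std::string &name) {
  return name.compare(0, 2, "on") != 0;
}
```

With this, <script> is dropped because it is not on the list, and
<b onclick="..."> loses its onclick attribute but keeps the tag.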

I hope this addresses everyone's concerns.

Dmitri

[1] http://tidy.sourceforge.net/
[2] https://github.com/w3c/tidy-html5

-- 
main(i,j){for(i=2;;i++){for(j=2;j<i;j++){if(!(i%j)){j=0;break;}}if
(j){printf("%d\n",i);}}} /*Dmitri Gribenko <gribozavr at gmail.com>*/