[cfe-dev] Adding more HTML-related facilities in Doxygen comment parsing

Mon Apr 28 05:51:47 PDT 2014

On 28/04/2014 14:45, Dmitri Gribenko wrote:
> On Mon, Apr 28, 2014 at 1:08 PM, Tobias Grosser <tobias at grosser.es> wrote:
>> Out of interest: what is required to sanitize HTML?
>
> There are two different levels of sanitizing:
> - well-formedness of HTML,
> - absence of javascript.
>
> The former is harder to guarantee than the latter, but it is important
> nevertheless, because being able to directly pass through HTML from
> Clang's output into a webpage template and get back a document that
> passes validation is a useful property.
>
>> Do we need a full HTML5
>> parser, including all the quirks? With javascript support? How large do you
>> expect this to become? During the time the support is incomplete, can we
>> provide any guarantees about the absence of javascript?
>
> Our filtering is based on a whitelist for HTML tags and a blacklist
> for attributes.  I did my best to look through the HTML5 spec and find
> attributes that can contain embedded javascript, and added those to
> the blacklist.  I think our filtering is reasonable and should not
> allow any javascript according to the HTML5 spec.  But a blacklist is
> still a blacklist: if, for example, a certain browser supports a
> non-standard attribute with embedded JS, we will not catch that.
>
> When implementing further semantic analysis for Doxygen parsing, I
> don't expect many quirks to come from HTML.  Most HTML quirks are
> about rendering, not parsing.  In fact, parsing and extraction of HTML
> tags has been implemented for a long time already; we just have no
> idea about their semantics (we only know whether a tag may have an end
> tag </foo> or not).  I expect more complexity in implementing undocumented
> Doxygen rules about interaction between Doxygen markup and HTML.
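
For readers following along, the tag whitelist plus attribute blacklist described above could be sketched roughly as follows. This is only an illustration in Python, not Clang's code: the names ALLOWED_TAGS, BLOCKED_ATTRIBUTES and filter_tag, and the contents of both sets, are assumptions made up for the example.

```python
# Hypothetical lists for illustration only; these are NOT Clang's actual tables.
ALLOWED_TAGS = {"a", "b", "br", "code", "em", "i", "li", "ol", "p", "strong", "ul"}
BLOCKED_ATTRIBUTES = {"onclick", "onerror", "onload", "onmouseover"}


def is_tag_allowed(tag_name):
    """Whitelist check: a tag passes only if it is explicitly listed."""
    return tag_name.lower() in ALLOWED_TAGS


def is_attribute_allowed(attr_name):
    """Blacklist check: an attribute passes unless it is explicitly listed."""
    return attr_name.lower() not in BLOCKED_ATTRIBUTES


def filter_tag(tag_name, attributes):
    """Return (lowercased tag, kept attributes), or None if the tag is rejected.

    `attributes` is a dict mapping attribute names to their values.
    """
    if not is_tag_allowed(tag_name):
        return None
    kept = {name: value for name, value in attributes.items()
            if is_attribute_allowed(name)}
    return tag_name.lower(), kept
```

With these assumed lists, filter_tag("script", {}) is rejected outright, and filter_tag("A", {"href": "x", "onclick": "e()"}) keeps only href. An attribute the blacklist has never heard of, say a made-up non-standard "onvendorthing", passes through untouched, which is exactly the weakness of a blacklist mentioned above.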

Very interesting. Thanks a lot.

Tobias