[cfe-dev] Adding more HTML-related facilities in Doxygen comment parsing

Mon Apr 28 05:45:55 PDT 2014

On Mon, Apr 28, 2014 at 1:08 PM, Tobias Grosser <tobias at grosser.es> wrote:
> Out of interest. What is required to sanitize HTML?

There are two different levels of sanitizing:
- well-formedness of HTML,
- absence of JavaScript.

The former is harder to guarantee than the latter, but it is important
nevertheless, because being able to directly pass through HTML from
Clang's output into a webpage template and get back a document that
passes validation is a useful property.

> Do we need a full HTML5
> parser, including all the quirks? With javascript support? How large do you
> expect this to become? During the time the support is incomplete, can we
> provide any guarantees about the absence of javascript?

Our filtering is based on a whitelist for HTML tags and a blacklist
for attributes.  I did my best to look through the HTML5 spec and find
attributes that can contain embedded JavaScript, and added those to
the blacklist.  I think our filtering is reasonable and should not
allow any JavaScript according to the HTML5 spec.  But a blacklist is
a blacklist: for example, if a certain browser supports a non-standard
attribute with embedded JS, we will not catch that.
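
To make the scheme concrete, here is a minimal sketch of a tag
whitelist combined with an attribute blacklist.  It is not Clang's
actual code: the names (AllowedTags, BlockedAttrs, StartTag,
isTagSafe) and the abbreviated lists are hypothetical and chosen only
for illustration.

```cpp
#include <algorithm>
#include <cctype>
#include <set>
#include <string>
#include <vector>

// Hypothetical, heavily abbreviated lists; NOT the lists Clang uses.
static const std::set<std::string> AllowedTags = {
    "b", "br", "code", "em", "li", "p", "pre", "ul"};
static const std::set<std::string> BlockedAttrs = {
    "onclick", "onerror", "onload", "onmouseover"};

// HTML tag and attribute names are case-insensitive, so compare in
// lower case.
static std::string toLower(std::string S) {
  std::transform(S.begin(), S.end(), S.begin(), [](unsigned char C) {
    return static_cast<char>(std::tolower(C));
  });
  return S;
}

struct Attribute {
  std::string Name;
  std::string Value;
};

struct StartTag {
  std::string Name;
  std::vector<Attribute> Attrs;
};

// Returns true if the tag is on the whitelist and none of its
// attributes is on the blacklist.
bool isTagSafe(const StartTag &Tag) {
  if (!AllowedTags.count(toLower(Tag.Name)))
    return false; // Unknown tags are rejected by default.
  for (const Attribute &A : Tag.Attrs)
    if (BlockedAttrs.count(toLower(A.Name)))
      return false; // Known script-carrying attribute.
  return true; // Unknown attributes pass: the blacklist weakness.
}
```

The asymmetry is visible in the last line: an unknown tag is rejected,
but an unknown attribute is accepted, so an attribute that is missing
from the list goes through unnoticed.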

When implementing further semantic analysis for Doxygen parsing, I
don't expect many quirks to come from HTML.  Most HTML quirks are
about rendering, not parsing.  In fact, parsing and extraction of HTML
tags has been implemented for a long time already; we just have no
idea about their semantics (we only know whether a tag may have an end
tag </foo> or not).  I expect more complexity in implementing
undocumented Doxygen rules about interaction between Doxygen markup
and HTML.
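
As an illustration of how far the "may have an end tag" bit alone can
go, here is a rough sketch of a nesting check over already-extracted
tags.  The names (NoEndTag, TagEvent, isBalanced) and the tag list are
hypothetical, not Clang's code.  The sketch also assumes that every
tag outside NoEndTag must be closed explicitly, which is stricter than
HTML itself (HTML lets some end tags, such as </p>, be omitted).

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical subset of tags that never take an end tag.
static const std::set<std::string> NoEndTag = {"br", "hr", "img"};

struct TagEvent {
  std::string Name; // Lower-case tag name.
  bool IsEnd;       // true for </foo>, false for <foo>.
};

// Returns true if every tag that takes an end tag is closed in
// properly nested order and no end tag appears without a start tag.
bool isBalanced(const std::vector<TagEvent> &Events) {
  std::vector<std::string> Open; // Stack of currently open tags.
  for (const TagEvent &E : Events) {
    if (!E.IsEnd) {
      if (!NoEndTag.count(E.Name))
        Open.push_back(E.Name);
      continue;
    }
    // An end tag must match the innermost open tag.
    if (NoEndTag.count(E.Name) || Open.empty() || Open.back() != E.Name)
      return false;
    Open.pop_back();
  }
  return Open.empty(); // Anything left on the stack was never closed.
}
```

For example, <p><b></b><br></p> passes this check, while <p><b></p></b>
does not.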

Dmitri

-- 
main(i,j){for(i=2;;i++){for(j=2;j<i;j++){if(!(i%j)){j=0;break;}}if
(j){printf("%d\n",i);}}} /*Dmitri Gribenko <gribozavr at gmail.com>*/