[cfe-dev] Adding more HTML-related facilities in Doxygen comment parsing
Dmitri Gribenko
gribozavr at gmail.com
Mon Apr 28 05:45:55 PDT 2014
On Mon, Apr 28, 2014 at 1:08 PM, Tobias Grosser <tobias at grosser.es> wrote:
> Out of interest. What is required to sanitize HTML?
There are two different levels of sanitizing:
- well-formedness of HTML,
- absence of javascript.
The former is harder to guarantee than the latter, but it is important
nevertheless, because being able to directly pass through HTML from
Clang's output into a webpage template and get back a document that
passes validation is a useful property.
> Do we need a full HTML5
> parser, including all the quirks? With javascript support? How large do you
> expect this to become? During the time the support is incomplete, can we
> provide any guarantees about the absence of javascript?
Our filtering is based on a whitelist for HTML tags and a blacklist
for attributes. I did my best to look through the HTML5 spec and find
attributes that can contain embedded javascript, and added those to
the blacklist. I think our filtering is reasonable and should not
allow any javascript according to the HTML5 spec. But a blacklist is a
blacklist: for example, if a certain browser supports a non-standard
attribute with embedded JS, we will not catch that.
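
To make the whitelist/blacklist distinction concrete, here is a minimal
sketch. It is not Clang's actual code: the tag and attribute lists and
the names isTagAllowed/isAttrAllowed are made up for illustration, and
the real lists are longer.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <unordered_set>

// Hypothetical lists for illustration only; not Clang's actual tables.
static const std::unordered_set<std::string> AllowedTags = {
    "a", "b", "br", "code", "em", "i", "li", "ol", "p", "pre", "ul"};
static const std::unordered_set<std::string> KnownScriptAttrs = {
    "onclick", "onerror", "onload", "onmouseover"};

// HTML tag and attribute names are case-insensitive, so compare lowercased.
static std::string toLower(std::string S) {
  std::transform(S.begin(), S.end(), S.begin(), [](unsigned char C) {
    return static_cast<char>(std::tolower(C));
  });
  return S;
}

// Whitelist: a tag that is not explicitly listed is rejected.
bool isTagAllowed(const std::string &Name) {
  return AllowedTags.count(toLower(Name)) != 0;
}

// Blacklist: an attribute that is not explicitly listed is accepted.
bool isAttrAllowed(const std::string &Name) {
  return KnownScriptAttrs.count(toLower(Name)) == 0;
}
```

With these lists, isTagAllowed("script") is false because "script" was
never listed, while isAttrAllowed("onvendorevent") is true because
nobody thought to list that (invented) attribute. That asymmetry is the
weakness of the blacklist side.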
When implementing further semantic analysis for Doxygen parsing, I
don't expect many quirks to come from HTML. Most HTML quirks are
about rendering, not parsing. In fact, parsing and extraction of HTML
tags has been implemented for a long time already; we just have no
idea about their semantics (we only know whether a tag may have an end
tag </foo> or not). I expect more complexity in implementing undocumented
Doxygen rules about interaction between Doxygen markup and HTML.
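
As a rough illustration of how far end-tag knowledge alone gets you
towards a well-formedness check, here is a sketch. The EndTag enum, the
EndTagRules table, the TagToken struct and isBalanced are all
hypothetical names, and the implicit-close rule is much simpler than
what the HTML5 spec describes.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

enum class EndTag { Required, Optional, Forbidden };

// Hypothetical per-tag table for illustration; not Clang's actual data.
static const std::unordered_map<std::string, EndTag> EndTagRules = {
    {"b", EndTag::Required},   {"em", EndTag::Required},
    {"p", EndTag::Optional},   {"li", EndTag::Optional},
    {"br", EndTag::Forbidden}, {"hr", EndTag::Forbidden}};

// One already-extracted tag: its lowercase name and whether it is </name>.
struct TagToken {
  std::string Name;
  bool IsEnd;
};

// Returns true if every tag is known and every element that needs an end
// tag gets one, in properly nested order. Simplified assumption: an element
// with an optional end tag is closed implicitly by an outer end tag or by
// the end of the comment.
bool isBalanced(const std::vector<TagToken> &Tokens) {
  std::vector<std::string> Open;
  for (const TagToken &T : Tokens) {
    auto It = EndTagRules.find(T.Name);
    if (It == EndTagRules.end())
      return false; // unknown tag
    if (!T.IsEnd) {
      if (It->second != EndTag::Forbidden)
        Open.push_back(T.Name);
      continue;
    }
    if (It->second == EndTag::Forbidden)
      return false; // e.g. </br>
    // Implicitly close elements with optional end tags until we find a match.
    while (!Open.empty() && Open.back() != T.Name) {
      if (EndTagRules.at(Open.back()) != EndTag::Optional)
        return false; // mismatched nesting, e.g. <b><em></b>
      Open.pop_back();
    }
    if (Open.empty())
      return false; // end tag without a start tag
    Open.pop_back();
  }
  // Anything still open must be allowed to omit its end tag.
  for (const std::string &Name : Open)
    if (EndTagRules.at(Name) != EndTag::Optional)
      return false;
  return true;
}
```

Under these assumptions "<p><b>text</b>" is accepted, because <p> may
omit its end tag, while "<b><em>text</b>" is rejected. Anything beyond
this (which tags may nest in which, how they interact with Doxygen
commands) needs the semantic information we do not have yet.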
Dmitri
--
main(i,j){for(i=2;;i++){for(j=2;j<i;j++){if(!(i%j)){j=0;break;}}if
(j){printf("%d\n",i);}}} /*Dmitri Gribenko <gribozavr at gmail.com>*/