[cfe-dev] Adding more HTML-related facilities in Doxygen comment parsing

Alp Toker alp at nuanti.com
Mon Apr 28 06:56:14 PDT 2014


On 28/04/2014 13:45, Dmitri Gribenko wrote:
> On Mon, Apr 28, 2014 at 1:08 PM, Tobias Grosser <tobias at grosser.es> wrote:
>> Out of interest. What is required to sanitize HTML?
> There are two different levels of sanitizing:
> - well-formedness of HTML,
> - absence of javascript.
>
> The former is harder to guarantee than the latter, but it is important
> nevertheless, because being able to directly pass through HTML from
> Clang's output into a webpage template and get back a document that
> passes validation is a useful property.

Dmitri, this may be an interesting problem to solve, but it doesn't make 
sense to build it into libclang.

LLVM has no procedure for handling 0-day vulnerabilities, contacting 
vendors, or pushing updates in coordination with the web community, nor 
should it. What happens if a 0-day cross-site scripting attack is found 
in this code and user passwords are stolen?

This is so far out of scope and mislayered that it's a disservice to the 
few users who might actually use the facility. Why are we building a web 
technology security validator into clang when it is itself insecure? 
That's a separate project.

Ordinarily you pipe tool output through a well-maintained and up-to-date 
script that knows about browser and JavaScript quirks. Can we please 
just point users to that workflow and get on with things?
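
For illustration only, that external filter could be as simple as the 
sketch below. The pipeline shape is the point; the tag and attribute 
lists are assumptions, and the third-party `bleach` package is used 
purely as an example of a maintained sanitizer that tracks browser 
quirks so the compiler doesn't have to:

  #!/usr/bin/env python3
  # Hypothetical post-processing filter: pipe Clang-emitted documentation
  # HTML through an external, actively maintained sanitizer instead of an
  # in-compiler one.
  import sys
  import bleach

  ALLOWED_TAGS = {"p", "em", "strong", "code", "pre", "ul", "ol", "li", "a", "br"}
  ALLOWED_ATTRS = {"a": ["href"]}

  def main():
      raw = sys.stdin.read()
      sys.stdout.write(bleach.clean(raw,
                                    tags=ALLOWED_TAGS,
                                    attributes=ALLOWED_ATTRS,
                                    protocols=["http", "https"],
                                    strip=True))

  if __name__ == "__main__":
      main()

A doc generator would run its HTML fragments through this script (or an 
equivalent) as the last step before templating, keeping the security 
policy outside the compiler where it can be updated independently.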


>
>> Do we need a full HTML5
>> parser, including all the quirks? With javascript support? How large do you
>> expect this to become? During the time the support is incomplete, can we
>> provide any guarantees about the absence of javascript?
> Our filtering is based on a whitelist for HTML tags and a blacklist
> for attributes.  I did my best to look through the HTML5 spec and find
> attributes that can contain embedded javascript, and added those to
> the blacklist.  I think our filtering is reasonable and should not
> allow any javascript according to the HTML5 spec.  But a blacklist is
> a blacklist: for example, if a certain browser supports a non-standard
> attribute with embedded JS, we will not catch it.
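
(Purely to make the discussion concrete, the approach described above 
amounts to something like the sketch below. This is hypothetical code, 
not what Clang ships, and the weakness conceded above is visible in it: 
the attribute blacklist can only block what the spec enumerates.)

  # Rough sketch of a tag-whitelist / attribute-blacklist filter, along
  # the lines described above.  Illustrative only; not Clang's code.
  import html
  from html.parser import HTMLParser

  TAG_WHITELIST = {"p", "em", "strong", "code", "ul", "ol", "li", "a", "br"}
  # Attributes that can carry script per the HTML5 spec.  Anything a
  # browser honours beyond the spec slips straight through -- the
  # blacklist problem.
  SCRIPT_ATTR_PREFIXES = ("on",)       # onclick, onload, onerror, ...
  URL_ATTRS = {"href", "src"}          # rejected when the value is javascript:

  class CommentHTMLFilter(HTMLParser):
      def __init__(self):
          super().__init__()
          self.out = []

      def handle_starttag(self, tag, attrs):
          if tag not in TAG_WHITELIST:
              return
          kept = ""
          for name, value in attrs:
              name = name.lower()
              value = value or ""
              if name.startswith(SCRIPT_ATTR_PREFIXES):
                  continue
              if name in URL_ATTRS and value.strip().lower().startswith("javascript:"):
                  continue
              kept += ' %s="%s"' % (name, html.escape(value, quote=True))
          self.out.append("<%s%s>" % (tag, kept))

      def handle_endtag(self, tag):
          if tag in TAG_WHITELIST:
              self.out.append("</%s>" % tag)

      def handle_data(self, data):
          self.out.append(html.escape(data, quote=False))

  def filter_comment_html(text):
      p = CommentHTMLFilter()
      p.feed(text)
      p.close()
      return "".join(p.out)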

This is not a problem the compiler should be dealing with on any level, 
let alone by hand. This is a significant chunk of code that needn't be 
there.

>
> When implementing further semantic analysis for Doxygen parsing, I
> don't expect many quirks to come from HTML.  Most HTML quirks are
> about rendering, not parsing.  In fact, parsing and extraction of HTML
> tags have been implemented for a long time already; we just have no
> idea about their semantics (we only know whether a tag may have an end
> tag </foo> or not).  I expect more complexity in implementing
> undocumented Doxygen rules about the interaction between Doxygen
> markup and HTML.
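
(For reference, the one piece of tag semantics mentioned above, whether 
a tag may have an end tag, corresponds to HTML5's void elements, a small 
fixed table of names. The representation below is illustrative, not 
Clang's own.)

  # Void elements per the HTML5 spec: tags that never take an end tag.
  # This is roughly all the HTML knowledge needed to decide whether to
  # expect a matching </foo>.
  VOID_ELEMENTS = {
      "area", "base", "br", "col", "embed", "hr", "img", "input",
      "link", "meta", "param", "source", "track", "wbr",
  }

  def may_have_end_tag(tag):
      return tag.lower() not in VOID_ELEMENTS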

As someone who has worked on an HTML5 parser or two, and on JavaScript 
too, I fail to see how the HTML/JavaScript filters in ToT serve any 
purpose at all, because they are, and always will be, trivially 
exploitable.

The ~20,000 LoC implementing XML schemas and HTML/JavaScript validators 
are all so intertwined that it's difficult to cut things down to provide 
the basic comment callbacks and diagnostics users would benefit from.

Alp.

-- 
http://www.nuanti.com
the browser experts



