[cfe-dev] Adding more HTML-related facilities in Doxygen comment parsing
Alp Toker
alp at nuanti.com
Mon Apr 28 06:56:14 PDT 2014
On 28/04/2014 13:45, Dmitri Gribenko wrote:
> On Mon, Apr 28, 2014 at 1:08 PM, Tobias Grosser <tobias at grosser.es> wrote:
>> Out of interest. What is required to sanitize HTML?
> There are two different levels of sanitizing:
> - well-formedness of HTML,
> - absence of javascript.
>
> The former is harder to guarantee than the latter, but it is important
> nevertheless, because being able to directly pass through HTML from
> Clang's output into a webpage template and get back a document that
> passes validation is a useful property.
Dmitri, this may be an interesting problem to solve but it doesn't make
sense to build it into libclang.
LLVM has no procedure for handling 0-day vulnerabilities, contacting
vendors, or pushing updates in coordination with the web community, nor
should it. What happens if a 0-day cross-site-scripting attack is found
and user passwords are stolen?
This is so far out of scope and so mislayered that it's very much a
disservice to the few users who might actually use the facility. Why are
we building an insecure web technology security validator into clang?
That's a separate project.
Ordinarily you pipe tool output through a well-maintained and up-to-date
script that knows about browser and JavaScript quirks. Can we please
just point users to that workflow and get on with things?
>
>> Do we need a full HTML5
>> parser, including all the quirks? With javascript support? How large do you
>> expect this to become? During the time the support is incomplete, can we
>> provide any guarantees about the absence of javascript?
> Our filtering is based on a whitelist for HTML tags and a blacklist
> for attributes. I did my best to look through the HTML5 spec and find
> attributes that can contain embedded javascript, and added those to
> the blacklist. I think our filtering is reasonable and should not
> allow any javascript according to HTML5 spec. But a black list is a
> black list, for example if a certain browser supports a non-standard
> attribute with embedded JS, we will not catch that.
This is not a problem the compiler should be dealing with on any level,
let alone by hand. This is a significant chunk of code that needn't be
there.
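
To make the blacklist weakness described in the quoted text concrete,
here is a minimal sketch of a tag-whitelist plus attribute-blacklist
filter, written against Python's standard html.parser module. It is not
the code in Clang. The tag and attribute lists, the CommentFilter class,
the filter_comment function and the attribute name "onfancyevent" are
all made up for illustration.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical lists for illustration only; these are NOT Clang's lists.
ALLOWED_TAGS = {"b", "i", "em", "p", "a", "code"}
BLOCKED_ATTRS = {"onclick", "onload", "onerror", "onmouseover"}


class CommentFilter(HTMLParser):
    """Keeps whitelisted tags and drops blacklisted attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED_TAGS:
            return  # whitelist: tags we do not know are dropped
        parts = []
        for name, value in attrs:
            if name in BLOCKED_ATTRS:
                continue  # blacklist: only attributes we know about are dropped
            safe_value = escape(value or "", quote=True)
            parts.append(' {}="{}"'.format(name, safe_value))
        self.out.append("<{}{}>".format(tag, "".join(parts)))

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS:
            self.out.append("</{}>".format(tag))

    def handle_data(self, data):
        self.out.append(escape(data))


def filter_comment(html_text):
    """Return html_text with disallowed tags and blocked attributes removed."""
    parser = CommentFilter()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    # A known event handler is stripped.
    print(filter_comment('<b onclick="evil()">hi</b>'))
    # An attribute the blacklist has never heard of passes straight through.
    print(filter_comment('<b onfancyevent="evil()">hi</b>'))
```

The second call shows the gap: any attribute missing from the
blacklist, such as a browser-specific one, is emitted unchanged.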
>
> When implementing further semantic analysis for Doxygen parsing, I
> don't expect many quirks to come from HTML. Most of HTML quirks are
> about rendering, not parsing. In fact, parsing and extraction of HTML
> tags has been implemented for a long time already, we just have no
> idea about their semantics (we only know if a tag may have an end tag
> </foo> or not). I expect more complexity in implementing undocumented
> Doxygen rules about interaction between Doxygen markup and HTML.
As someone who has worked on an HTML5 parser or two, and JavaScript too,
I fail to see how the HTML/JavaScript filters in ToT serve any purpose
at all, because they are, and always will be, trivially exploitable.
The ~20,000 LoC implementing XML schemas, HTML, and JavaScript validators
are all so intertwined that it's difficult to cut things down to provide
the basic comment callbacks and diagnostics users would benefit from.
Alp.
--
http://www.nuanti.com
the browser experts