[cfe-dev] Adding more HTML-related facilities in Doxygen comment parsing

Mon Apr 28 07:14:22 PDT 2014

On Mon, Apr 28, 2014 at 2:56 PM, Alp Toker <alp at nuanti.com> wrote:
>
> On 28/04/2014 13:45, Dmitri Gribenko wrote:
>>
>> On Mon, Apr 28, 2014 at 1:08 PM, Tobias Grosser <tobias at grosser.es> wrote:
>>>
>>> Out of interest, what is required to sanitize HTML?
>>
>> There are two different levels of sanitizing:
>> - well-formedness of HTML,
>> - absence of javascript.
>>
>> The former is harder to guarantee than the latter, but it is important
>> nevertheless, because being able to directly pass through HTML from
>> Clang's output into a webpage template and get back a document that
>> passes validation is a useful property.
>
>
> Dmitri, this may be an interesting problem to solve but it doesn't make
> sense to build it into libclang.
>
> LLVM has no procedure for handling 0-day vulnerabilities, contacting vendors,
> pushing updates, or working with the web community, nor should it. What happens
> if a 0-day cross-site-scripting attack is found and user passwords are
> stolen?
>
> This is really so far out of scope and mislayered, that it's very much a
> disservice to the few users who might actually use the facility. Why are we
> building a web technology security validator into clang that is insecure?
> That's a separate project.
>
> Ordinarily you pipe tool output through a well-maintained and up-to-date
> script that knows about browser and JavaScript quirks. Can we please just
> point users to that workflow and get on with things?

Parsing Doxygen is inherently intertwined with HTML parsing and
semantic analysis.  Doing the filtering at the same level does not look
out of scope or mislayered.
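
To make the two levels of sanitizing discussed above concrete, here is a
minimal sketch, not Clang's actual implementation.  It assumes a hypothetical
flat list of already-extracted tags (the HtmlTag struct and the helper names
below are invented for illustration) and checks well-formedness and absence
of JavaScript separately.  The JavaScript check is a naive deny-list and is
not meant as a complete filter.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified tag model; NOT Clang's comment AST.
struct HtmlTag {
  std::string Name;  // tag name, e.g. "b"
  bool IsEnd;        // true for an end tag such as </b>
  std::vector<std::pair<std::string, std::string>> Attrs;  // name, value
};

static std::string toLower(std::string S) {
  std::transform(S.begin(), S.end(), S.begin(), [](unsigned char C) {
    return static_cast<char>(std::tolower(C));
  });
  return S;
}

// Tags that never take an end tag (illustrative subset only).
static bool isVoidTag(const std::string &Name) {
  return Name == "br" || Name == "hr" || Name == "img";
}

// Level 1: every non-void start tag is closed, in nesting order.
bool isWellFormed(const std::vector<HtmlTag> &Tags) {
  std::vector<std::string> Open;  // stack of currently open tag names
  for (const HtmlTag &T : Tags) {
    const std::string Name = toLower(T.Name);
    if (!T.IsEnd) {
      if (!isVoidTag(Name))
        Open.push_back(Name);
      continue;
    }
    // An end tag must match the innermost open tag.
    if (isVoidTag(Name) || Open.empty() || Open.back() != Name)
      return false;
    Open.pop_back();
  }
  return Open.empty();
}

// Level 2: no <script>, no on* event handlers, no javascript: URLs.
bool isScriptFree(const std::vector<HtmlTag> &Tags) {
  for (const HtmlTag &T : Tags) {
    if (toLower(T.Name) == "script")
      return false;
    for (const auto &A : T.Attrs) {
      const std::string Key = toLower(A.first);
      const std::string Val = toLower(A.second);
      if (Key.compare(0, 2, "on") == 0)
        return false;
      if (Val.compare(0, 11, "javascript:") == 0)
        return false;
    }
  }
  return true;
}
```

The two checks are kept independent on purpose: a comment can be well-formed
and still carry a script, or be script-free and still leave a tag unclosed.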

>> When implementing further semantic analysis for Doxygen parsing, I
>> don't expect many quirks to come from HTML.  Most HTML quirks are
>> about rendering, not parsing.  In fact, parsing and extraction of HTML
>> tags has been implemented for a long time already; we just have no
>> idea about their semantics (we only know if a tag may have an end tag
>> </foo> or not).  I expect more complexity in implementing undocumented
>> Doxygen rules about interaction between Doxygen markup and HTML.
>
>
> As someone who has worked on an HTML5 parser or two, and JavaScript too, I
> fail to see how the HTML/JavaScript filters in ToT serve any purpose at all
> because they are, and always will be, trivially exploitable.

I would disagree.

> ~20,000 LoC implementing XML schemas, HTML, JavaScript validators .. are all
> so intertwined it's difficult to cut things down to provide the basic
> comment callbacks and diagnostics users would benefit from.

Alp, the way you have been framing this discussion is
not constructive.  You are trying to reuse Clang's comment parsing for
some other, as yet unknown, purpose.  It seems that it is hard for you to
factor the code (because it is tied to Clang's ASTs, on purpose, in order
to provide diagnostics), but you have started blaming the code and finding
deficiencies where there are none.

Dmitri

-- 
main(i,j){for(i=2;;i++){for(j=2;j<i;j++){if(!(i%j)){j=0;break;}}if
(j){printf("%d\n",i);}}} /*Dmitri Gribenko <gribozavr at gmail.com>*/