[cfe-commits] [PATCH] Lexer for structured comments

Douglas Gregor dgregor at apple.com
Tue Jun 26 11:30:15 PDT 2012


On Jun 25, 2012, at 2:37 PM, Dmitri Gribenko <gribozavr at gmail.com> wrote:

> Hi Doug,
> 
> Thank you for the review!  Comments inline, updated patch attached.
> 
> On Mon, Jun 25, 2012 at 10:45 AM, Douglas Gregor <dgregor at apple.com> wrote:
>> +namespace tok {
>> +enum TokenKind {
>> +  eof,
>> +  newline,
>> +  text,
>> +  command,
>> +  verbatim_block_begin,
>> +  verbatim_block_line,
>> +  verbatim_block_end,
>> +  verbatim_line,
>> +  html_tag_open_name,
>> +  html_tag_open_attr,
>> 
>> I find the choice of html_tag_open_name/html_tag_open_attr interesting. Presumably, you're doing this so that we don't need to have '=' and string literals as token kinds (because we generally don't want them as token kinds), and that makes sense. However, shouldn't there be an html_tag_open_end token as well, to capture the '>'? Otherwise, the parser has to either intuit the ending tag based on having a non-html_tag_open* token or poke at the last character in each html_tag_open_* token.
> 
> The idea was to eat html_tag_open_attr until we see a
> non-html_tag_open_attr token.
> 
>> I see that you don't have location information for the '=' or for the value in TextValue2, although I guess you can intuit it from the length?
> 
> Oh, the locations for '=' and '>' are lost...  Seems like I'll need to
> introduce '<', html_ident, '=', html_quoted_string, '>' as tokens in
> html state.  Done.

I like this a lot better, thanks!
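To make that concrete: with those tokens, something like <a href="foo"> would lex as

  '<'
  html_ident          "a"
  html_ident          "href"    (assuming attribute names also come back as html_ident)
  '='
  html_quoted_string  "foo"
  '>'

with source locations available for every piece.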

>> +  /// Registered verbatim-like block commands.
>> +  VerbatimBlockCommandVector VerbatimBlockCommands;
>> <snip>
>> +  /// Registered verbatim-like line commands.
>> +  VerbatimLineCommandVector VerbatimLineCommands;
>> <snip>
>> +  /// \brief Register a new verbatim block command.
>> +  void addVerbatimBlockCommand(StringRef BeginName, StringRef EndName);
>> +
>> +  /// \brief Register a new verbatim line command.
>> +  void addVerbatimLineCommand(StringRef Name);
>> 
>> Do we need this much generality? It seems like a StringSwitch for each case (verbatim block and verbatim line) would suffice for now, and we could optimize that later by TableGen'ing the string matcher for the various Doxygen/HeaderDoc/Qt/JavaDoc command names.
> 
> Converted to StringSwitch, but retained the vector search.  I think we
> need both: a string matcher to handle the predefined commands and a
> vector to handle commands registered dynamically.

Are we expecting that commands registered dynamically will have their own parsers that build custom AST nodes?

> But the StringSwitch should go, because the upcoming parsing and
> semantic analysis need detailed knowledge of command properties, so the
> command lists would end up duplicated here and there.  We'll need to
> TableGen this eventually.

Yes, we'll want to tablegen this, for performance if nothing else. Obviously, it doesn't have to happen now.
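For the record, the interim matcher can be quite small. A sketch (the command names shown are just a sample, and the helper itself is illustrative, not part of the patch):

  // Illustrative only: match a few predefined verbatim block commands
  // and return the corresponding end-command name.
  #include "llvm/ADT/StringRef.h"
  #include "llvm/ADT/StringSwitch.h"
  using llvm::StringRef;

  static bool isVerbatimBlockCommand(StringRef BeginName, StringRef &EndName) {
    const char *Result = llvm::StringSwitch<const char *>(BeginName)
      .Case("code", "endcode")
      .Case("verbatim", "endverbatim")
      .Case("htmlonly", "endhtmlonly")
      .Default(NULL);
    if (!Result)
      return false;
    EndName = Result;
    return true;
  }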

>> +  while (TokenPtr != CommentEnd) {
>> +    switch(*TokenPtr) {
>> +      case '\\':
>> +      case '@': {
>> +        TokenPtr++;
>> +        if (TokenPtr == CommentEnd) {
>> +          formTokenWithChars(T, TokenPtr, tok::text);
>> +          T.setText(StringRef(BufferPtr - T.getLength(), T.getLength()));
>> +          return;
>> +        }
>> +        char C = *TokenPtr;
>> +        if (StringRef("\\@&$#<>%\".:").find(C) != StringRef::npos) {
>> 
>> This 'if' would almost surely be more efficient as a switch. More importantly, it should only apply to the '\\' case, and not the '@' case.
> 
> (a) Converted to switch. (b) Doxygen allows @ as escape character, for
> example @\c will yield \c.

I didn't know that, thanks!
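For reference, the escape check as a switch, pulled out into a helper (sketch; the character set is the one from the patch above):

  // Sketch: characters that may follow '\' or '@' to form an escaped
  // character (set taken from the patch).
  static bool isCommentEscapeChar(char C) {
    switch (C) {
    case '\\': case '@': case '&': case '$':
    case '#':  case '<': case '>': case '%':
    case '"':  case '.': case ':':
      return true;
    default:
      return false;
    }
  }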

>> Also, I suspect that the various \f[…] and \f(…) commands actually do paren-, brace-, and bracket-matching internally. Please add a FIXME here so we don't forget to deal with this later (but, of course, it's not at all important to solve at this point!).
> 
> Doxygen doesn't check that.  And it might actually be harmful to do
> so, because a complex LaTeX formula (one using \phantom, for example)
> may have unbalanced parentheses.

Good point.
> 
>> +      default: {
>> +        while (true) {
>> +          TokenPtr++;
>> +          if (TokenPtr == CommentEnd)
>> +            break;
>> +          char C = *TokenPtr;
>> +          if(C == '\n' || C == '\r' ||
>> +             C == '\\' || C == '@' || C == '<')
>> +            break;
>> +        }
>> 
>> I hadn't realized it before, but now I see in the code that the string tokens returned by the lexer are not 'maximal', in the sense that they don't necessarily cover an entire string. For example, something like "foo \& bar" comes in as three tokens, "foo ", "&", and " bar". That makes
> 
> Seems like the sentence ends unexpectedly here, but I got the point.
> However, returning maximal tokens would require allocating memory for
> the unescaped string (currently we get away with returning pointers
> into the source file).

That makes sense.
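(Spelled out: for "foo \& bar", all three text tokens can stay as StringRefs into the buffer, e.g.

  const char *Buf = "foo \\& bar";  // the comment text, 10 characters
  llvm::StringRef T1(Buf, 4);       // "foo "
  llvm::StringRef T2(Buf + 5, 1);   // "&", pointing just past the '\'
  llvm::StringRef T3(Buf + 6, 4);   // " bar"

whereas the maximal token "foo & bar" is not contiguous anywhere in the source, so it would have to be copied into newly allocated memory.)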

>> +    case '*': { // C comment.
>> +      BufferPtr++; // Skip star.
>> +
>> +      // Skip Doxygen magic marker.
>> +      const char C = *BufferPtr;
>> +      if ((C == '*' && *(BufferPtr + 1) != '/') || C == '!')
>> +        BufferPtr++;
>> +
>> +      // Skip less-than symbol that marks trailing comments.
>> +      if (BufferPtr != CommentEnd && *BufferPtr == '<')
>> +        BufferPtr++;
>> 
>> Should we (somewhere, at some point) eliminate lines of all '*'s in C comments, e.g.,
>> 
>> /**************
>>  * Docs
>>  **************/
>> 
>> ?
>> 
>> This seems not at all important now, but we should consider it for the future. (FIXME somewhere, perhaps).
> 
> I think that it should be done while post-processing the comment text
> (in CommentSema, I think).  Actually, this looks very much like
> Markdown 'rulers'.

Okay.
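For when we get to it, the ruler check itself should be trivial, something like (sketch; the helper name is invented):

  // Sketch: a "ruler" line contains at least one '*' and nothing but
  // '*', spaces, and tabs.
  static bool isAsteriskRuler(llvm::StringRef Line) {
    bool SawStar = false;
    for (unsigned i = 0, e = Line.size(); i != e; ++i) {
      if (Line[i] == '*')
        SawStar = true;
      else if (Line[i] != ' ' && Line[i] != '\t')
        return false;
    }
    return SawStar;
  }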

Updated patch looks great!

	- Doug
