[cfe-commits] [PATCH] Lexer for structured comments

Douglas Gregor dgregor at apple.com
Mon Jun 25 10:45:16 PDT 2012


On Jun 22, 2012, at 3:01 PM, Dmitri Gribenko <gribozavr at gmail.com> wrote:

> On Fri, Jun 22, 2012 at 11:21 AM, Dmitri Gribenko <gribozavr at gmail.com> wrote:
>> Hello,
>> 
>> The attached patch implements a lexer for structured comments.  Highlights:
>> 
>> * The lexer is Doxygen-oriented.  I know that we should be loose in
>> what we accept in comments, but I mainly focused on Doxygen
>> constructs.
>> 
>> * Embedded HTML lexer produces very "fat" tokens, e.g., "<tag>" or
>> 'attr="value"' are single tokens.  This can be changed later to more
>> fine-grained tokens if needed.
>> 
>> * Markdown and automatic link generation (to function declarations)
>> are not supported.  I think that these should be added at parser level
>> with regexps.
>> 
>> Please review.
> 
> Here's an updated patch that includes a simple parser.  BriefParser
> extracts the \brief paragraph if it can find it.  Otherwise it returns
> the first paragraph.
> 
> There is also a new libclang API to do that:
> CXString clang_Cursor_getBriefCommentText(CXCursor C);
> 
> Relevant modifications also done to c-index-test so that this feature
> is testable (tests also added).
> 
> Please review.


Index: include/clang/AST/RawCommentList.h
===================================================================
--- include/clang/AST/RawCommentList.h	(revision 159034)
+++ include/clang/AST/RawCommentList.h	(working copy)
@@ -91,12 +91,22 @@
   unsigned getBeginLine(const SourceManager &SM) const;
   unsigned getEndLine(const SourceManager &SM) const;
 
+  StringRef getBriefText(const SourceManager &SourceMgr) const {
+    if (BriefTextValid)
+      return BriefText;
+
+    return extractBriefText(SourceMgr);
+  }
+
 private:
   SourceRange Range;
 
   mutable StringRef RawText;
-  mutable bool RawTextValid : 1; ///< True if RawText is valid
+  mutable std::string BriefText;

Should BriefText be ASTContext-allocated, so that RawComment can stay POD-like?
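A minimal sketch of what ASTContext allocation would look like: the cached brief text becomes a plain pointer into allocator-owned memory, so the struct keeps a trivial copy/destroy. The allocator and field names here are illustrative stand-ins, not the patch's actual API:

```cpp
#include <cassert>
#include <cstring>
#include <memory>
#include <string>
#include <type_traits>
#include <vector>

// Hypothetical stand-in for an ASTContext-style bump allocator.
struct BumpAllocator {
  std::vector<std::unique_ptr<char[]>> Slabs;
  const char *copyString(const std::string &S) {
    auto Buf = std::make_unique<char[]>(S.size() + 1);
    std::memcpy(Buf.get(), S.c_str(), S.size() + 1);
    Slabs.push_back(std::move(Buf));
    return Slabs.back().get();
  }
};

// The comment object holds no non-trivial members: the lazily computed
// brief text is just a pointer into allocator-owned memory.
struct RawCommentSketch {
  const char *BriefText = nullptr; // null until computed
  const char *getBriefText(BumpAllocator &Ctx, const std::string &Extracted) {
    if (!BriefText)
      BriefText = Ctx.copyString(Extracted);
    return BriefText;
  }
};

static_assert(std::is_trivially_copyable<RawCommentSketch>::value,
              "RawComment stays POD-like");
```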

+namespace tok {
+enum TokenKind {
+  eof,
+  newline,
+  text,
+  command,
+  verbatim_block_begin,
+  verbatim_block_line,
+  verbatim_block_end,
+  verbatim_line,
+  html_tag_open_name,
+  html_tag_open_attr,

I find the choice of html_tag_open_name/html_tag_open_attr interesting. Presumably, you're doing this so that we don't need to have '=' and string literals as token kinds (because we generally don't want them as token kinds), and that makes sense. However, shouldn't there be an html_tag_open_end token as well, to capture the '>'? Otherwise, the parser has to either intuit the ending tag based on having a non-html_tag_open* token or poke at the last character in each html_tag_open_* token.
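To make the workaround concrete, here is roughly what the parser ends up doing without a dedicated end token — peeking at the last character of each html_tag_open_* token's text. Token spellings are illustrative:

```cpp
#include <cassert>
#include <string_view>

// Without an html_tag_open_end token kind, the parser must detect the end
// of an open tag by inspecting the last character of each token's text.
bool closesOpenTag(std::string_view TokenText) {
  return !TokenText.empty() && TokenText.back() == '>';
}
```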

+  /// Contains text value associated with a token.
+  StringRef TextValue1;
+
+  /// Contains text value associated with a token.
+  StringRef TextValue2;

It would be nice if Token were a POD type. Could we just store ptr/length for TextValue1 and TextValue2, so that Token remains a POD?
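A sketch of the suggested layout, assuming we store raw pointer/length pairs and rebuild a view on demand (field names are illustrative; std::string_view stands in for StringRef):

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <type_traits>

// Token stays trivially constructible and copyable: only raw
// pointer/length fields, with accessors that rebuild the view on demand.
struct Token {
  const char *TextPtr1;
  size_t TextLen1;
  const char *TextPtr2;
  size_t TextLen2;

  std::string_view getTextValue1() const { return {TextPtr1, TextLen1}; }
  std::string_view getTextValue2() const { return {TextPtr2, TextLen2}; }

  void setTextValue1(std::string_view S) {
    TextPtr1 = S.data();
    TextLen1 = S.size();
  }
};

static_assert(std::is_trivial<Token>::value, "Token stays a POD");
```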

I see that you don't have location information for the '=' or for the value in TextValue2, although I guess you can intuit it from the length?

Index: tools/libclang/CIndex.cpp
===================================================================
--- tools/libclang/CIndex.cpp	(revision 159034)
+++ tools/libclang/CIndex.cpp	(working copy)
@@ -5707,7 +5707,25 @@
 
 } // end: extern "C"
 
+CXString clang_Cursor_getBriefCommentText(CXCursor C) {
+  if (!clang_isDeclaration(C.kind))
+    return createCXString((const char *) NULL);
 
+  const Decl *D = getCursorDecl(C);
+  ASTContext &Context = getCursorContext(C);
+  const RawComment *RC = Context.getRawCommentForDecl(D);
+
+  if (RC && RC->isDoxygen()) {

I was about to ask "Why shouldn't this work with JavaDoc or Qt/HeaderDoc comments?", but the only issue here is that "isDoxygen()" isn't a good name choice. How about simply "isDocumentation()"?

+//===- unittests/AST/CommentLexer.cpp ------ Comment lexer tests ----------===//
+//

Thank you!

+  /// Registered verbatim-like block commands.
+  VerbatimBlockCommandVector VerbatimBlockCommands;
<snip>
+  /// Registered verbatim-like line commands.
+  VerbatimLineCommandVector VerbatimLineCommands;
<snip>
+  /// \brief Register a new verbatim block command.
+  void addVerbatimBlockCommand(StringRef BeginName, StringRef EndName);
+
+  /// \brief Register a new verbatim line command.
+  void addVerbatimLineCommand(StringRef Name);

Do we need this much generality? It seems like a StringSwitch for each case (verbatim block and verbatim line) would suffice for now, and we could optimize that later by TableGen'ing the string matcher for the various Doxygen/HeaderDoc/Qt/JavaDoc command names.
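Roughly what the StringSwitch version would look like — in-tree this would use llvm::StringSwitch, which plain comparisons stand in for here, and the command set below is a small illustrative subset of Doxygen's, not the patch's registered list:

```cpp
#include <cassert>
#include <string_view>

// Hard-coded matching replaces the runtime registration machinery.
bool isVerbatimBlockCommand(std::string_view Name, std::string_view &EndName) {
  if (Name == "code")     { EndName = "endcode";     return true; }
  if (Name == "verbatim") { EndName = "endverbatim"; return true; }
  if (Name == "htmlonly") { EndName = "endhtmlonly"; return true; }
  if (Name == "dot")      { EndName = "enddot";      return true; }
  return false;
}

bool isVerbatimLineCommand(std::string_view Name) {
  return Name == "fn" || Name == "var" || Name == "typedef" ||
         Name == "defgroup";
}
```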

+  while (TokenPtr != CommentEnd) {
+    switch(*TokenPtr) {
+      case '\\':
+      case '@': {
+        TokenPtr++;
+        if (TokenPtr == CommentEnd) {
+          formTokenWithChars(T, TokenPtr, tok::text);
+          T.setText(StringRef(BufferPtr - T.getLength(), T.getLength()));
+          return;
+        }
+        char C = *TokenPtr;
+        if (StringRef("\\@&$#<>%\".:").find(C) != StringRef::npos) {

This 'if' would almost surely be more efficient as a switch. More importantly, it should only apply to the '\\' case, and not the '@' case.
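For reference, the switch form of that character test (the escapable-character set is copied from the quoted 'if'; the function name is illustrative):

```cpp
#include <cassert>

// Equivalent to StringRef("\\@&$#<>%\".:").find(C) != npos, but the
// optimizer can lower a switch to a jump table or bitmask test.
bool isCommentEscapeChar(char C) {
  switch (C) {
  case '\\': case '@': case '&': case '$': case '#':
  case '<':  case '>': case '%': case '"': case '.': case ':':
    return true;
  default:
    return false;
  }
}
```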

+        // Hardcoded support for lexing \f$ \f[ \f] \f{ \f} as a single
+        // command.
+        if (Length == 1 && TokenPtr[-1] == 'f' && TokenPtr != CommentEnd) {
+          C = *TokenPtr;
+          if (C == '$' || C == '[' || C == ']' || C == '{' || C == '}') {
+            TokenPtr++;
+            Length++;
+          }
+        }
+

Could you mention the term 'formula' here, so the reader knows what the heck '\f' actually means? Also, I suspect that the various \f[…] and \f(…) commands actually do paren-, brace-, and bracket-matching internally. Please add a FIXME here so we don't forget to deal with this later (but, of course, it's not at all important to solve at this point!).

+        if (isVerbatimBlockCommand(CommandName, EndName)) {
+          setupAndLexVerbatimBlock(T, TokenPtr, *BufferPtr, EndName);
+          return;
+        }
+        else if (isVerbatimLineCommand(CommandName)) {
+          lexVerbatimLine(T, TokenPtr);
+          return;
+        } else {
+          formTokenWithChars(T, TokenPtr, tok::command);
+          T.setCommandName(CommandName);
+          return;
+        }

Since each of the 'if' branches returns, you can remove the else's and de-nest the last bit, e.g.,

+        if (isVerbatimBlockCommand(CommandName, EndName)) {
+          setupAndLexVerbatimBlock(T, TokenPtr, *BufferPtr, EndName);
+          return;
+        }
+        if (isVerbatimLineCommand(CommandName)) {
+          lexVerbatimLine(T, TokenPtr);
+          return;
+        }
+        formTokenWithChars(T, TokenPtr, tok::command);
+        T.setCommandName(CommandName);
+        return;

+      default: {
+        while (true) {
+          TokenPtr++;
+          if (TokenPtr == CommentEnd)
+            break;
+          char C = *TokenPtr;
+          if(C == '\n' || C == '\r' ||
+             C == '\\' || C == '@' || C == '<')
+            break;
+        }

I hadn't realized it before, but now I see in the code that the string tokens returned by the lexer are not 'maximal', in the sense that they don't necessarily cover an entire string. For example, something like "foo \& bar" comes in as three tokens, "foo ", "&", and " bar". That makes sense, although it means the parser has to merge adjacent text tokens whenever it wants the complete string.
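One way a parser can cope with non-maximal text tokens is to coalesce runs of adjacent text tokens back into one. The token representation below is a simplified stand-in for the patch's Token class:

```cpp
#include <cassert>
#include <string>
#include <vector>

enum class Kind { Text, Command, Newline };

struct Tok {
  Kind K;
  std::string Text;
};

// Merge every run of consecutive text tokens into a single token.
std::vector<Tok> mergeTextTokens(const std::vector<Tok> &In) {
  std::vector<Tok> Out;
  for (const Tok &T : In) {
    if (T.K == Kind::Text && !Out.empty() && Out.back().K == Kind::Text)
      Out.back().Text += T.Text; // coalesce with the previous text token
    else
      Out.push_back(T);
  }
  return Out;
}
```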

+void Lexer::lexVerbatimBlockFirstLine(Token &T) {
+  assert(BufferPtr < CommentEnd);
+
+  // Extract current line.
+  const char *Newline = findNewline(BufferPtr, CommentEnd);
+  StringRef Line(BufferPtr, Newline - BufferPtr);
+
+  // Look for end command in current line.
+  size_t Pos = Line.find(VerbatimBlockEndCommandName);
+  const char *NextLine;

Performance-wise, it would be better for us to only scan the line text once, finding either the verbatim block-end command or the location of the newline. For now, a FIXME will suffice.
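A sketch of that single-pass scan, walking the buffer once and stopping at whichever comes first, the newline or the end command (signature and escape handling are illustrative, not the patch's code):

```cpp
#include <cassert>
#include <string_view>

// Returns the offset of the first newline or of the verbatim-block end
// command ("\endfoo" or "@endfoo"), whichever comes first; Text.size()
// if neither is found.
size_t findNewlineOrEndCommand(std::string_view Text,
                               std::string_view EndCommand) {
  for (size_t I = 0; I != Text.size(); ++I) {
    char C = Text[I];
    if (C == '\n' || C == '\r')
      return I;
    if ((C == '\\' || C == '@') &&
        Text.substr(I + 1, EndCommand.size()) == EndCommand)
      return I;
  }
  return Text.size();
}
```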

+void Lexer::lex(Token &T) {
+again:
+  switch (CommentState) {
+  case LCS_BeforeComment:
+    if (BufferPtr == BufferEnd) {
+      formTokenWithChars(T, BufferPtr, tok::eof);
+      return;
+    }
+
+    assert(*BufferPtr == '/');
+    BufferPtr++; // Skip first slash.
+    switch(*BufferPtr) {
+    case '/': { // BCPL comment.
+      BufferPtr++; // Skip second slash.
+
+      if (BufferPtr != BufferEnd) {
+        // Skip Doxygen magic marker.
+        const char C = *BufferPtr;
+        if (C == '/' || C == '!')
+          BufferPtr++;
+      }

Presumably, we have to perform this last 'if' check because the third '/' might be missing when '///<' is typo'd to '//<'? A comment here might help.

+    case '*': { // C comment.
+      BufferPtr++; // Skip star.
+
+      // Skip Doxygen magic marker.
+      const char C = *BufferPtr;
+      if ((C == '*' && *(BufferPtr + 1) != '/') || C == '!')
+        BufferPtr++;
+
+      // Skip less-than symbol that marks trailing comments.
+      if (BufferPtr != CommentEnd && *BufferPtr == '<')
+        BufferPtr++;

Should we (somewhere, at some point) eliminate lines of all '*'s in C comments, e.g.,

/**************
 * Docs
 **************/

? 

This seems not at all important now, but we should consider it for the future. (FIXME somewhere, perhaps).
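For that future FIXME, the test itself is simple enough: a line is pure decoration if, after ignoring whitespace, it contains only '*' characters. A minimal sketch (the function name is illustrative):

```cpp
#include <cassert>
#include <string>

// True if the line consists only of '*' and whitespace, with at least
// one '*' -- i.e., a decorative border line in a C comment.
bool isDecorativeLine(const std::string &Line) {
  bool SawStar = false;
  for (char C : Line) {
    if (C == '*')
      SawStar = true;
    else if (C != ' ' && C != '\t')
      return false; // any other character makes the line real content
  }
  return SawStar;
}
```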

+    // Turn any whitespace between comments (and there is only whitespace
+    // between them) into a newline.  We have two newlines between comments
+    // in total (first one was synthenized after a comment).

Typo "synthenized"

+        // Synthenize newline just after the C comment, regardless if there is
+        // actually a newline.

Same typo here.

This is looking great, thanks!

	- Doug

