[cfe-commits] [PATCH] Structured comment parsing, retaining comments in AST

Douglas Gregor dgregor at apple.com
Mon Jun 11 08:32:28 PDT 2012


On Jun 8, 2012, at 1:46 PM, Dmitri Gribenko <gribozavr at gmail.com> wrote:

> Hello,
> 
> I'm working on getting clang parse Doxygen comments and expose them in
> AST, via libclang APIs etc.
> 
> The first step is to save comments during parsing.  Most of this work
> was already done by Doug Gregor, [1] but it was reverted because it
> was not used. [2]  I modified this patch so that it applies to ToT.
> 
> The patch exposes raw comment text via libclang so that the feature
> can be tested with c-index-test.

Great!

> Other modifications include:
> * Introduced LangOpt.ParseComments.  As we don't want to regress on
> parsing time or increase memory footprint during normal compilation,
> comments are only saved when this option is enabled.

Did you measure any impact on parsing time or memory footprint? Back when I committed [1], the only cost I had found was the increase in PCH size. I would very much like comments to always be available in the AST, and if the cost to performance and memory use is negligible (which I suspect it is), I'd rather not have a flag for "keep comments" at all because it makes it harder to actually use this feature.  As for the PCH size regression noted in the removal [2], I suspect that we could improve on that a bit by combining the source ranges of adjacent // comments into a single comment.

Now, we may eventually want to actually look at the contents of those comments, and *that* is very likely to cost us performance. However, that's the kind of thing that would affect warnings or other tools, not fundamentally change the AST, so I expect those to be under the control of various flags.  

>  I made this
> option a LANGOPT, but maybe it should be a BENIGN_LANGOPT?

If we keep it, it should be a BENIGN_LANGOPT, because the presence or absence of comments does not affect the language semantics or ASTs in some incompatible way.

> * Reworked AST serialization/deserialization because approach to that
> has changed since [1] was written.
> 
> * Renamed Preprocessor::{Add,Remove}CommentHandler to match coding standards.
> 
> * Added lots of tests.

Some comments:

@@ -375,7 +375,15 @@
 
   const TargetInfo *Target;
   clang::PrintingPolicy PrintingPolicy;
-  
+
+  /// \brief Source ranges for all of the comments in the source file,
+  /// sorted in order of appearance in the translation unit.
+  std::vector<SourceRange> CommentSourceRanges;
+
+  /// \brief Mapping from declarations to their comments, once we have
+  /// already looked up the comment associated with a given declaration.
+  llvm::DenseMap<const Decl *, std::string> DeclComments;
+

I think the value type be StringRef, pointing into ASTContext-allocated memory, so that we can avoid separate allocations for each documentation string we query. As a micro-optimization, this would also allow the StringRef to simply point back into the mmap'd file in the cases where we have a single-line comment, e.g,

	/// My favorite function

Of course, we'd still need a copy for multi-line comments like this

  /**
     *
     */

but I bet we'd still get a decent savings from optimizing the single-line case.

+  /// \brief Return the Doxygen-style comment attached to a given declaration.
+  const char *getCommentForDecl(const Decl *D);

This should also return a StringRef, which will be safe once you're using StringRefs for the DeclComments data structure.

/**
+ * \brief Given a cursor that represents a declaration, return the associated
+ * comment.
+ */
+CINDEX_LINKAGE CXString clang_Cursor_getComment(CXCursor C);

Given that we're going to want to provide a function that returns a structured representation of the comment, and that function is what we will want people to be using most of the time, can we put either "raw" or "unprocessed" or "unparsed" into the name somewhere?

	CINDEX_LINKAGE CXString clang_cursor_getRawComment(CXCursor C);

?


@@ -398,6 +402,19 @@
       if (!clang_equalRanges(CursorExtent, RefNameRange))
         PrintRange(RefNameRange, "RefName");
     }
+
+    Comment = clang_Cursor_getComment(Cursor);
+    CommentCString = clang_getCString(Comment);
+    if (CommentCString != NULL && CommentCString[0] != '\0') {
+      printf(" [Comment=");
+      for ( ; *CommentCString; ++CommentCString) {
+        if (*CommentCString != '\n')
+          putchar(*CommentCString);
+        else
+          printf("\\n");
+      }
+      printf("]");
+    }

You'll want a clang_disposeString() call once you're done with CommentCString.

+
+CXString clang_Cursor_getComment(CXCursor C) {
+  if (!clang_isDeclaration(C.kind))
+    return createCXString((const char *) NULL);
+
+  const Decl *D = getCursorDecl(C);
+  ASTContext &Context = getCursorContext(C);
+
+  return createCXString(Context.getCommentForDecl(D));
+}

… and with ASTContext:: getCommentForDecl() returning a StringRef into memory owned by the ASTContext, you can pass DupString=false for this call to createCXString, eliminating the extra copy here. 

+namespace {
+/// \breif Compare two non-overlapping source ranges.
+class BeforeInTranslationUnit {
+  SourceManager *SourceMgr;
+
+public:
+  explicit BeforeInTranslationUnit(SourceManager *SM) : SourceMgr(SM) { }
+
+  bool operator()(SourceRange X, SourceRange Y) {
+    return SourceMgr->isBeforeInTranslationUnit(X.getBegin(), Y.getBegin());
+  }
+};

Typo "\breif". Don't we have one of these function objects around already?

+  // If the declaration doesn't map directly to a location in a file, we
+  // can't find the comment.
+  // TODO: use the location of the identifier being declared, this way we can
+  // find after-member comments more precisely when there are multiple
+  // identifiers declared in one declaration.
+  SourceLocation DeclLoc = D->getLocStart();
+  if (DeclLoc.isInvalid() || !DeclLoc.isFileID())
+    return NULL;

Regarding this TODO, if you simply use D->getLocation(), it will return the location of the name of the declaration (if there is one), which may just do what you want. There are some advantages to using the starting location, though, because it's often on the line before the name

	int
	X::foo

The trick will be to handle both this and the multi-declarator case well.

Index: lib/Serialization/ASTReader.cpp
===================================================================
--- lib/Serialization/ASTReader.cpp	(revision 158158)
+++ lib/Serialization/ASTReader.cpp	(working copy)
@@ -2426,7 +2426,14 @@
       }
       break;
     }
+
+    case COMMENT_SOURCE_RANGES: {
+      Context.CommentSourceRanges.reserve(Record.size());
+      for (unsigned Idx = 0; Idx < Record.size(); )
+        Context.CommentSourceRanges.push_back(ReadSourceRange(F, Record, Idx));
+      break;
     }
+    }
   }
   Error("premature end of bitstream in AST file");
   return Failure;

Please make the deserialization of comment ranges lazy, so that we don't pay the cost of deserializing comments unless we actually use them. There are two typical approaches to doing this within the AST reader/writer:

	a) Write out the comment source range data array as a single blob, which essentially just dumps the bytes of the array into the AST file. The reader then gets them mmap'd in as an array. This is good for large arrays where the cost of deserialization matters and we're prone to random access (e.g., the mapping from declaration ID to its location in the file), but it takes a lot of space in the AST file.

	b) Introduce a new block for the comment source range data, and when we initially see that block, stash its cursor somewhere. Later on, when we need that information, use the cursor to actually deserialize the contents of the block. It's still lazy.

I think (b) is best for this application, because we don't want to waste tons of space in the AST file for something that isn't going to be used by all clients (in particular, the compiler proper doesn't need it). 

	- Doug



More information about the cfe-commits mailing list