[cfe-commits] [PATCH] Structured comment parsing, retaining comments in AST

Tue Jun 19 10:47:27 PDT 2012

On Jun 15, 2012, at 3:59 PM, Dmitri Gribenko wrote:

> Hello,
> 
> Here is a new version of the patch, which addresses all the comments,
> except for one: I did not implement the optimized per-module search
> for comments.  Unfortunately, I did not found any documentation about
> how modules work in clang (and what are global, local IDs, submodules
> and how they differ from modules etc.)

Unfortunately, there is no documentation. I can help here.

> Changes from v1 include:
> 
> * new library libComments, which will contain all related lexing and
> parsing code;

The dependency here is that the AST library now depends on the Comments library. I think that's reasonable, because we want to have functions on the AST that extract comments from the corresponding declarations.

> * class RawComment, which represents a single comment;
> 
> * class RawCommentList, which is populated during parsing.  It merges
> consecutive comments on the fly.

Makes sense.

> * implements a warning "not a Doxygen member comment", which should
> catch typos //< and /*< (instead of ///< and /**<).  We had a few in
> our codebase, which I fixed manually: r158241, r158248.

Can you separate this warning into a subsequent patch? I'd rather get the basics in place before we start introducing warnings.

> Questions:
> * should I split RawCommentList.h into RawComment.h and RawCommentList.h?

No, I think it's fine this way.

> * should I remove the LANGOPT ParseComments?

Yes, please. The overhead is low enough that we can afford to at least track the comments in all compiles.

> Next steps:
> * Implementing lexer, parser and AST classes for the comment text.
> * Expose comment AST through libclang.
> * We will need semantic analysis as well, for example, to match \param
> names with actual function parameters.


Sounds great!

Some specific comments:

Index: include/clang/Basic/DiagnosticSemaKinds.td
===================================================================

--- include/clang/Basic/DiagnosticSemaKinds.td	(revision 158558)
+++ include/clang/Basic/DiagnosticSemaKinds.td	(working copy)
@@ -5633,5 +5633,10 @@
   "definition of %0 must be imported before it is required">;
 }
 
+let CategoryName = "Comment Issue" in {
+def warn_not_a_doxygen_after_member_comment : Warning<
+  "not a Doxygen member comment">, InGroup<DiagGroup<"doxygen">>;
+} // end of comment issue category
+
 } // end of sema component.
 
Please define a warning group in DiagnosticGroups.td for Doxygen warnings; I suspect we'll have a number of these. 

Is this simply a "Doxygen Issue" or "Documentation Issue", rather than a "Comment Issue"?

+class RawComment {
+public:
+  enum CommentKind {
+    CK_Invalid,      ///< Invalid comment
+    CK_OrdinaryBCPL, ///< Any normal BCPL comments
+    CK_OrdinaryC,    ///< Any normal C comment
+    CK_BCPLSlash,    ///< \code /// stuff \endcode
+    CK_BCPLExcl,     ///< \code //! stuff \endcode
+    CK_JavaDoc,      ///< \code /** stuff */ \endcode
+    CK_Qt,           ///< \code /*! stuff */ \endcode
+    CK_Merged        ///< Two or more Doxygen comments merged together
+  };
+

Note that /*! is also used by HeaderDoc; could you add that to the comment so we don't forget it?

+  SourceRange Range;
+  CommentKind Kind;
+  bool IsAfterMember;
+  bool IsAlmostAfterMemberComment;
+  StringRef RawText;
+
+  mutable unsigned BeginLine; ///< Cached line number
+  mutable unsigned EndLine; ///< Cached line number
+  mutable bool BeginLineValid; ///< True if BeginLine is valid
+  mutable bool EndLineValid; ///< True if EndLine is valid

Please pack the various integer/bool/enum data fields into bitfields, because the size of RawComment does have an impact on the size of the AST.

+  /// \brief All comments in this translation unit.
+  RawCommentList Comments;
+  //DenseMap<unsigned, RawCommentList> Comments;

This commented-out DenseMap can go away?

+  /// \brief Cursors for comments blocks.
+  SmallVector<std::pair<llvm::BitstreamCursor,
+                        serialization::ModuleFile *>, 8> CommentsCursors;
+

This belongs in the Serialization library; the AST itself shouldn't know anything about cursors or the serialization format. I should call back via the ExternalASTSource when it needs information. The "CommentsLoaded" bit is fine, though; we do that often to optimize away extraneous calls into the external AST source.

+  /// \brief Return the Doxygen-style comment attached to a given declaration,
+  /// without looking into cache.
+  const RawComment *getRawCommentForDeclNoCache(const Decl *D) const;

It's just a documentation comment, not specifically a Doxygen comment, right?

 /**
+ * \brief Given a cursor that represents a declaration, return the associated
+ * comment's source range.
+ */
+CINDEX_LINKAGE CXSourceRange clang_Cursor_getCommentRange(CXCursor C);
+

Should this comment say something about this range covering multiple adjacent comments?

+  // First check whether we have a comment that comes after a member decl.
+  if (Comment != RawComments.end() &&
+      Comment->isDoxygen() && Comment->isAfterMember() &&
+      !isa<TagDecl>(D) && !isa<NamespaceDecl>(D)) {

Why the checks against tags and namespaces? Should this also/instead be checking isa<TagDecl>(D->getDeclContext()) || isa<ObjCContainerDecl>(D->getDeclContext())?

+        case COMMENTS_RAW_COMMENT: {
+          unsigned Idx = 0;
+          SourceRange SR = ReadSourceRange(F, Record, Idx);
+          bool Merged = Record[Idx++];
+          Context.addComment(RawComment(SourceMgr, SR, Merged));
+          break;

This is going to cause us to pull in the contents of each of the files where a raw comment occurred, as soon as we load the raw comments from the AST file. That's going to be rather expensive, I suspect. Alternatively, we could store more in the AST file (including the Kind, IsAfterMember, and IsAlmostAfterMemberComment values) and lazily compute RawText only when it's actually requested... which should be very rare.

Also, I think you want to put all of the comments read from the AST file at the beginning of the RawCommentList, because right now they could end up getting mixed together with comments collecting while parsing the current translation unit. This way, the ordering will be correct for precompiled headers. (Modules are a separable issue)

This is looking great!

	- Doug