[cfe-commits] [PATCH] Structured comment parsing, retaining comments in AST

Tue Jun 19 16:36:40 PDT 2012

Hi Doug,

Thank you for the review.  Responses inline.  Patch attached.

On Tue, Jun 19, 2012 at 10:47 AM, Douglas Gregor <dgregor at apple.com> wrote:
>> * implements a warning "not a Doxygen member comment", which should
>> catch typos //< and /*< (instead of ///< and /**<).  We had a few in
>> our codebase, which I fixed manually: r158241, r158248.
>
> Can you separate this warning into a subsequent patch? I'd rather get the basics in place before we start introducing warnings.

Removed this warning from this patch.

>> * should I remove the LANGOPT ParseComments?
>
> Yes, please. The overhead is low enough that we can afford to at least track the comments in all compiles.

Done.

> Index: include/clang/Basic/DiagnosticSemaKinds.td
> ===================================================================
> --- include/clang/Basic/DiagnosticSemaKinds.td  (revision 158558)
> +++ include/clang/Basic/DiagnosticSemaKinds.td  (working copy)
> @@ -5633,5 +5633,10 @@
>   "definition of %0 must be imported before it is required">;
>  }
>
> +let CategoryName = "Comment Issue" in {
> +def warn_not_a_doxygen_after_member_comment : Warning<
> +  "not a Doxygen member comment">, InGroup<DiagGroup<"doxygen">>;
> +} // end of comment issue category
> +
>  } // end of sema component.
>
> Please define a warning group in DiagnosticGroups.td for Doxygen warnings; I suspect we'll have a number of these.
>
> Is this simply a "Doxygen Issue" or "Documentation Issue", rather than a "Comment Issue"?

We can say that it is a "Documentation issue", thus side-stepping the
issue of hardcoding any particular names into clang.

> +class RawComment {
> +public:
> +  enum CommentKind {
> +    CK_Invalid,      ///< Invalid comment
> +    CK_OrdinaryBCPL, ///< Any normal BCPL comments
> +    CK_OrdinaryC,    ///< Any normal C comment
> +    CK_BCPLSlash,    ///< \code /// stuff \endcode
> +    CK_BCPLExcl,     ///< \code //! stuff \endcode
> +    CK_JavaDoc,      ///< \code /** stuff */ \endcode
> +    CK_Qt,           ///< \code /*! stuff */ \endcode
> +    CK_Merged        ///< Two or more Doxygen comments merged together
> +  };
> +
>
> Note that /*! is also used by HeaderDoc; could you add that to the comment so we don't forget it?

Done.

> +  SourceRange Range;
> +  CommentKind Kind;
> +  bool IsAfterMember;
> +  bool IsAlmostAfterMemberComment;
> +  StringRef RawText;
> +
> +  mutable unsigned BeginLine; ///< Cached line number
> +  mutable unsigned EndLine; ///< Cached line number
> +  mutable bool BeginLineValid; ///< True if BeginLine is valid
> +  mutable bool EndLineValid; ///< True if EndLine is valid
>
> Please pack the various integer/bool/enum data fields into bitfields, because the size of RawComment does have an impact on the size of the AST.

Packed bool and enum fields together, but didn't touch unsigned
members because they are line numbers.

> +  /// \brief All comments in this translation unit.
> +  RawCommentList Comments;
> +  //DenseMap<unsigned, RawCommentList> Comments;
>
> This commented-out DenseMap can go away?

Done.

> +  /// \brief Cursors for comments blocks.
> +  SmallVector<std::pair<llvm::BitstreamCursor,
> +                        serialization::ModuleFile *>, 8> CommentsCursors;
> +
>
> This belongs in the Serialization library; the AST itself shouldn't know anything about cursors or the serialization format. I should call back via the ExternalASTSource when it needs information. The "CommentsLoaded" bit is fine, though; we do that often to optimize away extraneous calls into the external AST source.

Moved CommentsCursors from ASTContext to ASTReader.

> +  /// \brief Return the Doxygen-style comment attached to a given declaration,
> +  /// without looking into cache.
> +  const RawComment *getRawCommentForDeclNoCache(const Decl *D) const;
>
> It's just a documentation comment, not specifically a Doxygen comment, right?
>
>  /**
> + * \brief Given a cursor that represents a declaration, return the associated
> + * comment's source range.
> + */
> +CINDEX_LINKAGE CXSourceRange clang_Cursor_getCommentRange(CXCursor C);
> +
>
> Should this comment say something about this range covering multiple adjacent comments?

Rephrased:
/**
 * \brief Given a cursor that represents a declaration, return the associated
 * comment's source range.
 */
CINDEX_LINKAGE CXSourceRange clang_Cursor_getCommentRange(CXCursor C);

/**
 * \brief Given a cursor that represents a declaration, return the associated
 * comment text, including comment markers.  The range may include multiple
 * consecutive comments with whitespace in between.
 */
CINDEX_LINKAGE CXString clang_Cursor_getRawCommentText(CXCursor C);

> +  // First check whether we have a comment that comes after a member decl.
> +  if (Comment != RawComments.end() &&
> +      Comment->isDoxygen() && Comment->isAfterMember() &&
> +      !isa<TagDecl>(D) && !isa<NamespaceDecl>(D)) {
>
> Why the checks against tags and namespaces? Should this also/instead be checking isa<TagDecl>(D->getDeclContext()) || isa<ObjCContainerDecl>(D->getDeclContext())?

The reason for this check is that we don't allow TagDecls and
NamespaceDecls to have "after member" ///< comments.  (Although --
just checked -- Doxygen allows them for TagDecls.  But that's
confusing to write in source, I don't think it is logical.)

> +        case COMMENTS_RAW_COMMENT: {
> +          unsigned Idx = 0;
> +          SourceRange SR = ReadSourceRange(F, Record, Idx);
> +          bool Merged = Record[Idx++];
> +          Context.addComment(RawComment(SourceMgr, SR, Merged));
> +          break;
>
> This is going to cause us to pull in the contents of each of the files where a raw comment occurred, as soon as we load the raw comments from the AST file. That's going to be rather expensive, I suspect. Alternatively, we could store more in the AST file (including the Kind, IsAfterMember, and IsAlmostAfterMemberComment values) and lazily compute RawText only when it's actually requested... which should be very rare.

Done.

> Also, I think you want to put all of the comments read from the AST file at the beginning of the RawCommentList, because right now they could end up getting mixed together with comments collecting while parsing the current translation unit. This way, the ordering will be correct for precompiled headers. (Modules are a separable issue)

Done.

Question:

* "After member comment" sounds awkward (but that's how it is called
once in Doxygen manual).  Maybe "trailing comment" is better?

Dmitri

-- 
main(i,j){for(i=2;;i++){for(j=2;j<i;j++){if(!(i%j)){j=0;break;}}if
(j){printf("%d\n",i);}}} /*Dmitri Gribenko <gribozavr at gmail.com>*/
-------------- next part --------------
A non-text attachment was scrubbed...
Name: retain-comments-in-ast-v3.patch
Type: application/octet-stream
Size: 53370 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/cfe-commits/attachments/20120619/9aae4401/attachment.obj>