[cfe-commits] [PATCH] Structured comment parsing, retaining comments in AST

Mon Jun 11 09:06:32 PDT 2012

Hi, Dmitri. Some comments on various parts of this approach:

> +  /// \brief Source ranges for all of the comments in the source file,
> +  /// sorted in order of appearance in the translation unit.
> +  std::vector<SourceRange> CommentSourceRanges;
> +
> +  /// \brief Mapping from declarations to their comments, once we have
> +  /// already looked up the comment associated with a given declaration.
> +  llvm::DenseMap<const Decl *, std::string> DeclComments;

I'm wondering what the rationale was behind having separate strings and source ranges here, rather than first-class Comment objects (and a map from Decl to Comment). I guess this way is a little lighter, but as we add functionality that might become a hindrance.

> @@ -389,7 +397,7 @@
>    void setPrintingPolicy(clang::PrintingPolicy Policy) {
>      PrintingPolicy = Policy;
>    }
> -  
> +
>    SourceManager& getSourceManager() { return SourceMgr; }
>    const SourceManager& getSourceManager() const { return SourceMgr; }
>    void *Allocate(unsigned Size, unsigned Align = 8) const {

Please watch for whitespace-only changes, even ones conforming to the LLVM standard.

> +private:
> +  bool isDoxygenComment(SourceRange Comment, bool Member = false);
> +

Unless this is needed in more than one source file, consider making this a static function instead of a private method, to keep the header file as trim as possible.

> +  /// \brief Return the Doxygen-style comment attached to a given declaration.
> +  const char *getCommentForDecl(const Decl *D);
> +

This should probably use StringRef.

> +    Comment = clang_Cursor_getComment(Cursor);
> +    CommentCString = clang_getCString(Comment);
> +    if (CommentCString != NULL && CommentCString[0] != '\0') {
> +      printf(" [Comment=");
> +      for ( ; *CommentCString; ++CommentCString) {
> +        if (*CommentCString != '\n')
> +          putchar(*CommentCString);
> +        else
> +          printf("\\n");
> +      }
> +      printf("]");
> +    }

Is \n the only thing worth escaping here? Consider tab, control characters, Unicode, etc. Probably we should just print this in the same way we print string literals (whatever we do for that).

> +  bool Invalid = false;
> +  const char *BufferStart
> +    = SourceMgr.getBufferData(SourceMgr.getFileID(Comment.getBegin()),
> +                              &Invalid).data();
> +  if (Invalid)
> +    return false;
> +
> +  const char *Start = BufferStart + SourceMgr.getFileOffset(Comment.getBegin());
> +  const char *End = BufferStart + SourceMgr.getFileOffset(Comment.getEnd());

I don't know if we currently support having a file broken up into chunks, but you can ask for the part of the file you need a lot more easily like this:

SourceMgr.getCharacterData(Comment.getBegin(), &Invalid);

I'm not sure offhand how to test that the end is at least three characters past the start, though.

> +namespace {
> +/// \breif Compare two non-overlapping source ranges.
> +class BeforeInTranslationUnit {

Typo: \brief. But there's a version of this for SourceLocations already in SourceManager.h (LocBeforeThanCompare). Maybe you should just templatize that.

> +    if (DeclLocDecomp.first == CommentBeginDecomp.first &&
> +        SourceMgr.getLineNumber(DeclLocDecomp.first, DeclLocDecomp.second)
> +          == SourceMgr.getLineNumber(CommentBeginDecomp.first,
> +                                     CommentBeginDecomp.second)) {

Instead of computing both line numbers, how about just using std::find to see if there are any newlines between the decl and the comment? One tricky thing is that both \n and \r count as newlines.

> +  // Compute the starting line for the declaration and for the end of the
> +  // comment (this is expensive).
> +  unsigned DeclLocLine
> +    = SourceMgr.getLineNumber(DeclLocDecomp.first, DeclLocDecomp.second);
> +  unsigned CommentEndLine
> +    = SourceMgr.getLineNumber(CommentEndDecomp.first,
> +                              CommentEndDecomp.second);

Ditto.

> +  // If the comment does not end on the line prior to the declaration, then
> +  // the comment is not associated with the declaration at all.
> +  // TODO: check if Doxygen does the same.
> +  if (CommentEndLine + 1 != DeclLocLine)
> +    return NULL;

I am pretty sure this is not the case; I think (from memory) that both of these count as attached:

/// \brief A class.

class A {};

///\brief Another class
// TODO: remove this!
class B {};

> +    Result.append(FileBufferStart + DecompStart.second,
> +                  FileBufferStart + DecompEnd.second + 1);

This includes all the slashes and/or continuation stars, yes? Eventually we'll probably want to remove those from the middle of the buffer. Also, Doxygen considers blank lines to be significant, IIRC.

Waiting to see where this goes!
Jordan