[cfe-commits] [PATCH] Structured comment parsing, retaining comments in AST

Tue Jun 12 18:32:06 PDT 2012

Hi Doug,

Thank you for comments.  Responses inline.

On Mon, Jun 11, 2012 at 8:32 AM, Douglas Gregor <dgregor at apple.com> wrote:
> On Jun 8, 2012, at 1:46 PM, Dmitri Gribenko <gribozavr at gmail.com> wrote:
>> Other modifications include:
>> * Introduced LangOpt.ParseComments.  As we don't want to regress on
>> parsing time or increase memory footprint during normal compilation,
>> comments are only saved when this option is enabled.
>
> Did you measure any impact on parsing time or memory footprint? Back when I committed [1], the only cost I had found was the increase in PCH size. I would very much like comments to always be available in the AST, and if the cost to performance and memory use is negligible (which I suspect it is), I'd rather not have a flag for "keep comments" at all because it makes it harder to actually use this feature.  As for the PCH size regression noted in the removal [2], I suspect that we could improve on that a bit by combining the source ranges of adjacent // comments into a single comment.

I just did some unscientific measurements with two files: one
containing 10,000 lines of /// comments (500 Kb), second containing
100,000 of them (5 Mb).

In the small file case I see 5-10% increase in compile time.  For the
second I see x4 increase in compile time.  In the second case most of
that time is spent in vector::push_back.

But those are pathological cases.  I tried to prepare a realistic one.
 I changed all comments to parseable "/**" comments in sqlite3 sources
(4.7M before preprocessing) and saw only 1% increase in compile time.

Please note that I did these measurements on a newer version of patch
that actually does some parsing on all comments (determines comment
type and merges consecutive comments).  It doesn't cost much because
comment text is already in cache after lexing.  We can try to do that
lazily, but it will not help reduce vector::push_back time in
pathological cases.

> Now, we may eventually want to actually look at the contents of those comments, and *that* is very likely to cost us performance. However, that's the kind of thing that would affect warnings or other tools, not fundamentally change the AST, so I expect those to be under the control of various flags.

I am in favor of removing this LANGOPT too.

>>  I made this
>> option a LANGOPT, but maybe it should be a BENIGN_LANGOPT?
>
> If we keep it, it should be a BENIGN_LANGOPT, because the presence or absence of comments does not affect the language semantics or ASTs in some incompatible way.

Done.

> Some comments:
>
> @@ -375,7 +375,15 @@
>
>   const TargetInfo *Target;
>   clang::PrintingPolicy PrintingPolicy;
> -
> +
> +  /// \brief Source ranges for all of the comments in the source file,
> +  /// sorted in order of appearance in the translation unit.
> +  std::vector<SourceRange> CommentSourceRanges;
> +
> +  /// \brief Mapping from declarations to their comments, once we have
> +  /// already looked up the comment associated with a given declaration.
> +  llvm::DenseMap<const Decl *, std::string> DeclComments;
> +
>
> I think the value type be StringRef, pointing into ASTContext-allocated memory, so that we can avoid separate allocations for each documentation string we query. As a micro-optimization, this would also allow the StringRef to simply point back into the mmap'd file in the cases where we have a single-line comment, e.g,
>
>        /// My favorite function
>
> Of course, we'd still need a copy for multi-line comments like this
>
>  /**
>     *
>     */
>
> but I bet we'd still get a decent savings from optimizing the single-line case.
>
> +  /// \brief Return the Doxygen-style comment attached to a given declaration.
> +  const char *getCommentForDecl(const Decl *D);
>
> This should also return a StringRef, which will be safe once you're using StringRefs for the DeclComments data structure.

Changed return type to const RawComment *, which contains a
SourceRange and a StringRef to source file.  If we extract comments
into strings, we will loose source locations for diagnostics during
further parsing.

> /**
> + * \brief Given a cursor that represents a declaration, return the associated
> + * comment.
> + */
> +CINDEX_LINKAGE CXString clang_Cursor_getComment(CXCursor C);
>
> Given that we're going to want to provide a function that returns a structured representation of the comment, and that function is what we will want people to be using most of the time, can we put either "raw" or "unprocessed" or "unparsed" into the name somewhere?
>
>        CINDEX_LINKAGE CXString clang_cursor_getRawComment(CXCursor C);
>
> ?
Replaced by:
CXSourceRange clang_Cursor_getCommentRange(CXCursor C)
CXString clang_Cursor_getRawCommentText(CXCursor C)

> You'll want a clang_disposeString() call once you're done with CommentCString.
Done.

>
> +
> +CXString clang_Cursor_getComment(CXCursor C) {
> +  if (!clang_isDeclaration(C.kind))
> +    return createCXString((const char *) NULL);
> +
> +  const Decl *D = getCursorDecl(C);
> +  ASTContext &Context = getCursorContext(C);
> +
> +  return createCXString(Context.getCommentForDecl(D));
> +}
>
> … and with ASTContext:: getCommentForDecl() returning a StringRef into memory owned by the ASTContext, you can pass DupString=false for this call to createCXString, eliminating the extra copy here.

Done.

> +namespace {
> +/// \breif Compare two non-overlapping source ranges.
> +class BeforeInTranslationUnit {
[...]
> Typo "\breif". Don't we have one of these function objects around already?

It was removed at some point.  As per Jordan's comment, moved this
into SourceManager.h.

> +  // If the declaration doesn't map directly to a location in a file, we
> +  // can't find the comment.
> +  // TODO: use the location of the identifier being declared, this way we can
> +  // find after-member comments more precisely when there are multiple
> +  // identifiers declared in one declaration.
> +  SourceLocation DeclLoc = D->getLocStart();
> +  if (DeclLoc.isInvalid() || !DeclLoc.isFileID())
> +    return NULL;
>
> Regarding this TODO, if you simply use D->getLocation(), it will return the location of the name of the declaration (if there is one), which may just do what you want. There are some advantages to using the starting location, though, because it's often on the line before the name
>
>        int
>        X::foo
>
> The trick will be to handle both this and the multi-declarator case well.

Switched to D->getLocation() for now.

> Index: lib/Serialization/ASTReader.cpp
> ===================================================================
> --- lib/Serialization/ASTReader.cpp     (revision 158158)
> +++ lib/Serialization/ASTReader.cpp     (working copy)
> @@ -2426,7 +2426,14 @@
>       }
>       break;
>     }
> +
> +    case COMMENT_SOURCE_RANGES: {
> +      Context.CommentSourceRanges.reserve(Record.size());
> +      for (unsigned Idx = 0; Idx < Record.size(); )
> +        Context.CommentSourceRanges.push_back(ReadSourceRange(F, Record, Idx));
> +      break;
>     }
> +    }
>   }
>   Error("premature end of bitstream in AST file");
>   return Failure;
>
> Please make the deserialization of comment ranges lazy, so that we don't pay the cost of deserializing comments unless we actually use them. There are two typical approaches to doing this within the AST reader/writer:
>
>        a) Write out the comment source range data array as a single blob, which essentially just dumps the bytes of the array into the AST file. The reader then gets them mmap'd in as an array. This is good for large arrays where the cost of deserialization matters and we're prone to random access (e.g., the mapping from declaration ID to its location in the file), but it takes a lot of space in the AST file.
>
>        b) Introduce a new block for the comment source range data, and when we initially see that block, stash its cursor somewhere. Later on, when we need that information, use the cursor to actually deserialize the contents of the block. It's still lazy.
>
> I think (b) is best for this application, because we don't want to waste tons of space in the AST file for something that isn't going to be used by all clients (in particular, the compiler proper doesn't need it).

I agree that it would be nice to do, but I don't see how it should be
done.  Comments are loaded lazily from
ASTContext::getCommentForDecl(), which calls
ASTReader::ReadComments().  To deserialize a SourceLocation we need a
ModuleFile (ASTReader::ReadSourceLocation(ModuleFile &ModuleFile,
unsigned Raw)).  I don't think ModuleFile is available at that time.
Can you give me some advice here?

Dmitri

-- 
main(i,j){for(i=2;;i++){for(j=2;j<i;j++){if(!(i%j)){j=0;break;}}if
(j){printf("%d\n",i);}}} /*Dmitri Gribenko <gribozavr at gmail.com>*/