[cfe-commits] Comment AST and parser

Mon Jul 2 09:39:45 PDT 2012

On Jun 29, 2012, at 5:40 PM, Dmitri Gribenko wrote:

> Hello,
> 
> I'm working on comment AST and real comment parser.  Just wanted to
> gather opinions or alternative views or receive general feedback.
> 
> I decided to model the AST along the ideas of HTML AST: block and
> inline content.
> 
> * Block content is a paragraph or a command that has a paragraph as an
> argument or verbatim command.
> * Inline content is placed within some block.  Inline content includes
> plain text, inline commands and HTML as tag soup.
> 
> Here are the AST nodes:
> 
> $ cat include/clang/Basic/CommentNodes.td
> class Comment<bit abstract = 0> {
>  bit Abstract = abstract;
> }
> 
> class DComment<Comment base, bit abstract = 0> : Comment<abstract> {
>  Comment Base = base;
> }
> 
> def InlineContentComment : Comment<1>;
>  def TextComment : DComment<InlineContentComment>;
>  def NewlineComment : DComment<InlineContentComment>;
>  def InlineCommandComment : DComment<InlineContentComment>;
>  def HTMLTagComment : DComment<InlineContentComment, 1>;
>    def HTMLOpenTagComment : DComment<HTMLTagComment>;
>    def HTMLCloseTagComment : DComment<HTMLTagComment>;
> 
> def BlockContentComment : Comment<1>;
>  def ParagraphComment : DComment<BlockContentComment>;
>  def BlockCommandComment : DComment<BlockContentComment>;
>    def ParamCommandComment : DComment<BlockCommandComment>;
>    def VerbatimBlockComment : DComment<BlockCommandComment>;
>    def VerbatimLineComment : DComment<BlockCommandComment>;
> 
> def VerbatimBlockLineComment : Comment;
> 
> def FullComment : Comment;
> 
> Here is an example AST:
> 
> === Source:
> // \brief Aaa
> //
> // Bbb
> === AST:
> (FullComment 0x7fe3c10061c0 <:1:3, :3:7>
>  (ParagraphComment 0x7fe3c1006040 <:1:3, :1:4>
>    (TextComment 0x7fe3c1006010 <:1:3, :1:4> Text=" "))
>  (BlockCommandComment 0x7fe3c1006070 <:1:4, :1:14> Name="brief"
>    (ParagraphComment 0x7fe3c1006130 <:1:10, :1:14>
>      (TextComment 0x7fe3c1006100 <:1:10, :1:14> Text=" Aaa")))
>  (ParagraphComment 0x7fe3c1006190 <:3:3, :3:7>
>    (TextComment 0x7fe3c1006160 <:3:3, :3:7> Text=" Bbb")))
> 
> See attached files for a patch (WIP) and more example ASTs.
> 
> Some implementation details: I am not a big fan of
> TextTokenRetokenizer class I'm introducing with this patch, but this
> seems to be way to go about splitting existing text tokens.


This is looking great! I like this direction a lot. Some specific comments below.

@@ -427,12 +428,15 @@
   /// \brief Mapping from declarations to their comments (stored within
   /// Comments list), once we have already looked up the comment associated
   /// with a given declaration.
-  mutable llvm::DenseMap<const Decl *, const RawComment *> DeclComments;
+  mutable llvm::DenseMap<const Decl *, const RawComment *> DeclRawComments;
 
   /// \brief Return the documentation comment attached to a given declaration,
   /// without looking into cache.
   const RawComment *getRawCommentForDeclNoCache(const Decl *D) const;
 
+  mutable llvm::DenseMap<const Decl *, comments::FullComment *>
+      DeclParsedComments;
+
 public:
   void addComment(const RawComment &RC) {
     Comments.addComment(RC);

Should these two DenseMaps be collapsed into one

	DenseMap<const Decl *, std::pair<const RawComment *, comments::FullComment *> >

to reduce the number of lookups we need to do?

+/// Any part of the comment.
+/// Abstract class.
+class Comment {
+protected:
+  // Type of this AST node.
+  unsigned Kind : 8;
+
+  // Preffered location to show caret.
+  SourceLocation Loc;

"Preffered" should be "Preferred".

+public:
+  enum CommentKind {
+    NoCommentKind = 0,
+#define COMMENT(CLASS, PARENT) CLASS##Kind,
+#define COMMENT_RANGE(BASE, FIRST, LAST) \
+    first##BASE##Constant=FIRST##Kind, last##BASE##Constant=LAST##Kind,
+#define LAST_COMMENT_RANGE(BASE, FIRST, LAST) \
+    first##BASE##Constant=FIRST##Kind, last##BASE##Constant=LAST##Kind
+#define ABSTRACT_COMMENT(COMMENT)
+#include "clang/AST/CommentNodes.inc"
+  };

"first" and "last" here should be uppercase, please.

+/// A newline.
+class NewlineComment : public InlineContentComment {
+public:

Why have a special AST node for newlines, rather than either embedding \n's or having a "there's a newline following this inline comment" bit? Abstractly, I guess having an AST node to cope with might make it easier for clients of the comment AST to translate it properly, or is there another reason?

+
+class HTMLOpenTagComment : public HTMLTagComment {
+public:
+  class Attribute {
+    StringRef Name;
+    StringRef Value;
+    SourceRange Range;
+  };

Please add a comment for this class. A reminder that open/close comments are *not* automatically paired by the representation would be wonderful, here.

+private:
+  ArrayRef<Attribute *> Attributes;

Might it be better to just have an ArrayRef<Attribute>, to avoid the extra indirections? Copying Attribute instances around is fairly fast.

+
+class HTMLCloseTagComment : public HTMLTagComment {

Comment please!

+/// Doxygen \\param command.
+class ParamCommandComment : public BlockCommandComment {
+public:
+  enum PassDirection {
+    DIR_IN,
+    DIR_OUT,
+    DIR_IN_OUT
+  };
+

Name these PD_In, PD_Out, PD_InOut, perhaps? Or even In, Out, InOut, since they're nested in ParamCommandComment.

+/// Verbatim block that (e. g., preformatted code).
+class VerbatimBlockComment : public BlockCommandComment {
+public:

Partial sentence in this comment.

+
+class VerbatimBlockLineComment : public Comment {
+public:
+  static bool classof(const Comment *C) {
+    return C->getCommentKind() == VerbatimBlockLineCommentKind;
+  }
+
+  static bool classof(const VerbatimBlockLineComment *) { return true; }
+
+  child_iterator child_begin() const { return NULL; }
+
+  child_iterator child_end() const { return NULL; }
+};
+
+class VerbatimLineComment : public BlockCommandComment {};

Please comment these classes; I'm not sure I understand the difference between them.

+ParamCommandComment *Sema::actOnParamCommandArg(ParamCommandComment *Command,
+                                                SourceLocation ArgLocBegin,
+                                                SourceLocation ArgLocEnd,
+                                                StringRef Arg,
+                                                bool IsDirection) {
+  if (IsDirection) {
+    ParamCommandComment::PassDirection Direction;
+    std::string ArgLower = Arg.lower();
+    // TODO: optimize: lower Name first (need an API in SmallString for that),
+    // after that StringSwitch.
+    if (ArgLower == "[in]")
+      Direction = ParamCommandComment::DIR_IN;
+    else if (ArgLower == "[out]")
+      Direction = ParamCommandComment::DIR_OUT;
+    else if (ArgLower == "[in,out]" || ArgLower == "[out,in]")
+      Direction = ParamCommandComment::DIR_IN_OUT;
+    else {
+      Direction = ParamCommandComment::DIR_IN;
+      // Diag() unrecognized parameter passing direction, valid directions are ...
+    }

I imagine that we'll want to deal with spaces around the ',', for people who get it wrong?

+bool Sema::isBlockCommand(StringRef Name) {
+  return llvm::StringSwitch<bool>(Name)
+      .Case("brief", true)
+      .Case("result", true)
+      .Case("return", true)
+      .Case("returns", true)
+      .Case("author", true)
+      .Case("authors", true)
+      .Case("pre", true)
+      .Case("post", true)
+      .Default(false) || isParamCommand(Name);
+}
+
+bool Sema::isParamCommand(StringRef Name) {
+  return llvm::StringSwitch<bool>(Name)
+      .Case("param", true)
+      .Case("arg", true)
+      .Default(false);
+}

To be TableGen'd at a later date, I'm sure ;)

Index: utils/TableGen/TableGen.cpp
===================================================================

--- utils/TableGen/TableGen.cpp	(revision 159471)
+++ utils/TableGen/TableGen.cpp	(working copy)
@@ -38,6 +38,7 @@
   GenClangDiagsDefs,
   GenClangDiagGroups,
   GenClangDiagsIndexName,
+  GenClangCommentNodes,
   GenClangDeclNodes,
   GenClangStmtNodes,
   GenClangSACheckers,
@@ -86,6 +87,8 @@
                     clEnumValN(GenClangDiagsIndexName,
                                "gen-clang-diags-index-name",
                                "Generate Clang diagnostic name index"),
+                    clEnumValN(GenClangCommentNodes, "gen-clang-comment-nodes",
+                               "Generate Clang AST comment nodes"),
                     clEnumValN(GenClangDeclNodes, "gen-clang-decl-nodes",
                                "Generate Clang AST declaration nodes"),
                     clEnumValN(GenClangStmtNodes, "gen-clang-stmt-nodes",
@@ -148,6 +151,9 @@
     case GenClangDiagsIndexName:
       EmitClangDiagsIndexName(Records, OS);
       break;
+    case GenClangCommentNodes:
+      EmitClangASTNodes(Records, OS, "Comment", "");
+      break;
     case GenClangDeclNodes:
       EmitClangASTNodes(Records, OS, "Decl", "Decl");
       EmitClangDeclContext(Records, OS);

Is there more that you intend to do with TableGen for the AST nodes themselves? It looks like all we're getting is the list of node names, but TableGen is a pretty heavyweight way to keep that up-to-date.

	- Doug