[cfe-commits] [PATCH] Lexer for structured comments

Dmitri Gribenko gribozavr at gmail.com
Mon Jun 25 15:17:08 PDT 2012


Proper patch attached.

On Mon, Jun 25, 2012 at 2:37 PM, Dmitri Gribenko <gribozavr at gmail.com> wrote:
> Hi Doug,
>
> Thank you for the review!  Comments inline, updated patch attached.
>
> On Mon, Jun 25, 2012 at 10:45 AM, Douglas Gregor <dgregor at apple.com> wrote:
>> Index: include/clang/AST/RawCommentList.h
>> ===================================================================
>> --- include/clang/AST/RawCommentList.h  (revision 159034)
>> +++ include/clang/AST/RawCommentList.h  (working copy)
>> @@ -91,12 +91,22 @@
>>   unsigned getBeginLine(const SourceManager &SM) const;
>>   unsigned getEndLine(const SourceManager &SM) const;
>>
>> +  StringRef getBriefText(const SourceManager &SourceMgr) const {
>> +    if (BriefTextValid)
>> +      return BriefText;
>> +
>> +    return extractBriefText(SourceMgr);
>> +  }
>> +
>>  private:
>>   SourceRange Range;
>>
>>   mutable StringRef RawText;
>> -  mutable bool RawTextValid : 1; ///< True if RawText is valid
>> +  mutable std::string BriefText;
>>
>> Should BriefText be ASTContext-allocated, so that RawComment can stay POD-like?
>
> Done.
>
>> +namespace tok {
>> +enum TokenKind {
>> +  eof,
>> +  newline,
>> +  text,
>> +  command,
>> +  verbatim_block_begin,
>> +  verbatim_block_line,
>> +  verbatim_block_end,
>> +  verbatim_line,
>> +  html_tag_open_name,
>> +  html_tag_open_attr,
>>
>> I find the choice of html_tag_open_name/html_tag_open_attr interesting. Presumably, you're doing this so that we don't need to have '=' and string literals as token kinds (because we generally don't want them as token kinds), and that makes sense. However, shouldn't there be an html_tag_open_end token as well, to capture the '>'? Otherwise, the parser has to either intuit the ending tag based on having a non-html_tag_open* token or poke at the last character in each html_tag_open_* token.
>
> The idea was for the parser to consume html_tag_open_attr tokens until
> it sees a token of a different kind.
>
>> I see that you don't have location information for the '=' or for the value in TextValue2, although I guess you can intuit it from the length?
>
> Oh, the locations for '=' and '>' are lost...  Seems like I'll need to
> introduce '<', html_ident, '=', html_quoted_string, '>' as tokens in
> html state.  Done.
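
To illustrate the five-token scheme adopted above, here is a minimal, self-contained sketch of splitting an HTML open tag into '<', html_ident, '=', html_quoted_string, and '>' tokens, so that every punctuator keeps its own location. The enum, struct, and helper names are hypothetical stand-ins for illustration, not Clang's actual comment lexer:

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Illustrative token kinds for the HTML-tag lexing state.
enum HtmlTokKind { html_lt, html_ident, html_equals, html_quoted_string, html_gt };

struct HtmlTok {
  HtmlTokKind Kind;
  std::string Text;
};

// Tokenize a well-formed open tag like `<a href="x">`. Error handling and
// source locations are omitted for brevity; in the real lexer each token
// would carry its own location.
std::vector<HtmlTok> tokenizeOpenTag(const std::string &S) {
  std::vector<HtmlTok> Toks;
  size_t I = 0;
  while (I < S.size()) {
    char C = S[I];
    if (C == '<')      { Toks.push_back({html_lt, "<"}); ++I; }
    else if (C == '>') { Toks.push_back({html_gt, ">"}); ++I; }
    else if (C == '=') { Toks.push_back({html_equals, "="}); ++I; }
    else if (C == '"') {
      size_t End = S.find('"', I + 1);
      Toks.push_back({html_quoted_string, S.substr(I + 1, End - I - 1)});
      I = End + 1;
    } else if (std::isalpha((unsigned char)C)) {
      size_t Start = I;
      while (I < S.size() && std::isalnum((unsigned char)S[I])) ++I;
      Toks.push_back({html_ident, S.substr(Start, I - Start)});
    } else {
      ++I; // skip whitespace between attributes
    }
  }
  return Toks;
}
```

With this scheme, `<a href="x">` yields six tokens, and the parser knows the tag has ended when it consumes the html_gt token.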
>
>> +  /// Contains text value associated with a token.
>> +  StringRef TextValue1;
>> +
>> +  /// Contains text value associated with a token.
>> +  StringRef TextValue2;
>>
>> It would be nice if Token were a POD type. Could we just store ptr/length for TextValue1 and TextValue2, so that Token remains a POD?
>
> Done.
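
A quick sketch of the ptr/length approach, assuming std::string_view as a stand-in for llvm::StringRef so the example stays self-contained; the member names are illustrative, not the patch's:

```cpp
#include <cassert>
#include <string_view>
#include <type_traits>

// Keeping Token POD: store a pointer/length pair instead of a StringRef
// member, and rebuild the view on demand in the accessor.
struct Token {
  const char *TextPtr;
  unsigned TextLen;

  std::string_view getText() const {
    return std::string_view(TextPtr, TextLen);
  }
  void setText(std::string_view S) {
    TextPtr = S.data();
    TextLen = static_cast<unsigned>(S.size());
  }
};

// No user-declared constructors or non-trivial members, so Token stays
// trivially copyable and standard-layout.
static_assert(std::is_trivial<Token>::value &&
              std::is_standard_layout<Token>::value,
              "Token should stay POD-like");
```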
>
>> Index: tools/libclang/CIndex.cpp
>> ===================================================================
>> --- tools/libclang/CIndex.cpp   (revision 159034)
>> +++ tools/libclang/CIndex.cpp   (working copy)
>> @@ -5707,7 +5707,25 @@
>>
>>  } // end: extern "C"
>>
>> +CXString clang_Cursor_getBriefCommentText(CXCursor C) {
>> +  if (!clang_isDeclaration(C.kind))
>> +    return createCXString((const char *) NULL);
>>
>> +  const Decl *D = getCursorDecl(C);
>> +  ASTContext &Context = getCursorContext(C);
>> +  const RawComment *RC = Context.getRawCommentForDecl(D);
>> +
>> +  if (RC && RC->isDoxygen()) {
>>
>> I was about to ask "Why shouldn't this work with JavaDoc or Qt/HeaderDoc comments?", but the only issue here is that "isDoxygen()" isn't a good name choice. How about simply "isDocumentation()"?
>
> Agreed, done.
>
>> +  /// Registered verbatim-like block commands.
>> +  VerbatimBlockCommandVector VerbatimBlockCommands;
>> <snip>
>> +  /// Registered verbatim-like line commands.
>> +  VerbatimLineCommandVector VerbatimLineCommands;
>> <snip>
>> +  /// \brief Register a new verbatim block command.
>> +  void addVerbatimBlockCommand(StringRef BeginName, StringRef EndName);
>> +
>> +  /// \brief Register a new verbatim line command.
>> +  void addVerbatimLineCommand(StringRef Name);
>>
>> Do we need this much generality? It seems like a StringSwitch for each case (verbatim block and verbatim line) would suffice for now, and we could optimize that later by TableGen'ing the string matcher for the various Doxygen/HeaderDoc/Qt/JavaDoc command names.
>
> Converted to StringSwitch, but retained the vector search.  I think we
> need both: a string matcher to handle predefined commands and a vector
> to handle commands registered dynamically.
>
> The StringSwitch should eventually go, though, because the upcoming
> parser and semantic analysis need to know command properties in
> detail, so the command lists would end up duplicated in several
> places.  We will need to TableGen this eventually.
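
The two-level lookup can be sketched as follows, with a hard-coded matcher standing in for llvm::StringSwitch plus a vector for dynamically registered commands. The struct and the predefined command set are illustrative only (though \fn, \var, and \typedef are real Doxygen line commands):

```cpp
#include <cassert>
#include <string>
#include <vector>

struct VerbatimLineCommand { std::string Name; };

struct CommandTable {
  std::vector<VerbatimLineCommand> Dynamic;

  // Register a command at run time.
  void addVerbatimLineCommand(const std::string &Name) {
    Dynamic.push_back({Name});
  }

  bool isVerbatimLineCommand(const std::string &Name) const {
    // Predefined commands, matched statically (a StringSwitch in the patch).
    if (Name == "fn" || Name == "var" || Name == "typedef")
      return true;
    // Fall back to the dynamically registered commands.
    for (const VerbatimLineCommand &C : Dynamic)
      if (C.Name == Name)
        return true;
    return false;
  }
};
```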
>
>> +  while (TokenPtr != CommentEnd) {
>> +    switch(*TokenPtr) {
>> +      case '\\':
>> +      case '@': {
>> +        TokenPtr++;
>> +        if (TokenPtr == CommentEnd) {
>> +          formTokenWithChars(T, TokenPtr, tok::text);
>> +          T.setText(StringRef(BufferPtr - T.getLength(), T.getLength()));
>> +          return;
>> +        }
>> +        char C = *TokenPtr;
>> +        if (StringRef("\\@&$#<>%\".:").find(C) != StringRef::npos) {
>>
>> This 'if' would almost surely be more efficient as a switch. More importantly, it should only apply to the '\\' case, and not the '@' case.
>
> (a) Converted to a switch.  (b) Doxygen also allows '@' as an escape
> character; for example, @\c yields the text \c.
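
The switch form of that escape test could look like the sketch below, covering the same character set as the original StringRef::find over "\\@&$#<>%\".:" (a hypothetical helper, shown only for shape):

```cpp
#include <cassert>

// True if C may follow '\\' or '@' to form an escaped-character token.
bool isEscapableChar(char C) {
  switch (C) {
  case '\\': case '@': case '&': case '$':
  case '#':  case '<': case '>': case '%':
  case '"':  case '.': case ':':
    return true;
  default:
    return false;
  }
}
```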
>
>> +        // Hardcoded support for lexing \f$ \f[ \f] \f{ \f} as a single
>> +        // command.
>> +        if (Length == 1 && TokenPtr[-1] == 'f' && TokenPtr != CommentEnd) {
>> +          C = *TokenPtr;
>> +          if (C == '$' || C == '[' || C == ']' || C == '{' || C == '}') {
>> +            TokenPtr++;
>> +            Length++;
>> +          }
>> +        }
>> +
>>
>> Could you mention the term 'formula' here, so the reader knows what the heck '\f' actual means?
>
> Done.
>
>> Also, I suspect that the various \f[…] and \f(…) commands actually do paren-, brace-, and bracket-matching internally. Please add a FIXME here so we don't forget to deal with this later (but, of course, it's not at all important to solve at this point!).
>
> Doxygen doesn't check that, and doing so might actually be harmful: a
> complex LaTeX formula (one using \phantom, for example) may have
> unbalanced parentheses.
>
>> +        if (isVerbatimBlockCommand(CommandName, EndName)) {
>> +          setupAndLexVerbatimBlock(T, TokenPtr, *BufferPtr, EndName);
>> +          return;
>> +        }
>> +        else if (isVerbatimLineCommand(CommandName)) {
>> +          lexVerbatimLine(T, TokenPtr);
>> +          return;
>> +        } else {
>> +          formTokenWithChars(T, TokenPtr, tok::command);
>> +          T.setCommandName(CommandName);
>> +          return;
>> +        }
>>
>> Since each of the 'if' branches returns, you can remove the else's and de-nest the last bit, e.g.,
>
> Done.
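
The de-nested control flow can be sketched like this; the helpers below are stubs standing in for the patch's real lexer methods, used only to make the early-return dispatch testable:

```cpp
#include <cassert>
#include <string>

enum Kind { VerbatimBlock, VerbatimLine, Command };

// Stubs for illustration; the real predicates consult the command tables.
bool isVerbatimBlockCommand(const std::string &Name) { return Name == "verbatim"; }
bool isVerbatimLineCommand(const std::string &Name)  { return Name == "fn"; }

// Since every branch returns, no else-chains are needed and the final
// case can be de-nested.
Kind classifyCommand(const std::string &Name) {
  if (isVerbatimBlockCommand(Name))
    return VerbatimBlock;   // would call setupAndLexVerbatimBlock
  if (isVerbatimLineCommand(Name))
    return VerbatimLine;    // would call lexVerbatimLine
  return Command;           // would call formTokenWithChars
}
```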
>
>> +      default: {
>> +        while (true) {
>> +          TokenPtr++;
>> +          if (TokenPtr == CommentEnd)
>> +            break;
>> +          char C = *TokenPtr;
>> +          if(C == '\n' || C == '\r' ||
>> +             C == '\\' || C == '@' || C == '<')
>> +            break;
>> +        }
>>
>> I hadn't realized it before, but now I see in the code that the string tokens returned by the lexer are not 'maximal', in the sense that they don't necessarily cover an entire string. For example, something like "foo \& bar" comes in as three tokens, "foo ", "&", and " bar". That makes
>
> Seems like the sentence ends unexpectedly here, but I got the point.
> However, returning maximal tokens would require allocating memory for
> the unescaped string (currently we get away with returning pointers
> into the source file).
>
> No change done.
>
>> +void Lexer::lexVerbatimBlockFirstLine(Token &T) {
>> +  assert(BufferPtr < CommentEnd);
>> +
>> +  // Extract current line.
>> +  const char *Newline = findNewline(BufferPtr, CommentEnd);
>> +  StringRef Line(BufferPtr, Newline - BufferPtr);
>> +
>> +  // Look for end command in current line.
>> +  size_t Pos = Line.find(VerbatimBlockEndCommandName);
>> +  const char *NextLine;
>>
>> Performance-wise, it would be better for us to only scan the line text once, finding either the verbatim block-end command or the location of the newline. For now, a FIXME will suffice.
>
> Added a FIXME.
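
The single-pass version suggested by the FIXME could look roughly like this: walk the buffer once and stop at whichever comes first, the block-end command or the newline. Names are illustrative, not the patch's:

```cpp
#include <cassert>
#include <string>

struct ScanResult {
  size_t Pos;     // offset of the end command, or of the newline
  bool FoundEnd;  // true if the end command appeared before the newline
};

// Scan a single line for EndCmd, without a separate findNewline pass.
ScanResult scanLineForEnd(const std::string &Buf, const std::string &EndCmd) {
  for (size_t I = 0; I < Buf.size(); ++I) {
    if (Buf[I] == '\n')
      return {I, false};
    if (Buf.compare(I, EndCmd.size(), EndCmd) == 0)
      return {I, true};
  }
  return {Buf.size(), false};
}
```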
>
>> +void Lexer::lex(Token &T) {
>> +again:
>> +  switch (CommentState) {
>> +  case LCS_BeforeComment:
>> +    if (BufferPtr == BufferEnd) {
>> +      formTokenWithChars(T, BufferPtr, tok::eof);
>> +      return;
>> +    }
>> +
>> +    assert(*BufferPtr == '/');
>> +    BufferPtr++; // Skip first slash.
>> +    switch(*BufferPtr) {
>> +    case '/': { // BCPL comment.
>> +      BufferPtr++; // Skip second slash.
>> +
>> +      if (BufferPtr != BufferEnd) {
>> +        // Skip Doxygen magic marker.
>> +        const char C = *BufferPtr;
>> +        if (C == '/' || C == '!')
>> +          BufferPtr++;
>> +      }
>>
>> Presumably, we have to perform this last 'if' check because the third '/' might be missing when '///<' is typo'd to '//<'? A comment here might help.
>
> It might also be missing if we merged comments like this:
> /// 1
> // 2
> /// 3
> int f(int);
>
> Added comments.
>
>> +    case '*': { // C comment.
>> +      BufferPtr++; // Skip star.
>> +
>> +      // Skip Doxygen magic marker.
>> +      const char C = *BufferPtr;
>> +      if ((C == '*' && *(BufferPtr + 1) != '/') || C == '!')
>> +        BufferPtr++;
>> +
>> +      // Skip less-than symbol that marks trailing comments.
>> +      if (BufferPtr != CommentEnd && *BufferPtr == '<')
>> +        BufferPtr++;
>>
>> Should we (somewhere, at some point) eliminate lines of all '*'s in C comments, e.g.,
>>
>> /**************
>>  * Docs
>>  **************/
>>
>> ?
>>
>> This seems not at all important now, but we should consider it for the future. (FIXME somewhere, perhaps).
>
> I think it should be done while post-processing the comment text (in
> CommentSema, I think).  Incidentally, this looks very much like
> Markdown 'rulers'.
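
A sketch of the post-processing idea: a line consisting only of '*' characters (and surrounding whitespace) inside a C comment is a decorative ruler and carries no documentation text. This is a hypothetical helper, not part of the patch:

```cpp
#include <cassert>
#include <string>

// True if Line contains at least one '*' and nothing but '*' and
// whitespace, e.g. the " **************" lines in the example above.
bool isRulerLine(const std::string &Line) {
  bool SawStar = false;
  for (char C : Line) {
    if (C == '*') { SawStar = true; continue; }
    if (C == ' ' || C == '\t')
      continue;
    return false;  // any other character means real content
  }
  return SawStar;
}
```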
>
>> +    // Turn any whitespace between comments (and there is only whitespace
>> +    // between them) into a newline.  We have two newlines between comments
>> +    // in total (first one was synthenized after a comment).
>>
>> Typo "synthenized"
>>
>> +        // Synthenize newline just after the C comment, regardless if there is
>> +        // actually a newline.
>>
>> Same typo here.
>
> Done.
>
> Dmitri
>
> --
> main(i,j){for(i=2;;i++){for(j=2;j<i;j++){if(!(i%j)){j=0;break;}}if
> (j){printf("%d\n",i);}}} /*Dmitri Gribenko <gribozavr at gmail.com>*/



-- 
main(i,j){for(i=2;;i++){for(j=2;j<i;j++){if(!(i%j)){j=0;break;}}if
(j){printf("%d\n",i);}}} /*Dmitri Gribenko <gribozavr at gmail.com>*/
-------------- next part --------------
A non-text attachment was scrubbed...
Name: comments-lexer-v4.patch
Type: application/octet-stream
Size: 77765 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/cfe-commits/attachments/20120625/06e43a54/attachment.obj>

