[cfe-commits] r112458 - in /cfe/trunk: include/clang/AST/ include/clang/Basic/ include/clang/Lex/ include/clang/Sema/ lib/AST/ lib/Checker/ lib/CodeGen/ lib/Lex/ lib/Parse/ lib/Sema/ test/Parser/ test/SemaCXX/ tools/libclang/

Mon Aug 30 10:23:49 PDT 2010

On Aug 29, 2010, at 2:26 PM, Sean Hunt wrote:

> Author: coppro
> Date: Sun Aug 29 16:26:48 2010
> New Revision: 112458
> 
> URL: http://llvm.org/viewvc/llvm-project?rev=112458&view=rev
> Log:
> Implement C++0x user-defined string literals.
> 
> The extra data stored on user-defined literal Tokens is stored in extra
> allocated memory, which is managed by the PreprocessorLexer because there isn't
> a better place to put it that makes sure it gets deallocated, but only after
> it's used up. My testing has shown no significant slowdown as a result, but
> independent testing would be appreciated.

Hi Sean, I'll let Doug noodle on the ast and sema aspects of this, here are some other comments.  The short version of this is that I really want you to revert this while you work out the issues.

-Chris

> +++ cfe/trunk/include/clang/Lex/Lexer.h Sun Aug 29 16:26:48 2010
> @@ -17,6 +17,7 @@
> #include "clang/Lex/PreprocessorLexer.h"
> #include "clang/Basic/LangOptions.h"
> #include "llvm/ADT/SmallVector.h"
> +#include "llvm/Support/Allocator.h"
> #include <string>
> #include <vector>
> #include <cassert>
> @@ -67,6 +68,9 @@
>   // line" flag set on it.
>   bool IsAtStartOfLine;
> 
> +  // ExtraDataAllocator - An allocator for extra data on a token.
> +  llvm::BumpPtrAllocator ExtraDataAllocator;

I'm pretty sure that this isn't going to work.  Lexers are transient and can be destroyed at the same time a token is returned.  Specifically, when the last token of a buffer is lexed, the token is returned and the lexer is destroyed.  It seems to me that this will cause a dangling pointer into the bump allocator that the token refers to.

> +++ cfe/trunk/include/clang/Lex/Preprocessor.h Sun Aug 29 16:26:48 2010
> @@ -645,7 +645,7 @@
>   /// copy).  The caller is not allowed to modify the returned buffer pointer
>   /// if an internal buffer is returned.
>   unsigned getSpelling(const Token &Tok, const char *&Buffer, 
> -                       bool *Invalid = 0) const;
> +                       bool *Invalid = 0, bool LiteralOnly = false) const;

What is this argument?  You don't document it and it doesn't make any sense just looking at it.  Why is it coming after the Invalid pointer? 

> +++ cfe/trunk/include/clang/Lex/Token.h Sun Aug 29 16:26:48 2010
> @@ -14,16 +14,16 @@
> #ifndef LLVM_CLANG_TOKEN_H
> #define LLVM_CLANG_TOKEN_H
> 
> +#include "llvm/Support/Allocator.h"
> #include "clang/Basic/TemplateKinds.h"
> #include "clang/Basic/TokenKinds.h"
> #include "clang/Basic/SourceLocation.h"
> #include "clang/Basic/OperatorKinds.h"
> +#include "clang/Basic/IdentifierTable.h"
> #include <cstdlib>
> 
> namespace clang {
> 
> -class IdentifierInfo;
> -
> /// Token - This structure provides full information about a lexed token.
> /// It is not intended to be space efficient, it is intended to return as much
> /// information as possible about each returned token.  This is expected to be
> @@ -34,6 +34,14 @@
> /// can be represented by a single typename annotation token that carries
> /// information about the SourceRange of the tokens and the type object.
> class Token {
> +  /// An extra-large structure for storing the data needed for a user-defined
> +  /// literal - the raw literal, and the identifier suffix.
> +  struct UDLData {
> +    IdentifierInfo *II;
> +    const char *LiteralData;
> +    unsigned LiteralLength;
> +  };

I don't understand why these have to be stored in the token.  Why not reinterpret this when doing sema, just like the digits in a number are reinterpreted when doing sema.  The lexer doesn't store the numeric value for "42" in the token after all.

> +  /// Various flags set per token:
>   enum TokenFlags {
> -    StartOfLine   = 0x01,  // At start of line or only after whitespace.
> -    LeadingSpace  = 0x02,  // Whitespace exists before this token.
> -    DisableExpand = 0x04,  // This identifier may never be macro expanded.
> -    NeedsCleaning = 0x08   // Contained an escaped newline or trigraph.
> +    StartOfLine   =       0x01,  ///< At start of line or only after whitespace
> +    LeadingSpace  =       0x02,  ///< Whitespace exists before this token
> +    DisableExpand =       0x04,  ///< This identifier may never be macro expanded
> +    NeedsCleaning =       0x08,  ///< Contained an escaped newline or trigraph
> +    UserDefinedLiteral =  0x10,  ///< This literal has a ud-suffix
> +    LiteralPortionClean = 0x20   ///< A UDL's literal portion needs no cleaning

Likewise, UserDefinedLiteral doesn't seem appropriate here.
> +++ cfe/trunk/lib/Lex/Lexer.cpp Sun Aug 29 16:26:48 2010
> @@ -548,6 +548,11 @@
>   isInited = true;
> }
> 
> +/// isIdentifierStart - Return true if this is the start character of an
> +/// identifier, which is [a-zA-Z_].
> +static inline bool isIdentifierStart(unsigned char c) {
> +  return (CharInfo[c] & (CHAR_LETTER|CHAR_UNDER)) ? true : false;
> +}
> 
> /// isIdentifierBody - Return true if this is the body character of an
> /// identifier, which is [a-zA-Z0-9_].
> @@ -982,8 +987,30 @@
> 
>   // Update the location of the token as well as the BufferPtr instance var.
>   const char *TokStart = BufferPtr;
> -  FormTokenWithChars(Result, CurPtr,
> -                     Wide ? tok::wide_string_literal : tok::string_literal);
> +  tok::TokenKind Kind = Wide ? tok::wide_string_literal : tok::string_literal;
> +
> +  // FIXME: Handle UCNs
> +  unsigned Size;
> +  if (PP && PP->getLangOptions().CPlusPlus0x &&
> +      isIdentifierStart(getCharAndSize(CurPtr, Size))) {
> +    Result.makeUserDefinedLiteral(ExtraDataAllocator);
> +    Result.setFlagValue(Token::LiteralPortionClean, !Result.needsCleaning());
> +    Result.setKind(Kind);
> +    Result.setLiteralLength(CurPtr - BufferPtr);
> +
> +    // FIXME: We hack around the lexer's routines a lot here.
> +    BufferPtr = CurPtr;

This is gross.  Why exactly do you need to do this?  You're violating several very important invariants here: for example you're accessing the preprocessor in raw mode, which will break "a lot of important stuff" (go look at who uses raw mode).

-Chris