[PATCH] D143142: [clang][lex] Enable Lexer to grow its buffer

David Rector via Phabricator via cfe-commits cfe-commits at lists.llvm.org
Mon Feb 20 06:20:23 PST 2023


davrec added a comment.

Suggested a few adjustments in `LexTokenInternal` you might want to test for perf improvements (orthogonal to this patch, but could boost its numbers :).
And also noted that `Lexer::getBuffer()` has same issue as `getBufferLocation()` re potential for pointer invalidation IIUC.  At a minimum we need comments on these functions explaining any risks; better still to remove them from the public interface.  If downstream users use these functions and complain, good - they need to be aware of this change.



================
Comment at: clang/include/clang/Lex/Lexer.h:285
   /// Gets source code buffer.
-  StringRef getBuffer() const {
-    return StringRef(BufferStart, BufferEnd - BufferStart);
-  }
+  StringRef getBuffer() const { return StringRef(BufferStart, BufferSize); }
 
----------------
Same issue as with `getBufferLocation()`, publicly returning it permits possible pointer invalidation. Fortunately I only see it used in a single spot (prior to this patch anyway) which can be easily eliminated IIUC.  Yank this function?  Or make private/append "Unsafe" to name (and explain in comments)?


================
Comment at: clang/include/clang/Lex/Lexer.h:307
   /// Return the current location in the buffer.
-  const char *getBufferLocation() const { return BufferPtr; }
+  const char *getBufferLocation() const {
+    assert(BufferOffset <= BufferSize && "Invalid buffer state");
----------------
davrec wrote:
> I think I'd like this to return `unsigned`; i.e. I think after this patch we should not even be publicly exposing buffer locations as pointers, IIUC.  A brief search for uses of `getBufferLocation()` (there aren't many) suggests this would be probably be fine and indeed would get rid of some unnecessary pointer arithmetic.  And indeed if anything really needs that `const char *` that might be a red flag to investigate further.
Looking more closely I see that `getCurrentBufferOffset` returns the unsigned, and this patch already changes some `getBufferLocation` usages to `getCurrentBufferOffset`.  So, I say either yank it or make it private or append "Unsafe" and explain in comments.


================
Comment at: clang/lib/Lex/Lexer.cpp:2948-2949
+  if (isHorizontalWhitespace(BufferStart[CurOffset])) {
+  SkipWhitespace(Result, CurOffset + 1, TokAtPhysicalStartOfLine);
+  return false;
   }
----------------
indent


================
Comment at: clang/lib/Lex/Lexer.cpp:2973-2975
+  char Char = getAndAdvanceChar(CurOffset, Tmp);
+  switch (Char) {
+  default:
----------------
indent


================
Comment at: clang/lib/Lex/Lexer.cpp:3630-3632
     do {
-      ++CurPtr;
-    } while (isHorizontalWhitespace(*CurPtr));
+      ++CurOffset;
+    } while (isHorizontalWhitespace(BufferStart[CurOffset]));
----------------
```
for (isHorizontalWhitespace(BufferStart[++CurOffset]);;)
  ;
```
might save a few instructions?  Worth trying since this function is the main perf-critical one.


================
Comment at: clang/lib/Lex/Lexer.cpp:3739-3752
+    if (BufferStart[CurOffset] == '/' && BufferStart[CurOffset + 1] == '/' &&
+        !inKeepCommentMode() && LineComment &&
+        (LangOpts.CPlusPlus || !LangOpts.TraditionalCPP)) {
+      if (SkipLineComment(Result, CurOffset + 2, TokAtPhysicalStartOfLine))
         return true; // There is a token to return.
       goto SkipIgnoredUnits;
+    } else if (BufferStart[CurOffset] == '/' &&
----------------
Spitballing again for possible minor perf improvements:
```
  if (char Char0 = BufferStart[CurOffset] == '/' && !inKeepCommentMode()) {
    if (char Char1 = BufferStart[CurOffset + 1] == '/' && LineComment && 
        (LangOpts.CPlusPlus || !LangOpts.TraditionalCPP)) {
      if (SkipLineComment(Result, CurOffset + 2, TokAtPhysicalStartOfLine))
        return true; // There is a token to return.
      goto SkipIgnoredUnits;
    } else if (Char1 == '*') {
      if (SkipBlockComment(Result, CurOffset + 2, TokAtPhysicalStartOfLine))
        return true; // There is a token to return.
      goto SkipIgnoredUnits;
    }
  } else if (isHorizontalWhitespace(Char0)) {
    goto SkipHorizontalWhitespace;
  }
```


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D143142/new/

https://reviews.llvm.org/D143142



More information about the cfe-commits mailing list