[PATCH] D143142: [clang][lex] Enable Lexer to grow its buffer

Mon Feb 20 02:32:41 PST 2023

cor3ntin added a comment.

This is an impressive amount of work. I think it makes sense!
Thanks a lot for doing that work.
I only have a few nits after a first review of this.

================
Comment at: clang/include/clang/Lex/Lexer.h:89

-  // End of the buffer.
-  const char *BufferEnd;
+  // Size of the buffer.
+  unsigned BufferSize;
----------------
Should that use `SourceLocation::UIntTy`?
Looking at comments in SourceManager, I think there was an attempt at supporting > 2GB file but I don't think it got anywhere.
Nevertheless, using `SourceLocation::UIntTy` would arguably be more consistent

It does seem to be a huge undertaking to change it though, I'm not sure it would be worth it at all. There would be far bigger issues with ridiculously large source files anyway.

================
Comment at: clang/include/clang/Lex/Lexer.h:609

-  bool CheckUnicodeWhitespace(Token &Result, uint32_t C, const char *CurPtr);
+  bool CheckUnicodeWhitespace(Token &Result, uint32_t C, unsigned CurOffset);

----------------
davrec wrote:
> FWIW it sucks that `uint32_t` is already sprinkled throughout the interface alongside `unsigned`, wish just one was used consistently, but that does not need to be addressed in this patch.
here, `uint32_t` is a codepoint, should arguably be `char32_t` instead. but agreed, not in this patch.

================
Comment at: clang/lib/Lex/Lexer.cpp:213
+  L->BufferSize = L->BufferOffset + TokLen;
+  assert(L->BufferStart[L->BufferSize] == 0 && "Buffer is not nul terminated!");

----------------

================
Comment at: clang/lib/Lex/Lexer.cpp:1203
     if (L && !L->isLexingRawMode())
-      L->Diag(CP-2, diag::trigraph_ignored);
+      L->Diag(CP - 2 - L->getBuffer().data(), diag::trigraph_ignored);
     return 0;
----------------
sunho wrote:
> shafik wrote:
> > I wonder do we really need to do these pointer gymnastics, maybe making this a member function would eliminate the need for it.
> Yes, we can change it to offset here.
Agreed, that would be nice! (In all places `getBuffer().data()` is used)

================
Comment at: clang/lib/Lex/Lexer.cpp:1353
+    if (unsigned EscapedNewLineSize =
+            getEscapedNewLineSize(&BufferStart[Offset])) {
       // Remember that this token needs to be cleaned.
----------------

================
Comment at: clang/lib/Lex/Lexer.cpp:1378
     // a trigraph warning.  If so, and if trigraphs are enabled, return it.
-    if (char C = DecodeTrigraphChar(Ptr + 2, Tok ? this : nullptr,
-                                    LangOpts.Trigraphs)) {
+    if (char C = DecodeTrigraphChar(&BufferStart[Offset + 2],
+                                    Tok ? this : nullptr, LangOpts.Trigraphs)) {
----------------

================
Comment at: clang/lib/Lex/Lexer.cpp:1740
+bool Lexer::tryConsumeIdentifierUTF8Char(unsigned &CurOffset) {
+  const char *UnicodePtr = &BufferStart[CurOffset];
   llvm::UTF32 CodePoint;
----------------

================
Comment at: clang/lib/Lex/Lexer.cpp:1744
       llvm::convertUTF8Sequence((const llvm::UTF8 **)&UnicodePtr,
-                                (const llvm::UTF8 *)BufferEnd,
-                                &CodePoint,
-                                llvm::strictConversion);
+                                (const llvm::UTF8 *)&BufferStart[BufferSize],
+                                &CodePoint, llvm::strictConversion);
----------------
Ditto in all similar places, I think it reads easier

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D143142/new/

https://reviews.llvm.org/D143142