[PATCH] D143142: [clang][lex] Enable Lexer to grow its buffer
Corentin Jabot via Phabricator via cfe-commits
cfe-commits at lists.llvm.org
Mon Feb 20 02:32:41 PST 2023
cor3ntin added a comment.
This is an impressive amount of work. I think it makes sense!
Thanks a lot for doing that work.
I only have a few nits after a first review of this.
================
Comment at: clang/include/clang/Lex/Lexer.h:89
- // End of the buffer.
- const char *BufferEnd;
+ // Size of the buffer.
+ unsigned BufferSize;
----------------
Should that use `SourceLocation::UIntTy`?
Looking at comments in SourceManager, I think there was an attempt at supporting > 2GB file but I don't think it got anywhere.
Nevertheless, using `SourceLocation::UIntTy` would arguably be more consistent
It does seem to be a huge undertaking to change it though, I'm not sure it would be worth it at all. There would be far bigger issues with ridiculously large source files anyway.
================
Comment at: clang/include/clang/Lex/Lexer.h:609
- bool CheckUnicodeWhitespace(Token &Result, uint32_t C, const char *CurPtr);
+ bool CheckUnicodeWhitespace(Token &Result, uint32_t C, unsigned CurOffset);
----------------
davrec wrote:
> FWIW it sucks that `uint32_t` is already sprinkled throughout the interface alongside `unsigned`, wish just one was used consistently, but that does not need to be addressed in this patch.
here, `uint32_t` is a codepoint, should arguably be `char32_t` instead. but agreed, not in this patch.
================
Comment at: clang/lib/Lex/Lexer.cpp:213
+ L->BufferSize = L->BufferOffset + TokLen;
+ assert(L->BufferStart[L->BufferSize] == 0 && "Buffer is not nul terminated!");
----------------
================
Comment at: clang/lib/Lex/Lexer.cpp:1203
if (L && !L->isLexingRawMode())
- L->Diag(CP-2, diag::trigraph_ignored);
+ L->Diag(CP - 2 - L->getBuffer().data(), diag::trigraph_ignored);
return 0;
----------------
sunho wrote:
> shafik wrote:
> > I wonder do we really need to do these pointer gymnastics, maybe making this a member function would eliminate the need for it.
> Yes, we can change it to offset here.
Agreed, that would be nice! (In all places `getBuffer().data()` is used)
================
Comment at: clang/lib/Lex/Lexer.cpp:1353
+ if (unsigned EscapedNewLineSize =
+ getEscapedNewLineSize(&BufferStart[Offset])) {
// Remember that this token needs to be cleaned.
----------------
================
Comment at: clang/lib/Lex/Lexer.cpp:1378
// a trigraph warning. If so, and if trigraphs are enabled, return it.
- if (char C = DecodeTrigraphChar(Ptr + 2, Tok ? this : nullptr,
- LangOpts.Trigraphs)) {
+ if (char C = DecodeTrigraphChar(&BufferStart[Offset + 2],
+ Tok ? this : nullptr, LangOpts.Trigraphs)) {
----------------
================
Comment at: clang/lib/Lex/Lexer.cpp:1740
+bool Lexer::tryConsumeIdentifierUTF8Char(unsigned &CurOffset) {
+ const char *UnicodePtr = &BufferStart[CurOffset];
llvm::UTF32 CodePoint;
----------------
================
Comment at: clang/lib/Lex/Lexer.cpp:1744
llvm::convertUTF8Sequence((const llvm::UTF8 **)&UnicodePtr,
- (const llvm::UTF8 *)BufferEnd,
- &CodePoint,
- llvm::strictConversion);
+ (const llvm::UTF8 *)&BufferStart[BufferSize],
+ &CodePoint, llvm::strictConversion);
----------------
Ditto in all similar places, I think it reads easier
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D143142/new/
https://reviews.llvm.org/D143142
More information about the cfe-commits
mailing list