[llvm] 8f6185c - [AsmParser][ms][X86] Fix possible misbehaviour in parsing of special tokens at start of string.
Anirudh Prasad via llvm-commits
llvm-commits at lists.llvm.org
Wed Apr 21 07:21:15 PDT 2021
Author: Anirudh Prasad
Date: 2021-04-21T10:21:09-04:00
New Revision: 8f6185c71378467e8c27e8996ff2dbaed3e41dbc
URL: https://github.com/llvm/llvm-project/commit/8f6185c71378467e8c27e8996ff2dbaed3e41dbc
DIFF: https://github.com/llvm/llvm-project/commit/8f6185c71378467e8c27e8996ff2dbaed3e41dbc.diff
LOG: [AsmParser][ms][X86] Fix possible misbehaviour in parsing of special tokens at start of string.
- Previously, https://reviews.llvm.org/D72680 introduced a new attribute called `AllowSymbolAtNameStart` (in relation to the MAsmParser changes) in `MCAsmInfo.h` which (according to the comment in the header) allows the following behaviour:
```
/// This is true if the assembler allows $ @ ? characters at the start of
/// symbol names. Defaults to false.
```
- However, the usage of this field in AsmLexer.cpp doesn't seem completely accurate* for a couple of reasons.
```
default:
if (MAI.doesAllowSymbolAtNameStart()) {
// Handle Microsoft-style identifier: [a-zA-Z_$.@?][a-zA-Z0-9_$.@#?]*
if (!isDigit(CurChar) &&
isIdentifierChar(CurChar, MAI.doesAllowAtInName(),
AllowHashInIdentifier))
return LexIdentifier();
}
```
1. The Dollar and At tokens, when occurring at the start of the string, are treated as separate tokens (AsmToken::Dollar and AsmToken::At respectively) and not lexed as an Identifier.
2. I'm not too sure why `MAI.doesAllowAtInName()` is used when `AllowAtInIdentifier` could be used. For X86 platforms, afaict, this shouldn't be an issue, since the `CommentString` attribute isn't "@". (alternatively the call to the setter can be set anywhere else as needed). The `AllowAtInName` does have an additional important meaning, but in the context of AsmLexer, shouldn't mean anything different compared to `AllowAtInIdentifier`
My proposal is the following:
- Introduce 3 new fields called `AllowQuestionTokenAtStartOfString`, `AllowDollarTokenAtStartOfString` and `AllowAtTokenAtStartOfString` in MCAsmInfo.h which will encapsulate the previously documented behaviour of "allowing $, @, ? characters at the start of symbol names")
- Introduce these fields where "$", "@" are lexed, and treat them as identifiers depending on whether `Allow[Dollar|At]TokenAtStartOfString` is set.
- For the sole case of "?", append it to the existing logic for treating a "default" token as an Identifier.
z/OS (HLASM) will also make use of some of these fields in follow up patches.
completely accurate* - This was based on the comments and the intended behaviour the code. I might have completely misinterpreted it, and if that is the case my sincere apologies. We can close this patch if necessary, if there are no changes to be made :)
Depends on https://reviews.llvm.org/D99374
Reviewed By: Jonathan.Crowther
Differential Revision: https://reviews.llvm.org/D99889
Added:
Modified:
llvm/include/llvm/MC/MCAsmInfo.h
llvm/lib/MC/MCParser/AsmLexer.cpp
llvm/lib/Target/X86/MCTargetDesc/X86MCAsmInfo.cpp
llvm/unittests/MC/SystemZ/SystemZAsmLexerTest.cpp
Removed:
################################################################################
diff --git a/llvm/include/llvm/MC/MCAsmInfo.h b/llvm/include/llvm/MC/MCAsmInfo.h
index 656cb29e2130..250d219d100e 100644
--- a/llvm/include/llvm/MC/MCAsmInfo.h
+++ b/llvm/include/llvm/MC/MCAsmInfo.h
@@ -181,9 +181,26 @@ class MCAsmInfo {
/// Defaults to false.
bool AllowAtInName = false;
- /// This is true if the assembler allows $ @ ? characters at the start of
- /// symbol names. Defaults to false.
- bool AllowSymbolAtNameStart = false;
+ /// This is true if the assembler allows the "?" character at the start of
+ /// of a string to be lexed as an AsmToken::Identifier.
+ /// If the CommentString is also set to "?", setting this option will have
+ /// no effect, and the string will be lexed as a comment.
+ /// Defaults to false.
+ bool AllowQuestionAtStartOfIdentifier = false;
+
+ /// This is true if the assembler allows the "$" character at the start of
+ /// of a string to be lexed as an AsmToken::Identifier.
+ /// If the CommentString is also set to "$", setting this option will have
+ /// no effect, and the string will be lexed as a comment.
+ /// Defaults to false.
+ bool AllowDollarAtStartOfIdentifier = false;
+
+ /// This is true if the assembler allows the "@" character at the start of
+ /// a string to be lexed as an AsmToken::Identifier.
+ /// If the CommentString is also set to "@", setting this option will have
+ /// no effect, and the string will be lexed as a comment.
+ /// Defaults to false.
+ bool AllowAtAtStartOfIdentifier = false;
/// If this is true, symbol names with invalid characters will be printed in
/// quotes.
@@ -600,7 +617,15 @@ class MCAsmInfo {
const char *getCode64Directive() const { return Code64Directive; }
unsigned getAssemblerDialect() const { return AssemblerDialect; }
bool doesAllowAtInName() const { return AllowAtInName; }
- bool doesAllowSymbolAtNameStart() const { return AllowSymbolAtNameStart; }
+ bool doesAllowQuestionAtStartOfIdentifier() const {
+ return AllowQuestionAtStartOfIdentifier;
+ }
+ bool doesAllowAtAtStartOfIdentifier() const {
+ return AllowAtAtStartOfIdentifier;
+ }
+ bool doesAllowDollarAtStartOfIdentifier() const {
+ return AllowDollarAtStartOfIdentifier;
+ }
bool supportsNameQuoting() const { return SupportsQuotedNames; }
bool doesSupportDataRegionDirectives() const {
diff --git a/llvm/lib/MC/MCParser/AsmLexer.cpp b/llvm/lib/MC/MCParser/AsmLexer.cpp
index 5fe3be42c801..1c33147aa1e7 100644
--- a/llvm/lib/MC/MCParser/AsmLexer.cpp
+++ b/llvm/lib/MC/MCParser/AsmLexer.cpp
@@ -144,7 +144,7 @@ AsmToken AsmLexer::LexHexFloatLiteral(bool NoIntDigits) {
return AsmToken(AsmToken::Real, StringRef(TokStart, CurPtr - TokStart));
}
-/// LexIdentifier: [a-zA-Z_.][a-zA-Z0-9_$.@#?]*
+/// LexIdentifier: [a-zA-Z_$.@?][a-zA-Z0-9_$.@#?]*
static bool isIdentifierChar(char C, bool AllowAt, bool AllowHash) {
return isAlnum(C) || C == '_' || C == '$' || C == '.' || C == '?' ||
(AllowAt && C == '@') || (AllowHash && C == '#');
@@ -769,17 +769,10 @@ AsmToken AsmLexer::LexToken() {
IsAtStartOfStatement = false;
switch (CurChar) {
default:
- if (MAI.doesAllowSymbolAtNameStart()) {
- // Handle Microsoft-style identifier: [a-zA-Z_$.@?][a-zA-Z0-9_$.@#?]*
- if (!isDigit(CurChar) &&
- isIdentifierChar(CurChar, MAI.doesAllowAtInName(),
- AllowHashInIdentifier))
- return LexIdentifier();
- } else {
- // Handle identifier: [a-zA-Z_.][a-zA-Z0-9_$.@]*
- if (isalpha(CurChar) || CurChar == '_' || CurChar == '.')
- return LexIdentifier();
- }
+ // Handle identifier: [a-zA-Z_.?][a-zA-Z0-9_$.@#?]*
+ if (isalpha(CurChar) || CurChar == '_' || CurChar == '.' ||
+ (MAI.doesAllowQuestionAtStartOfIdentifier() && CurChar == '?'))
+ return LexIdentifier();
// Unknown character, emit an error.
return ReturnError(TokStart, "invalid character in input");
@@ -823,13 +816,18 @@ AsmToken AsmLexer::LexToken() {
case '}': return AsmToken(AsmToken::RCurly, StringRef(TokStart, 1));
case '*': return AsmToken(AsmToken::Star, StringRef(TokStart, 1));
case ',': return AsmToken(AsmToken::Comma, StringRef(TokStart, 1));
- case '$':
- if (LexMotorolaIntegers && isHexDigit(*CurPtr)) {
+ case '$': {
+ if (LexMotorolaIntegers && isHexDigit(*CurPtr))
return LexDigit();
- }
-
+ if (MAI.doesAllowDollarAtStartOfIdentifier())
+ return LexIdentifier();
return AsmToken(AsmToken::Dollar, StringRef(TokStart, 1));
- case '@': return AsmToken(AsmToken::At, StringRef(TokStart, 1));
+ }
+ case '@': {
+ if (MAI.doesAllowAtAtStartOfIdentifier())
+ return LexIdentifier();
+ return AsmToken(AsmToken::At, StringRef(TokStart, 1));
+ }
case '\\': return AsmToken(AsmToken::BackSlash, StringRef(TokStart, 1));
case '=':
if (*CurPtr == '=') {
diff --git a/llvm/lib/Target/X86/MCTargetDesc/X86MCAsmInfo.cpp b/llvm/lib/Target/X86/MCTargetDesc/X86MCAsmInfo.cpp
index c294da6baffa..ebbdce442cf3 100644
--- a/llvm/lib/Target/X86/MCTargetDesc/X86MCAsmInfo.cpp
+++ b/llvm/lib/Target/X86/MCTargetDesc/X86MCAsmInfo.cpp
@@ -144,7 +144,9 @@ X86MCAsmInfoMicrosoftMASM::X86MCAsmInfoMicrosoftMASM(const Triple &Triple)
DollarIsPC = true;
SeparatorString = "\n";
CommentString = ";";
- AllowSymbolAtNameStart = true;
+ AllowQuestionAtStartOfIdentifier = true;
+ AllowDollarAtStartOfIdentifier = true;
+ AllowAtAtStartOfIdentifier = true;
}
void X86MCAsmInfoGNUCOFF::anchor() { }
diff --git a/llvm/unittests/MC/SystemZ/SystemZAsmLexerTest.cpp b/llvm/unittests/MC/SystemZ/SystemZAsmLexerTest.cpp
index d7e90f1b9a24..12074ddcbd19 100644
--- a/llvm/unittests/MC/SystemZ/SystemZAsmLexerTest.cpp
+++ b/llvm/unittests/MC/SystemZ/SystemZAsmLexerTest.cpp
@@ -35,6 +35,15 @@ class MockedUpMCAsmInfo : public MCAsmInfo {
void setAllowAdditionalComments(bool Value) {
AllowAdditionalComments = Value;
}
+ void setAllowQuestionAtStartOfIdentifier(bool Value) {
+ AllowQuestionAtStartOfIdentifier = Value;
+ }
+ void setAllowAtAtStartOfIdentifier(bool Value) {
+ AllowAtAtStartOfIdentifier = Value;
+ }
+ void setAllowDollarAtStartOfIdentifier(bool Value) {
+ AllowDollarAtStartOfIdentifier = Value;
+ }
};
// Setup a testing class that the GTest framework can call.
@@ -454,4 +463,112 @@ TEST_F(SystemZAsmLexerTest, CheckDefaultFloats) {
ExpectedTokens.push_back(AsmToken::Eof);
lexAndCheckTokens(AsmStr, ExpectedTokens);
}
+
+TEST_F(SystemZAsmLexerTest, CheckDefaultQuestionAtStartOfIdentifier) {
+ StringRef AsmStr = "?lh1?23";
+
+ // Setup.
+ setupCallToAsmParser(AsmStr);
+
+ // Lex initially to get the string.
+ Parser->getLexer().Lex();
+
+ SmallVector<AsmToken::TokenKind> ExpectedTokens(
+ {AsmToken::Error, AsmToken::Identifier, AsmToken::EndOfStatement,
+ AsmToken::Eof});
+ lexAndCheckTokens(AsmStr, ExpectedTokens);
+}
+
+TEST_F(SystemZAsmLexerTest, CheckAcceptQuestionAtStartOfIdentifier) {
+ StringRef AsmStr = "?????lh1?23";
+
+ // Setup.
+ MUPMAI->setAllowQuestionAtStartOfIdentifier(true);
+ setupCallToAsmParser(AsmStr);
+
+ // Lex initially to get the string.
+ Parser->getLexer().Lex();
+
+ SmallVector<AsmToken::TokenKind> ExpectedTokens(
+ {AsmToken::Identifier, AsmToken::EndOfStatement, AsmToken::Eof});
+ lexAndCheckTokens(AsmStr, ExpectedTokens);
+}
+
+TEST_F(SystemZAsmLexerTest, CheckDefaultAtAtStartOfIdentifier) {
+ StringRef AsmStr = "@@lh1?23";
+
+ // Setup.
+ MUPMAI->setAllowQuestionAtStartOfIdentifier(true);
+ setupCallToAsmParser(AsmStr);
+
+ // Lex initially to get the string.
+ Parser->getLexer().Lex();
+
+ SmallVector<AsmToken::TokenKind> ExpectedTokens(
+ {AsmToken::At, AsmToken::At, AsmToken::Identifier,
+ AsmToken::EndOfStatement, AsmToken::Eof});
+ lexAndCheckTokens(AsmStr, ExpectedTokens);
+}
+
+TEST_F(SystemZAsmLexerTest, CheckAcceptAtAtStartOfIdentifier) {
+ StringRef AsmStr = "@@lh1?23";
+
+ // Setup.
+ MUPMAI->setAllowAtAtStartOfIdentifier(true);
+ setupCallToAsmParser(AsmStr);
+
+ // Lex initially to get the string.
+ Parser->getLexer().Lex();
+
+ SmallVector<AsmToken::TokenKind> ExpectedTokens(
+ {AsmToken::Identifier, AsmToken::EndOfStatement, AsmToken::Eof});
+ lexAndCheckTokens(AsmStr, ExpectedTokens);
+}
+
+TEST_F(SystemZAsmLexerTest, CheckAccpetAtAtStartOfIdentifier2) {
+ StringRef AsmStr = "@@lj1?23";
+
+ // Setup.
+ MUPMAI->setCommentString("@");
+ MUPMAI->setAllowAtAtStartOfIdentifier(true);
+ setupCallToAsmParser(AsmStr);
+
+ // Lex initially to get the string.
+ Parser->getLexer().Lex();
+
+ // "@@lj1?23" -> still lexed as a comment as that takes precedence.
+ SmallVector<AsmToken::TokenKind> ExpectedTokens(
+ {AsmToken::EndOfStatement, AsmToken::Eof});
+ lexAndCheckTokens(AsmStr, ExpectedTokens);
+}
+
+TEST_F(SystemZAsmLexerTest, CheckDefaultDollarAtStartOfIdentifier) {
+ StringRef AsmStr = "$$ac$c";
+
+ // Setup.
+ setupCallToAsmParser(AsmStr);
+
+ // Lex initially to get the string.
+ Parser->getLexer().Lex();
+
+ SmallVector<AsmToken::TokenKind> ExpectedTokens(
+ {AsmToken::Dollar, AsmToken::Dollar, AsmToken::Identifier,
+ AsmToken::EndOfStatement, AsmToken::Eof});
+ lexAndCheckTokens(AsmStr, ExpectedTokens);
+}
+
+TEST_F(SystemZAsmLexerTest, CheckAcceptDollarAtStartOfIdentifier) {
+ StringRef AsmStr = "$$ab$c";
+
+ // Setup.
+ MUPMAI->setAllowDollarAtStartOfIdentifier(true);
+ setupCallToAsmParser(AsmStr);
+
+ // Lex initially to get the string.
+ Parser->getLexer().Lex();
+
+ SmallVector<AsmToken::TokenKind> ExpectedTokens(
+ {AsmToken::Identifier, AsmToken::EndOfStatement, AsmToken::Eof});
+ lexAndCheckTokens(AsmStr, ExpectedTokens);
+}
} // end anonymous namespace
More information about the llvm-commits
mailing list