[PATCH] D105759: [WIP] Implement P2361 Unevaluated string literals

Mon Sep 27 06:30:55 PDT 2021

aaron.ballman added a comment.

Should we also change this to parse an unevaluated string literal: https://github.com/llvm/llvm-project/blob/main/clang/lib/Lex/ModuleMap.cpp#L1594
Similar question for all the uses in OpenMP stemming from: https://github.com/llvm/llvm-project/blob/main/clang/lib/Parse/ParseOpenMP.cpp#L825 (https://www.openmp.org/spec-html/5.1/openmp.html is the OpenMP spec but I didn't see any mention of encoding prefixes when I peeked)

I think we still need to handle UDLs so that we parse the declaration's `""` as an unevaluated string literal.

================
Comment at: clang/include/clang/AST/Expr.h:1845
   StringRef getString() const {
-    assert(getCharByteWidth() == 1 &&
+    assert((isUnevaluated() || getCharByteWidth() == 1) &&
            "This function is used in places that assume strings use char");
----------------
Do we also want to assert that if it is unevaluated, it's char byte width *is* one byte? (No such thing as a multibyte unevaluated string literal.)

================
Comment at: clang/include/clang/Basic/DiagnosticParseKinds.td:58

+
 def warn_misleading_indentation : Warning<
----------------
You can revert the unintended whitespace change to drop this whole file from the review.

================
Comment at: clang/lib/AST/Expr.cpp:1082
   StringLiteralBits.NumConcatenated = NumConcatenated;
+  StringLiteralBits.CharByteWidth = 1;
+
----------------
This should be in an `else` clause along with `StringLiteralBits.IsPascal = false;`.

================
Comment at: clang/lib/Lex/LiteralSupport.cpp:1648
         // have the same ud-suffix.
-        if (UDSuffixBuf != UDSuffix) {
+        const bool UnevaluatedStringHasUDL = Unevaluated && !UDSuffix.empty();
+        if (UDSuffixBuf != UDSuffix || UnevaluatedStringHasUDL) {
----------------

================
Comment at: clang/lib/Lex/LiteralSupport.cpp:95-96
+  case '?':
+  case 'n':
+  case 't':
+    return true;
----------------
aaron.ballman wrote:
> cor3ntin wrote:
> > aaron.ballman wrote:
> > > Do you intend to miss a bunch of escapes like `\'` and `\r` (etc)?
> > \' is there. I am less sure about '\r' and '\a'. for example. This is something I realized after writing P2361.
> > what does '\a` in static assert mean? even '\r' is not so obvious
> Looking at the list again, I think only `\a` is really of interest here. I know some folks like @jfb have mentioned that `\a` could be used to generate an alert sound on a terminal, which is a somewhat useful feature for a failed static assertion if you squint at it hard enough.
> 
> But the rest of the missing ones do seem more questionable to support.
@jfb and @cor3ntin -- any opinions on whether `\a` should be supported? My opinion is that it should be supported because it has some utility for anyone running the compiler from a command line, but it's a pretty weak opinion.

================
Comment at: clang/lib/Lex/Pragma.cpp:807
   if (Tok.is(tok::string_literal) && !Tok.hasUDSuffix()) {
     StringLiteralParser Literal(Tok, PP);
     if (Literal.hadError)
----------------
Should this also be modified?

================
Comment at: clang/lib/Parse/ParseDecl.cpp:374-376
+  ParsedAttr::Kind AttrKind =
+      ParsedAttr::getParsedKind(AttrName, ScopeName, Syntax);
+
----------------
I don't think this needed to move?

================
Comment at: clang/lib/Parse/ParseDecl.cpp:422-423

-        ExprResult ArgExpr(
-            Actions.CorrectDelayedTyposInExpr(ParseAssignmentExpression()));
+        ExprResult ArgExpr(Actions.CorrectDelayedTyposInExpr(
+            ParseAttributeArgAsUnevaluatedLiteralOrExpression(AttrKind)));
         if (ArgExpr.isInvalid()) {
----------------
cor3ntin wrote:
> aaron.ballman wrote:
> > cor3ntin wrote:
> > > aaron.ballman wrote:
> > > > Hmmm, I'm not certain about these changes.
> > > > 
> > > > For some attributes, the standard currently requires accepting any kind of string literal (like `[[deprecated]]` https://eel.is/c++draft/dcl.attr.deprecated#1). P2361 is proposing to change that, but it's not yet accepted by WG21 (let alone WG14). So giving errors in those cases is a bit of a hard sell -- I think warnings would be a bit more reasonable.
> > > > 
> > > > But for other attributes (like `annotate`), it's a bit less clear whether we should *prevent* literal prefixes because the attribute can be used to have runtime impacts (for example, I can imagine someone using `annotate` to emit the string literal bytes into the resulting binary). In some cases, I think it's very reasonable (e.g., `diagnose_if` should behave the same as `deprecated` and `nodiscard` because those are purely about generating diagnostics at compile time).
> > > > 
> > > > I kind of wonder whether we're going to want to tablegenerate whether the argument needs to be parsed as unevaluated or not on an attribute-by-attribute basis.
> > > Yep, I would not expect this to get merge before P2361 but I think the implementation experience is useful and raised a bunch of good questions.
> > > I don't think it ever makes sense to have `L` outside of literals - but people *might* do it currently, in which case there is a concern about whether it breaks code (I have found no evidence of that though).
> > > 
> > > If we wanted to inject these strings in the binary - in some form, then we might have to transcode them at that point.
> > > I don't think the user would know if the string would be injected as wide or narrow (or something else) by the compiler.
> > > 
> > > `L` is really is want to convert that string _at that point_. in an attribute, strings might have multiple usages so it's better to delay any transcoding.
> > > Does that make sense?
> > > 
> > > But I agree that a survey of what each attribute expects is in order.
> > > 
> > > 
> > > 
> > > Yep, I would not expect this to get merge before P2361 but I think the implementation experience is useful and raised a bunch of good questions.
> > 
> > Absolutely agreed, this is worthwhile effort!
> > 
> > > If we wanted to inject these strings in the binary - in some form, then we might have to transcode them at that point.
> > > I don't think the user would know if the string would be injected as wide or narrow (or something else) by the compiler.
> > 
> > My intuition is that a user who writes `L"foo"` will expect a wide `"foo"` to appear in the binary in the cases where the string ends up making it that far.
> > 
> > > L is really is want to convert that string _at that point_. in an attribute, strings might have multiple usages so it's better to delay any transcoding.
> > > Does that make sense?
> > 
> > Not yet, but I might get there eventually. :-D My concern is that vendor attributes can basically do anything, so there's no reason to assume that any given string literal usage should or should not transcode. I think we have to decide on a case by case basis by letting the attributes specify what they intend in their argument lists. However, my intuition is that *most* attributes will expect unevaluated string literals because the string argument doesn't get passed to LLVM.
> The status quo is that everything transcodes.
> 
> But not transcoding, we do not destroy any information as to what is in the source.
> 
> If an attribute then wants to use the string later in such a way that it needs to transcode to a literal encoding (or something else, for example, one might imagine a fun scenario where literal are ASCII encoded and debug information are EBCDIC encoded), then that can be done, because the string still exists.
> 
> Whereas for literal specifically, we assume they will be evaluated by the abstract machine as per phase 5 so we transcode them immediately. which is destructive. we get away with it because the original spelling is in the source if we need it, and currently, literals are also assumed to be (potentially invalid because of `\x` escape sequences) UTF-8.
> 
> There is an alternative design where string literals are not transcoded until lazily evaluated but I'm not sure there is a big motivation for that.
> 
> So this PR is exactly trying not to force a specific behavior on attributes that I assume can be displayed, put into some form in the binary, or converted to literal which might represent 3 distinct encodings. The parser leaving them as Unicode is the least opinionated thing the parser can possibly do.
> And then each attribute can decide for itself if it needs to transcode, and how to handle any errors if they occur.
> An attribute might decide to keep both a Unicode and non-Unicode spelling around if the string has a dual purpose, etc
> 
> 
> Question though, Is there a scenario in which `\x`/`\0` would actually be useful in the context of attributes? Because if so, then we might need to do something to allow that.
> Question though, Is there a scenario in which \x/\0 would actually be useful in the context of attributes? Because if so, then we might need to do something to allow that.

Emitting binary data is the biggest use case I can think of, but I don't think we have any Clang attributes that do this currently. It's possible there are plugin-based attributes that need that functionality, but it also seems unlikely.

================
Comment at: clang/lib/Sema/SemaDeclAttr.cpp:363

-  if (!Literal || !Literal->isAscii()) {
+  // TODO all StringLiteral here should be unevaluated
+
----------------
I'm not certain what's left to be TOdone here?

================
Comment at: clang/lib/Sema/SemaDeclCXX.cpp:16077
-  if (!Lit->isAscii()) {
-    Diag(LangStr->getExprLoc(), diag::err_language_linkage_spec_not_ascii)
-      << LangStr->getSourceRange();
----------------
This diagnostic can be removed from DiagnosticSemaKinds.td now.

================
Comment at: clang/lib/Sema/SemaExpr.cpp:1825
+ExprResult Sema::ActOnUnevaluatedStringLiteral(ArrayRef<Token> StringToks) {
+  StringLiteralParser Literal(StringToks, PP, true);
+  if (Literal.hadError)
----------------

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D105759/new/

https://reviews.llvm.org/D105759