[PATCH] D105759: Implement P2361 Unevaluated string literals

Wed Jun 28 12:30:53 PDT 2023

cor3ntin added a comment.

In D105759#4456864 <https://reviews.llvm.org/D105759#4456864>, @aaron.ballman wrote:

> I don't think it's correct to assume that all string arguments to attributes are unevaluated, but it is hard to tell where to draw the line sometimes. Backing up a step, as I understand P2361 <https://reviews.llvm.org/P2361>, an unevaluated string is one which is not converted into the execution character set (effectively). Is that correct? If so, then as an example, `[[clang::annotate()]]` should almost certainly be using an evaluated string because the argument is passed down to LLVM IR and is used in ways we cannot predict. What's more, an unevaluated string cannot have some kinds of escape characters (numeric and conditional escape sequences) and those are currently allowed by `clang::annotate` and could potentially be used by a backend plugin.
>
> I think other attributes may have similar issues. For example, the `alias` attribute is a bit of a question mark for me -- that takes a string literal representing an external identifier that is looked up. I'm not certain whether that should be in the execution character set or not, but we do support escape sequences for it: https://godbolt.org/z/v65Yd7a68
>
> I think we need to track evaluated vs not on the argument level so that the attributes in Attr.td can decide which form to use. I think we should default to "evaluated" for any attribute we're on the fence about because that's the current behavior they get today (so we should avoid regressions).

I really don't think it makes sense to have both "unevaluated" and "evaluated" arguments.
We chatted offline and struggled to find places where escape sequences are used, or examples of attributes intended to be in the execution character set.

My suggestion would be to land the non-attribute changes now, and the attribute bits early in Clang 18.
If we find a clear example of an attribute expecting the execution character set, its argument should be describable as an expression, which will be checked as a string literal anyway, hopefully?

In the case of annotate, if these strings are fed, for example, to a debugger, they may need to be converted to whatever encoding the debugger expects, which is not necessarily the execution charset.
The same goes for plugins: they certainly do not expect EBCDIC data, for example.
I would expect, for example, static analyzers and code generators to keep working after the introduction of -fexec-charset.
So it's important that the string remains unevaluated in the front end so that it can be correctly converted to the appropriate encoding for each consumer, and there is no single answer for what that encoding is.
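
To make the per-consumer point concrete, here is a minimal sketch, not anything in this patch. It assumes the front end keeps the attribute string as UTF-8 and that each consumer asks for its own encoding. `ConsumerEncoding` and `encodeFor` are hypothetical names, and Latin-1 stands in for a non-UTF-8 target only to keep the example short (C++17 for `std::optional`).

```cpp
#include <cstddef>
#include <optional>
#include <string>

// Hypothetical: the encoding a given consumer (debugger, plugin, ...) wants.
enum class ConsumerEncoding { UTF8, Latin1 };

// Hypothetical helper: convert the UTF-8 string kept by the front end to the
// encoding requested by one consumer. Returns nullopt when a character cannot
// be represented, so each consumer can decide how to diagnose that.
std::optional<std::string> encodeFor(const std::string &utf8,
                                     ConsumerEncoding enc) {
  if (enc == ConsumerEncoding::UTF8)
    return utf8; // Already in the stored form, nothing to do.

  std::string out;
  for (std::size_t i = 0; i < utf8.size(); ++i) {
    unsigned char b = static_cast<unsigned char>(utf8[i]);
    if (b < 0x80) { // ASCII maps to itself in Latin-1.
      out.push_back(static_cast<char>(b));
      continue;
    }
    // Only U+0080..U+00FF fit in Latin-1; in UTF-8 those are two-byte
    // sequences whose lead byte is 0xC2 or 0xC3.
    if ((b != 0xC2 && b != 0xC3) || i + 1 >= utf8.size())
      return std::nullopt;
    unsigned char c = static_cast<unsigned char>(utf8[++i]);
    if ((c & 0xC0) != 0x80) // Not a continuation byte: malformed input.
      return std::nullopt;
    out.push_back(static_cast<char>(((b & 0x03) << 6) | (c & 0x3F)));
  }
  return out;
}
```

With this shape, `encodeFor("caf\xC3\xA9", ConsumerEncoding::Latin1)` yields the four bytes `c a f 0xE9`, while a string containing U+20AC yields no value for Latin-1 and is passed through untouched for UTF-8.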

> Do we know of any attributes in the "needs more thinking" list that should have the string literal encoded in the execution character set? I think most of these are for referring to identifiers in source and I expect those would want source character set and not execution character set strings.

Identifiers and symbol names are in UTF-8, and may get mangled, for example by replacing non-ASCII code points with UCNs. The source character set is never relevant.
This addresses the WebAssembly attributes.
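
As an illustration of the kind of mangling meant here, the following sketch rewrites a UTF-8 name so that every non-ASCII code point becomes a UCN. This is an assumption for illustration only, not how Clang or any particular target actually mangles names. `escapeNonAsciiAsUCN` is a hypothetical helper, and it assumes well-formed UTF-8 input.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper, not Clang's mangling: replace each non-ASCII code point
// in a UTF-8 string with \uXXXX (BMP) or \UXXXXXXXX (beyond the BMP).
// Assumes well-formed UTF-8; malformed sequences are not diagnosed.
std::string escapeNonAsciiAsUCN(const std::string &utf8) {
  std::string out;
  for (std::size_t i = 0; i < utf8.size();) {
    unsigned char lead = static_cast<unsigned char>(utf8[i]);
    if (lead < 0x80) { // ASCII is copied through unchanged.
      out.push_back(static_cast<char>(lead));
      ++i;
      continue;
    }
    // Number of continuation bytes implied by the lead byte.
    int extra = lead >= 0xF0 ? 3 : lead >= 0xE0 ? 2 : 1;
    // Payload bits of the lead byte: 5, 4, or 3 bits respectively.
    std::uint32_t cp = lead & (0x3F >> extra);
    ++i;
    for (int k = 0; k < extra && i < utf8.size(); ++k, ++i)
      cp = (cp << 6) | (static_cast<unsigned char>(utf8[i]) & 0x3F);

    char buf[11]; // "\U" + 8 hex digits + NUL.
    if (cp <= 0xFFFF)
      std::snprintf(buf, sizeof buf, "\\u%04X", static_cast<unsigned>(cp));
    else
      std::snprintf(buf, sizeof buf, "\\U%08X", static_cast<unsigned>(cp));
    out += buf;
  }
  return out;
}
```

Nothing in this transformation depends on the source or execution character set: the input is UTF-8 and the output is pure ASCII.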

> BTFDeclTag/BTFTypeTag (is emitted to DWARF with -g so probably evaluated?)

Is it correct to assume the debugger file encoding is always the same as the program's? Probably not!
If need be, we can then transcode the strings when doing codegen for these things.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D105759/new/

https://reviews.llvm.org/D105759