[PATCH] D100346: [Clang] String Literal and Wide String Literal Encoding from the Preprocessor

Mon Apr 12 15:03:19 PDT 2021

ThePhD created this revision.
ThePhD added a reviewer: aaron.ballman.
ThePhD added a project: clang.
ThePhD requested review of this revision.
Herald added a subscriber: cfe-commits.

**//Short version://**

Please let us know the encoding that the compiler chooses for its implementation-defined, non-Unicode string literals so we can support users properly in cross-platform code.

**//Prior Art://**

A similar feature has already been patch-reviewed and merged into GCC trunk (I implemented it), ready to go out the door with GCC 11. It is compiler-specific, and that is intentional. It solved a user's bug report there.

I also put in a Feature Request for MSVC, where it is likewise recommended to be compiler-specific. They are currently suffering the very interesting consequences of not handling it sooner: rather than having a compiler macro for it, standard library developers have had to come up with library-based workarounds (and some praying) to determine the charset of their string literals for their std::fmt implementation: https://github.com/microsoft/STL/pull/1824 | https://github.com/microsoft/STL/issues/1576 | https://developercommunity.visualstudio.com/content/idea/1160821/-compiler-feature-macro-for-narrow-literal-foo-enc.html

The C++ Standards Committee's Study Group 16 (Unicode) approved a paper, currently under LEWG review, for determining the string literal and wide string literal encoding at both compile-time and runtime. This patch prepares for the compile-time portion of that detection, for which Corentin Jabot has already created a proof-of-concept for Clang, GCC and MSVC: https://wg21.link/p1885

I missed the 12 release, so I hope this makes it for 13.

**//Long version://**

In C and C++, both "narrow"/"multibyte" string literals (e.g. "foo") and "wide" string literals (e.g. L"foo") have an associated encoding defined by the implementation. Recently, a review has kicked off for adding new "execution encodings" (e.g., string literal encodings) to Clang's Preprocessor and, subsequently, to the C and C++ frontends.

I left a comment asking for this to be taken care of, but I'm certain the comment was drowned out by other contributions in both the -fexec-charset addition and the "Add support for iconv encodings and other things" patch review at:

iconv Literal Converter: https://reviews.llvm.org/D88741
-fexec-charset Enabling Patch: https://reviews.llvm.org/D93031

Whether or not this gets updated in the related (but not required) patches, this is necessary to successfully inform the end user on a Clang machine what the wide string literal and the narrow string literal encodings are.
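
As a rough illustration of how an end user might consume these macros, here is a minimal sketch. It assumes the two macros expand to string literals, as in this patch. The `MY_*` macros and `my_*` names are hypothetical and exist only for this example. The GCC fallbacks (`__GNUC_EXECUTION_CHARSET_NAME` and `__GNUC_WIDE_EXECUTION_CHARSET_NAME`) are, to my knowledge, the names used by the GCC 11 feature mentioned under Prior Art.

```cpp
// Pick whichever compiler-specific macro is available (MY_* names are hypothetical).
#if defined(__clang_literal_encoding__)
#  define MY_NARROW_LITERAL_ENCODING __clang_literal_encoding__
#elif defined(__GNUC_EXECUTION_CHARSET_NAME)
#  define MY_NARROW_LITERAL_ENCODING __GNUC_EXECUTION_CHARSET_NAME
#else
#  define MY_NARROW_LITERAL_ENCODING "unknown"
#endif

#if defined(__clang_wide_literal_encoding__)
#  define MY_WIDE_LITERAL_ENCODING __clang_wide_literal_encoding__
#elif defined(__GNUC_WIDE_EXECUTION_CHARSET_NAME)
#  define MY_WIDE_LITERAL_ENCODING __GNUC_WIDE_EXECUTION_CHARSET_NAME
#else
#  define MY_WIDE_LITERAL_ENCODING "unknown"
#endif

// Exact, case-sensitive comparison of two encoding names, usable at compile time.
constexpr bool my_names_equal(const char *a, const char *b) {
  return *a == *b && (*a == '\0' || my_names_equal(a + 1, b + 1));
}

// True only when the compiler reports exactly "UTF-8" for narrow literals.
constexpr bool my_narrow_literals_are_utf8 =
    my_names_equal(MY_NARROW_LITERAL_ENCODING, "UTF-8");

// True only when the compiler reports exactly "UTF-16" or "UTF-32" for wide literals.
constexpr bool my_wide_literals_are_utf16_or_utf32 =
    my_names_equal(MY_WIDE_LITERAL_ENCODING, "UTF-16") ||
    my_names_equal(MY_WIDE_LITERAL_ENCODING, "UTF-32");
```

Cross-platform code could then branch on `my_narrow_literals_are_utf8` (for example in a `static_assert` or a template parameter) instead of guessing the literal encoding at runtime. Note that the comparison is exact, so a compiler reporting a differently spelled name would need its own case.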

We use the size of the wide character type (`wchar_t`) to inform our decision: Windows and old-style 32-bit IBM machines have a 16-bit `wchar_t` and use UTF-16, while most Linux distributions have a 32-bit `wchar_t` and use UTF-32. (This is not the case for IBM and other machines of specific make in China, Japan, and Korea, but I suspect Clang has not been ported to work on such machines.)
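
The width-based decision described above can be sketched in user-level C++ as follows. This only mirrors the heuristic. It is not the patch's code, and the function and variable names are hypothetical.

```cpp
#include <climits>  // CHAR_BIT

// Hypothetical helper mirroring the heuristic: a wchar_t of at least
// 32 bits is taken to mean UTF-32, anything narrower is taken to mean UTF-16.
constexpr const char *guess_wide_literal_encoding(unsigned wchar_bits) {
  return wchar_bits >= 32 ? "UTF-32" : "UTF-16";
}

// Hypothetical constexpr string equality so the guess can be checked at compile time.
constexpr bool same_encoding_name(const char *a, const char *b) {
  return *a == *b && (*a == '\0' || same_encoding_name(a + 1, b + 1));
}

// The guess for whatever target this translation unit is compiled for.
constexpr const char *this_target_wide_encoding =
    guess_wide_literal_encoding(sizeof(wchar_t) * CHAR_BIT);
```

On a target where `wchar_t` is 16 bits wide this yields "UTF-16", and where it is 32 bits wide it yields "UTF-32". As the parenthetical above notes, the guess is an assumption that does not hold on every platform.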

Knowing the literal and execution encodings is also of great importance to the C++ Standard Committee in general, as they have work coming down the pipeline that has been generally approved by SG16 and favorably reviewed by LEWG that will make use of such functionality soon, as mentioned in the Prior Art section above: https://wg21.link/p1885

Please consider making life easier for everyone who cares about portable encodings, and please consider making the work on `-fexec-charset` and `-fwide-exec-charset`.


Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D100346

Files:
  clang/lib/Frontend/InitPreprocessor.cpp
  clang/test/Preprocessor/init.c


Index: clang/test/Preprocessor/init.c
===================================================================

--- clang/test/Preprocessor/init.c
+++ clang/test/Preprocessor/init.c
@@ -119,6 +119,8 @@
 // COMMON:#define __clang_minor__ {{[0-9]+}}
 // COMMON:#define __clang_patchlevel__ {{[0-9]+}}
 // COMMON:#define __clang_version__ {{.*}}
+// COMMON:#define __clang_literal_encoding__ {{.*}}
+// COMMON:#define __clang_wide_literal_encoding__ {{.*}}
 // COMMON:#define __llvm__ 1
 //
 // RUN: %clang_cc1 -E -dM -triple=x86_64-pc-win32 < /dev/null | FileCheck -match-full-lines -check-prefix C-DEFAULT %s
Index: clang/lib/Frontend/InitPreprocessor.cpp
===================================================================
--- clang/lib/Frontend/InitPreprocessor.cpp
+++ clang/lib/Frontend/InitPreprocessor.cpp
@@ -778,6 +778,20 @@
     }
   }
 
+  // macros to help identify the narrow and wide character sets
+  // NOTE: clang currently ignores -fexec-charset=. If this changes,
+  // then this may need to change.
+  Builder.defineMacro("__clang_literal_encoding__", "\"UTF-8\"");
+  if (TI.getTypeWidth(TI.getWCharType()) >= 32) {
+    // NOTE: 32-bit wchar_t signals UTF-32. This may change if 
+    // -fwide-exec-charset= is ever supported.
+    Builder.defineMacro("__clang_wide_literal_encoding__", "\"UTF-32\"");
+  } else {
+    // NOTE: Less-than 32-bit wchar_t generally means UTF-16 (e.g., Windows, 32-bit IBM).
+    // This may change if -fwide-exec-charset= is ever supported.
+    Builder.defineMacro("__clang_wide_literal_encoding__", "\"UTF-16\"");
+  }
+
   if (LangOpts.Optimize)
     Builder.defineMacro("__OPTIMIZE__");
   if (LangOpts.OptimizeSize)

