[PATCH] D103426: Clang: Extend format string checking to wprintf/wscanf

Mon Jul 26 20:37:57 PDT 2021

tahonermann added inline comments.

================
Comment at: clang/include/clang/AST/Expr.h:1846-1871
+  std::string getStringAsChar() const {
+    assert(getCharByteWidth() == 1 &&
+           "This function is used in places that assume strings use char");
+    return std::string(getTrailingObjects<char>(), getTrailingObjects<char>() + getByteLength());
+  }
+                                  
+  std::u16string getStringAsChar16() const {
----------------
MarcusJohnson91 wrote:
> cor3ntin wrote:
> > aaron.ballman wrote:
> > > One potential issue to this is that in C++, these types are defined to be UTF-16 and UTF-32, whereas in C, that isn't always the case. Currently, Clang defines `__STDC_UTF_16__` and `__STDC_UTF_32__` on all targets, but we may need to add some sanity checks to catch if a target overrides that behavior. Currently, all the targets in Clang are keeping that promise, but I have no idea what shenanigans downstream users get up to and whether their targets remove the macro definition or define it to `0` instead of `1`.
> > Is it possible that the host and target wchar_t have a different size here?
> I've honestly been wondering how Clang handled that, in the codebase vs at runtime myself for a while.
WG14 N2728 (http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2728.htm) proposes specifying that `char16_t` and `char32_t` string literals are UTF-16 and UTF-32 encoded respectively.  SG16 previously tried to identify any implementations that use a different encoding for these and was unable to find one.  I think it is safe to only support these encodings for `u` and `U` prefixed string literals.

================
Comment at: clang/lib/AST/ScanfFormatString.cpp:344-347
+        case LengthModifier::AsUTF16:
+          return ArgType(ArgType::Char16Ty);
+        case LengthModifier::AsUTF32:
+          return ArgType(ArgType::Char32Ty);
----------------
Should these additions not include a `PtrTo` wrapper as is done for other types?

================
Comment at: clang/lib/AST/Type.cpp:1980-2002
+bool Type::isChar16Type(const LangOptions &LangOpts) const {
   if (const auto *BT = dyn_cast<BuiltinType>(CanonicalType))
-    return BT->getKind() == BuiltinType::Char16;
-  return false;
-}
-
-bool Type::isChar32Type() const {
-  if (const auto *BT = dyn_cast<BuiltinType>(CanonicalType))
-    return BT->getKind() == BuiltinType::Char32;
-  return false;
-}
-
-/// Determine whether this type is any of the built-in character
-/// types.
-bool Type::isAnyCharacterType() const {
-  const auto *BT = dyn_cast<BuiltinType>(CanonicalType);
-  if (!BT) return false;
-  switch (BT->getKind()) {
-  default: return false;
-  case BuiltinType::Char_U:
-  case BuiltinType::UChar:
-  case BuiltinType::WChar_U:
-  case BuiltinType::Char8:
-  case BuiltinType::Char16:
-  case BuiltinType::Char32:
-  case BuiltinType::Char_S:
-  case BuiltinType::SChar:
-  case BuiltinType::WChar_S:
-    return true;
+    if (BT->getKind() == BuiltinType::Char16)
+      return true;
+  if (!LangOpts.CPlusPlus) {
+    return isType("char16_t");
   }
----------------
MarcusJohnson91 wrote:
> aaron.ballman wrote:
> > MarcusJohnson91 wrote:
> > > aaron.ballman wrote:
> > > > If I understand properly, one issue is that `char16_t` is defined by C to be *the same type* as `uint_least16_t`, which itself is a typedef to some other integer type. So we can't make `char16_t` be a distinct type in C, it has to be a typedef to a typedef to an integer type.
> > > > 
> > > > One concern I have about the approach in this patch is that it makes the type system a bit more convoluted. This type is *not* really a character type in C, it's a typedef type that playacts as a character type. I think this is a pretty fundamental change to the type system in the compiler in some ways because this query is no longer about the canonical type but about the semantics of how the type is expected to be used.
> > > > 
> > > > I'd definitely like to hear what @rsmith thinks of this approach.
> > > I see your point, but I'm not sure how else we could get it through the type checking system without doing it this way?
> > > 
> > > I tried to be careful to only allow the actual character typedefs through by making sure char16_t or char32_t is in the typedef chain, and only in C mode.
> > > I see your point, but I'm not sure how else we could get it through the type checking system without doing it this way?
> > 
> > I am thinking that rather than doing this work as part of the `Type` object, the extra work can be done from the format string checker. This keeps the confusion about types localized to the part of the code that needs to care about it, rather than pushing it onto all consumers of the type system.
> Hmmmm, that's a good idea, but I'm not sure how to do that, let me think on it for a bit.
What does the format string checker do for an example like `printf("%Ls", L"text")` today?

================
Comment at: clang/lib/Sema/SemaChecking.cpp:9534
         S, inFunctionCall, Args[format_idx],
-        S.PDiag(diag::warn_format_string_is_wide_literal), FExpr->getBeginLoc(),
+        S.PDiag(diag::err_format_invalid_type), FExpr->getBeginLoc(),
         /*IsStringLocation*/ true, OrigFormatExpr->getSourceRange());
----------------
Perhaps Clang should warn about use of the new extension in strict conformance modes.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D103426/new/

https://reviews.llvm.org/D103426