[PATCH] D103426: Clang: Extend format string checking to wprintf/wscanf

Mon Jun 21 11:43:52 PDT 2021

aaron.ballman added a comment.

I had some time to look into this a bit more today. My review still isn't complete, but there's enough here to warrant some discussion.

================
Comment at: clang/include/clang/AST/Expr.h:1846-1871
+  std::string getStringAsChar() const {
+    assert(getCharByteWidth() == 1 &&
+           "This function is used in places that assume strings use char");
+    return std::string(getTrailingObjects<char>(), getTrailingObjects<char>() + getByteLength());
+  }
+                                  
+  std::u16string getStringAsChar16() const {
----------------
One potential issue to this is that in C++, these types are defined to be UTF-16 and UTF-32, whereas in C, that isn't always the case. Currently, Clang defines `__STDC_UTF_16__` and `__STDC_UTF_32__` on all targets, but we may need to add some sanity checks to catch if a target overrides that behavior. Currently, all the targets in Clang are keeping that promise, but I have no idea what shenanigans downstream users get up to and whether their targets remove the macro definition or define it to `0` instead of `1`.

================
Comment at: clang/include/clang/AST/FormatString.h:83
     AsMAllocate,  // for '%ms', GNU extension to scanf
+    AsUTF16,      // for '%l16(c|s)', soon to be standardized
+    AsUTF32,      // for '%l32(c|s)', soon to be standardized
----------------
May want to drop the "soon to be standardized" given that the proposal hasn't been seen by WG14 yet. I think it's fine to say "Clang extension", though. More on the format specifier itself, below.

================
Comment at: clang/include/clang/AST/Type.h:1975
   bool isBooleanType() const;
+  bool isType(const std::string TypeName) const;
   bool isCharType() const;
----------------
It's not clear from the header file what this function actually *does*. Presumably, everything that can be represented by a `Type` is a type, so I'd assume this returns `true` always. Or should this be a static function?

================
Comment at: clang/include/clang/Basic/DiagnosticSemaKinds.td:6961
+def err_format_invalid_type : Error<
+  "format strings type is invalid">;
 def err_unexpected_interface : Error<
----------------
There are no tests for this diagnostic. Also, should the diagnostic specify which format string type is invalid (given that format strings can contain multiple type markings)?

Also, this replaces an existing diagnostic that's a warning, but is now upgraded to an error. Why is an error appropriate when it was warning before? Did you consider leaving this as a warning but marking it as defaulting to an error, so users can still disable the diagnostic and allow their code to compile?

================
Comment at: clang/lib/AST/Expr.cpp:1197
                                  unsigned *StartTokenByteOffset) const {
-  assert((getKind() == StringLiteral::Ascii ||
-          getKind() == StringLiteral::UTF8) &&
----------------
I don't see changes that make this assertion valid to remove -- have I missed something?

================
Comment at: clang/lib/AST/FormatString.cpp:235-240
+      if (LO.C2x && I + 1 != E && I[0] == '1' && I[1] == '6') {
+        ++I;
+        ++I;
+        lmKind = LengthModifier::AsUTF16;
+        break;
+      } else if (LO.C2x && I + 1 != E && I[0] == '3' && I[1] == '2') {
----------------
I don't think this is a conforming extension to C -- lowercase length modifiers and conversion specifiers are reserved for the standard, and uppercase length modifiers and conversion specifiers are reserved for the implementation. `l16` starts with a lowercase letter, so it's reserved for the standard.

Note: WG14 has been considering extensions in a closely-related space (integer types rather than string or character types) that you may be interested in, if you're not already aware of it: http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2511.pdf. This proposal was not adopted, but did have strong sentiment for something along these lines.

================
Comment at: clang/lib/AST/OSLog.cpp:225
+  
+  StringRef Data(String);
+  
----------------
Is this actually needed, or can you use a `std::string` iterator when calling `ParsePrintfString()`?

================
Comment at: clang/lib/AST/PrintfFormatString.cpp:254-257
+  if (ParseLengthModifier(FS, I, E, LO) && I == E) { // THIS IS WHERE IT FREAKS OUT
     // No more characters left?
     if (Warn)
+      printf("ParsePrintfSpecifier: Incomplete Specifier 8\n");
----------------
Surprising comment and what looks to be some debugging code?

================
Comment at: clang/lib/AST/PrintfFormatString.cpp:635
       }
-      if (LM.getKind() == LengthModifier::AsWide)
-        return ArgType(ArgType::WCStrTy, "wchar_t *");
----------------
It looks like this coverage got lost?

================
Comment at: clang/lib/AST/PrintfFormatString.cpp:844
   case BuiltinType::SChar:
+  case BuiltinType::Char8:
     LM.setKind(LengthModifier::AsChar);
----------------
Is there a reason why you're adding this support for `char16_t` and `char32_t` but not `char8_t` aside from "C doesn't have `char8_t`?" If that's the sole reason,  you might be interested in tracking: http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2653.htm

================
Comment at: clang/lib/AST/TemplateBase.cpp:77-80
+  const Decl *D = nullptr;
+  if (T->isTypedefNameType()) {
+    D = T->getAs<TypedefType>()->getDecl();
+  }
----------------
This approach seems fragile for getting access to an AST context. I think `TemplateArgument::print()` should likely be given an `const ASTContext &` parameter instead.

================
Comment at: clang/lib/AST/Type.cpp:1962

+bool Type::isType(const std::string TypeName) const {
+  QualType Desugar = this->getLocallyUnqualifiedSingleStepDesugaredType();
----------------
Oh, I see now that this is doing a name comparison against the type -- that's not a good API in general because it's *really* hard to guess at what the type will come out as textually. e.g., `class` and `struct` keywords are interchangeable in C++, C sometimes gets confused with `bool` vs `_Bool`, template arguments sometimes matter, nested name specifiers, etc. Callers of this API will have to guess at these details and the printing of the type may change over time (e.g., C may switch from `_Bool` to `bool` and then code calling `isType("_Bool")` may react poorly to the change).

I think we need to avoid this sort of API on `Type`.

================
Comment at: clang/lib/AST/Type.cpp:1980-2002
+bool Type::isChar16Type(const LangOptions &LangOpts) const {
   if (const auto *BT = dyn_cast<BuiltinType>(CanonicalType))
-    return BT->getKind() == BuiltinType::Char16;
-  return false;
-}
-
-bool Type::isChar32Type() const {
-  if (const auto *BT = dyn_cast<BuiltinType>(CanonicalType))
-    return BT->getKind() == BuiltinType::Char32;
-  return false;
-}
-
-/// Determine whether this type is any of the built-in character
-/// types.
-bool Type::isAnyCharacterType() const {
-  const auto *BT = dyn_cast<BuiltinType>(CanonicalType);
-  if (!BT) return false;
-  switch (BT->getKind()) {
-  default: return false;
-  case BuiltinType::Char_U:
-  case BuiltinType::UChar:
-  case BuiltinType::WChar_U:
-  case BuiltinType::Char8:
-  case BuiltinType::Char16:
-  case BuiltinType::Char32:
-  case BuiltinType::Char_S:
-  case BuiltinType::SChar:
-  case BuiltinType::WChar_S:
-    return true;
+    if (BT->getKind() == BuiltinType::Char16)
+      return true;
+  if (!LangOpts.CPlusPlus) {
+    return isType("char16_t");
   }
----------------
MarcusJohnson91 wrote:
> aaron.ballman wrote:
> > If I understand properly, one issue is that `char16_t` is defined by C to be *the same type* as `uint_least16_t`, which itself is a typedef to some other integer type. So we can't make `char16_t` be a distinct type in C, it has to be a typedef to a typedef to an integer type.
> > 
> > One concern I have about the approach in this patch is that it makes the type system a bit more convoluted. This type is *not* really a character type in C, it's a typedef type that playacts as a character type. I think this is a pretty fundamental change to the type system in the compiler in some ways because this query is no longer about the canonical type but about the semantics of how the type is expected to be used.
> > 
> > I'd definitely like to hear what @rsmith thinks of this approach.
> I see your point, but I'm not sure how else we could get it through the type checking system without doing it this way?
> 
> I tried to be careful to only allow the actual character typedefs through by making sure char16_t or char32_t is in the typedef chain, and only in C mode.
> I see your point, but I'm not sure how else we could get it through the type checking system without doing it this way?

I am thinking that rather than doing this work as part of the `Type` object, the extra work can be done from the format string checker. This keeps the confusion about types localized to the part of the code that needs to care about it, rather than pushing it onto all consumers of the type system.

================
Comment at: clang/lib/Lex/LiteralSupport.cpp:1883-1888
   if (SpellingPtr[0] == 'u' && SpellingPtr[1] == '8')
     SpellingPtr += 2;
-
-  assert(SpellingPtr[0] != 'L' && SpellingPtr[0] != 'u' &&
-         SpellingPtr[0] != 'U' && "Doesn't handle wide or utf strings yet");
+  
+  // Handle UTF-16 strings
+  if (SpellingPtr[0] == 'u' && SpellingPtr[1] != '8')
+    SpellingPtr += 1;
----------------

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D103426/new/

https://reviews.llvm.org/D103426