[PATCH] D103426: Clang: Extend format string checking to wprintf/wscanf

Sat Jul 17 17:46:03 PDT 2021

MarcusJohnson91 added inline comments.

================
Comment at: clang/include/clang/AST/Expr.h:1846-1871
+  std::string getStringAsChar() const {
+    assert(getCharByteWidth() == 1 &&
+           "This function is used in places that assume strings use char");
+    return std::string(getTrailingObjects<char>(), getTrailingObjects<char>() + getByteLength());
+  }
+                                  
+  std::u16string getStringAsChar16() const {
----------------
cor3ntin wrote:
> aaron.ballman wrote:
> > One potential issue to this is that in C++, these types are defined to be UTF-16 and UTF-32, whereas in C, that isn't always the case. Currently, Clang defines `__STDC_UTF_16__` and `__STDC_UTF_32__` on all targets, but we may need to add some sanity checks to catch if a target overrides that behavior. Currently, all the targets in Clang are keeping that promise, but I have no idea what shenanigans downstream users get up to and whether their targets remove the macro definition or define it to `0` instead of `1`.
> Is it possible that the host and target wchar_t have a different size here?
I've honestly been wondering how Clang handled that, in the codebase vs at runtime myself for a while.

================
Comment at: clang/include/clang/AST/FormatString.h:83
     AsMAllocate,  // for '%ms', GNU extension to scanf
+    AsUTF16,      // for '%l16(c|s)', soon to be standardized
+    AsUTF32,      // for '%l32(c|s)', soon to be standardized
----------------
aaron.ballman wrote:
> May want to drop the "soon to be standardized" given that the proposal hasn't been seen by WG14 yet. I think it's fine to say "Clang extension", though. More on the format specifier itself, below.
Done,

and here is the link to the document, I haven't heard any feedback?

http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2761.pdf

================
Comment at: clang/include/clang/AST/Type.h:1975
   bool isBooleanType() const;
+  bool isType(const std::string TypeName) const;
   bool isCharType() const;
----------------
aaron.ballman wrote:
> It's not clear from the header file what this function actually *does*. Presumably, everything that can be represented by a `Type` is a type, so I'd assume this returns `true` always. Or should this be a static function?
Yeah, could choose a better name just not sure what.

It takes a type name as the argument, and then it desugars the type one step at a time, and if it finds a match it returns true.

so, let's say we're in C++, and someone typedef'd char8_t to String.

This function will say yes, String is compatible with char8_t for example.

it's mostly for C mode's typedef's of char16_t and char32_t

================
Comment at: clang/lib/AST/Expr.cpp:1197
                                  unsigned *StartTokenByteOffset) const {
-  assert((getKind() == StringLiteral::Ascii ||
-          getKind() == StringLiteral::UTF8) &&
----------------
aaron.ballman wrote:
> I don't see changes that make this assertion valid to remove -- have I missed something?
Nope you didn't miss anything, I did.

this is a remnant from when I was trying to templatize all the format checking code instead of converting the format strings.

Restored the assert.

================
Comment at: clang/lib/AST/FormatString.cpp:235-240
+      if (LO.C2x && I + 1 != E && I[0] == '1' && I[1] == '6') {
+        ++I;
+        ++I;
+        lmKind = LengthModifier::AsUTF16;
+        break;
+      } else if (LO.C2x && I + 1 != E && I[0] == '3' && I[1] == '2') {
----------------
aaron.ballman wrote:
> I don't think this is a conforming extension to C -- lowercase length modifiers and conversion specifiers are reserved for the standard, and uppercase length modifiers and conversion specifiers are reserved for the implementation. `l16` starts with a lowercase letter, so it's reserved for the standard.
> 
> Note: WG14 has been considering extensions in a closely-related space (integer types rather than string or character types) that you may be interested in, if you're not already aware of it: http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2511.pdf. This proposal was not adopted, but did have strong sentiment for something along these lines.
Yeah, honestly not really sure what to do about the reservation, L is also used for wide characters and some integers too, and (U|u)(16|32) wasn't taken well by the community.

================
Comment at: clang/lib/AST/OSLog.cpp:212
+  } else if (Lit->isUTF16()) {
+    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> Convert;
+    std::u16string U16 = Lit->getStringAsChar16();
----------------
aaron.ballman wrote:
> MarcusJohnson91 wrote:
> > aaron.ballman wrote:
> > > cor3ntin wrote:
> > > > I'm not sure I have a better suggestion but `codecvt_utf8_utf16` is deprecated in C++17
> > > Good point -- this likely should be lifted into a common interface (as it appears several times in the patch) and make use of the existing LLVM UTF conversion functionality: https://github.com/intel/llvm/blob/sycl/llvm/include/llvm/Support/ConvertUTF.h
> > The ConvertUTF version you linked contains `convertUTF32toUTF8String` but the one in LLVM does not, what's the process for updating this?
> Oh, oops, I managed to grab a link from downstream and not LLVM, sorry about that! But the downstream and LLVM both have a method to perform the conversion https://github.com/llvm/llvm-project/blob/main/llvm/include/llvm/Support/ConvertUTF.h#L162, so you could add convenience functions to that file that wraps the API, as part of this patch.
I'll probably write a wrapper then

================
Comment at: clang/lib/AST/PrintfFormatString.cpp:254-257
+  if (ParseLengthModifier(FS, I, E, LO) && I == E) { // THIS IS WHERE IT FREAKS OUT
     // No more characters left?
     if (Warn)
+      printf("ParsePrintfSpecifier: Incomplete Specifier 8\n");
----------------
aaron.ballman wrote:
> Surprising comment and what looks to be some debugging code?
Removed

================
Comment at: clang/lib/AST/PrintfFormatString.cpp:635
       }
-      if (LM.getKind() == LengthModifier::AsWide)
-        return ArgType(ArgType::WCStrTy, "wchar_t *");
----------------
aaron.ballman wrote:
> It looks like this coverage got lost?
Restored

================
Comment at: clang/lib/AST/PrintfFormatString.cpp:844
   case BuiltinType::SChar:
+  case BuiltinType::Char8:
     LM.setKind(LengthModifier::AsChar);
----------------
aaron.ballman wrote:
> Is there a reason why you're adding this support for `char16_t` and `char32_t` but not `char8_t` aside from "C doesn't have `char8_t`?" If that's the sole reason,  you might be interested in tracking: http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2653.htm
Added char8_t

================
Comment at: clang/lib/AST/Type.cpp:1962

+bool Type::isType(const std::string TypeName) const {
+  QualType Desugar = this->getLocallyUnqualifiedSingleStepDesugaredType();
----------------
aaron.ballman wrote:
> Oh, I see now that this is doing a name comparison against the type -- that's not a good API in general because it's *really* hard to guess at what the type will come out as textually. e.g., `class` and `struct` keywords are interchangeable in C++, C sometimes gets confused with `bool` vs `_Bool`, template arguments sometimes matter, nested name specifiers, etc. Callers of this API will have to guess at these details and the printing of the type may change over time (e.g., C may switch from `_Bool` to `bool` and then code calling `isType("_Bool")` may react poorly to the change).
> 
> I think we need to avoid this sort of API on `Type`.
I see your point, I reverted the behavior back to doing the desugaring in just isChar16Type and isChar32Type

================
Comment at: clang/lib/AST/Type.cpp:1980-2002
+bool Type::isChar16Type(const LangOptions &LangOpts) const {
   if (const auto *BT = dyn_cast<BuiltinType>(CanonicalType))
-    return BT->getKind() == BuiltinType::Char16;
-  return false;
-}
-
-bool Type::isChar32Type() const {
-  if (const auto *BT = dyn_cast<BuiltinType>(CanonicalType))
-    return BT->getKind() == BuiltinType::Char32;
-  return false;
-}
-
-/// Determine whether this type is any of the built-in character
-/// types.
-bool Type::isAnyCharacterType() const {
-  const auto *BT = dyn_cast<BuiltinType>(CanonicalType);
-  if (!BT) return false;
-  switch (BT->getKind()) {
-  default: return false;
-  case BuiltinType::Char_U:
-  case BuiltinType::UChar:
-  case BuiltinType::WChar_U:
-  case BuiltinType::Char8:
-  case BuiltinType::Char16:
-  case BuiltinType::Char32:
-  case BuiltinType::Char_S:
-  case BuiltinType::SChar:
-  case BuiltinType::WChar_S:
-    return true;
+    if (BT->getKind() == BuiltinType::Char16)
+      return true;
+  if (!LangOpts.CPlusPlus) {
+    return isType("char16_t");
   }
----------------
aaron.ballman wrote:
> MarcusJohnson91 wrote:
> > aaron.ballman wrote:
> > > If I understand properly, one issue is that `char16_t` is defined by C to be *the same type* as `uint_least16_t`, which itself is a typedef to some other integer type. So we can't make `char16_t` be a distinct type in C, it has to be a typedef to a typedef to an integer type.
> > > 
> > > One concern I have about the approach in this patch is that it makes the type system a bit more convoluted. This type is *not* really a character type in C, it's a typedef type that playacts as a character type. I think this is a pretty fundamental change to the type system in the compiler in some ways because this query is no longer about the canonical type but about the semantics of how the type is expected to be used.
> > > 
> > > I'd definitely like to hear what @rsmith thinks of this approach.
> > I see your point, but I'm not sure how else we could get it through the type checking system without doing it this way?
> > 
> > I tried to be careful to only allow the actual character typedefs through by making sure char16_t or char32_t is in the typedef chain, and only in C mode.
> > I see your point, but I'm not sure how else we could get it through the type checking system without doing it this way?
> 
> I am thinking that rather than doing this work as part of the `Type` object, the extra work can be done from the format string checker. This keeps the confusion about types localized to the part of the code that needs to care about it, rather than pushing it onto all consumers of the type system.
Hmmmm, that's a good idea, but I'm not sure how to do that, let me think on it for a bit.

================
Comment at: clang/lib/Lex/LiteralSupport.cpp:1883-1888
   if (SpellingPtr[0] == 'u' && SpellingPtr[1] == '8')
     SpellingPtr += 2;
-
-  assert(SpellingPtr[0] != 'L' && SpellingPtr[0] != 'u' &&
-         SpellingPtr[0] != 'U' && "Doesn't handle wide or utf strings yet");
+  
+  // Handle UTF-16 strings
+  if (SpellingPtr[0] == 'u' && SpellingPtr[1] != '8')
+    SpellingPtr += 1;
----------------
aaron.ballman wrote:
> 
Did you have anything to say here?

================
Comment at: clang/lib/StaticAnalyzer/Checkers/CheckSecuritySyntaxOnly.cpp:791-808
+    
+    StringLiteral::StringKind Kind = FormatString->getKind();
+    
+    if (Kind == StringLiteral::Ascii || Kind == StringLiteral::UTF8) {
+      String = FormatString->getStringAsChar();
+    } else if (Kind == StringLiteral::UTF16) {
+      std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> Convert;
----------------
cor3ntin wrote:
> I think this code is present in multiple places in your PR, could it be deduplicated?
Yup, I've been working on that too; This was actually the first thing I started with.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D103426/new/

https://reviews.llvm.org/D103426