[PATCH] D124221: Reimplement `__builtin_dump_struct` in Sema.

Aaron Ballman via Phabricator via cfe-commits cfe-commits at lists.llvm.org
Thu May 5 04:45:01 PDT 2022


aaron.ballman accepted this revision.
aaron.ballman added a comment.
This revision is now accepted and ready to land.

Aside from a minor nit, I think this LGTM. Thank you for working on this!



================
Comment at: clang/lib/Sema/SemaChecking.cpp:476
+    analyze_printf::PrintfSpecifier Specifier;
+    if (Specifier.fixType(T, S.getLangOpts(), S.Context, /*IsObjCLiteral=*/false)) {
+      // We were able to guess how to format this.
----------------
Note for future reference: I don't think `fixType()` does anything special to help us out with arrays.


================
Comment at: clang/lib/Sema/SemaChecking.cpp:507
+                       unsigned Depth) {
+    // FIXME: Decide what to do if RD is a union. At least we should probably
+    // turn off printing `const char*` members with `%s`, because that is very
----------------
My suggestion is to print the value of all fields and treat any string-like types as `%p` rather than `%s`. I'm fine with this being done later.


================
Comment at: clang/lib/Sema/SemaChecking.cpp:542
+      auto *FD = IFD ? IFD->getAnonField() : dyn_cast<FieldDecl>(D);
+      if (!FD || (FD->isUnnamedBitfield() || FD->isAnonymousStructOrUnion()))
+        continue;
----------------



================
Comment at: clang/lib/Sema/SemaChecking.cpp:433
+
+  llvm::StringRef getFormatSpecifier(QualType T) {
+    if (auto *BT = T->getAs<BuiltinType>()) {
----------------
yihanaa wrote:
> yihanaa wrote:
> > erichkeane wrote:
> > > rsmith wrote:
> > > > yihanaa wrote:
> > > > > rsmith wrote:
> > > > > > rsmith wrote:
> > > > > > > aaron.ballman wrote:
> > > > > > > > yihanaa wrote:
> > > > > > > > > I think this is better maintained in "clang/AST/FormatString.h". For example analyze_printf::PrintfSpecifier can get format specifier, what do you all think about?
> > > > > > > > +1 to this suggestion -- my hope is that we could generalize it more then as I notice there are missing specifiers for things like intmax_t, size_t, ptrdiff_t, _Decimal types, etc. Plus, that will hopefully make it easier for __builtin_dump_struct to benefit when new format specifiers are added, such as ones for printing a _BitInt.
> > > > > > > I am somewhat uncertain: every one of these is making arbitrary choices about how to format the value, so it's not clear to me that this is general logic rather than something specific to `__builtin_dump_struct`. For example, using `%f` rather than one of the myriad other ways of formatting `double` is somewhat arbitrary. Using `%s` for any `const char*` is *extremely* arbitrary and will be wrong and will cause crashes in some cases, but it may be the pragmatically correct choice for a dumping utility. A general-purpose mechanism would use `%p` for all kinds of pointer.
> > > > > > > 
> > > > > > > We could perhaps factor out the formatting for cases where there is a clear canonical default formatting, such as for integer types and probably `%p` for pointers, then call that from here with some special-casing, but without a second consumer for that functionality it's really not clear to me what form it should take.
> > > > > > I went ahead and did this, mostly to match concurrent changes to the old implementation. There are a few cases where our existing "guess a format specifier" logic does the wrong thing for dumping purposes, which I've explicitly handled -- things like wanting to dump a `char` / `signed char` / `unsigned char` member as a number rather than as a (potentially non-printable or whitespace) character.
> > > > >  When I was patching that old implementation, I found that for uint8_t, int8_t, Clang's existing "guess a format specifier" logic would treat the value as an integer, but for unsigned char, signed char, char types, it would Treat it as a character, please look at this example ( https://godbolt.org/z/ooqn4468T ), I guess this existing logic may have made some special judgment.
> > > > Yeah. I think in the case where we see some random call to `printf`, `%c` is probably the right thing to guess here, but it doesn't seem appropriate to me to use this in a dumping routine. If we could dump as `'x'` for printable characters and as `'\xAB'` for everything else, that'd be great, but `printf` can't do that itself and I'm not sure we want to be injecting calls to `isprint` or whatever to make that work. Dumping as an integer always seems like it's probably the least-bad option.
> > > > 
> > > > Somewhat related: I wonder if we should use `"\"%s\""` instead of simply `"%s"` when dumping a `const char*`. That's not ideal but probably clearer than the current dump output.
> > > I see value to having strings with SOME level of delimiter, if at least to handle cases when the string itself has a newline in it.
> > This looks good, but the string may need to be escaped (if there are some special characters in the original string)
> Maybe users should wrap the printf function, and then pass the wrapper function as a parameter, so that they can do some custom things
> Somewhat related: I wonder if we should use "\"%s\"" instead of simply "%s" when dumping a const char*. That's not ideal but probably clearer than the current dump output.

I'd be in favor of that. (I sort of wonder whether we do want to inject calls to `isprint()` though -- it seems like escaping nonprintable characters would be the kindest option for both characters and strings.)


================
Comment at: clang/lib/Sema/SemaChecking.cpp:573
+          // We don't know how to print this field. Print out its address
+          // with a format specifier that a smart tool will be able to
+          // recognize and treat specially.
----------------
rsmith wrote:
> aaron.ballman wrote:
> > rsmith wrote:
> > > yihanaa wrote:
> > > > Can we generate a warning diagnostic message when we see an unsupported type (perhaps temporarily not supported), what do you all think about?
> > > We could, but I'm leaning towards thinking that we shouldn't warn by default at least. That would get in the way of the C++ use case where the type actually is supported by the printing function; I'm not sure that we can easily tell those cases apart. And it's probably not all that useful to someone who just wants whatever information we can give them while debugging. I'd be OK with adding a warning that's off by default, though. What would be the intended use case here? Would an off-by-default warning satisfy it?
> > So long as we don't trigger UB, I think it's fine to not diagnose for unsupported types. But the UB cases do strike me as something we might want to warn about if we think they can be hit (specifically, these kinds of UB: https://eel.is/c++draft/cstdarg.syn#1).
> > 
> > As an example, do we properly handle dumping this case:
> > ```
> > struct S {
> >   int &i;
> >   S(int &ref) : i(ref) {}
> > };
> > ```
> > where the vararg argument is of reference type? I suppose another interesting question is:
> > ```
> > struct S {
> >   int i;
> > };
> > 
> > void func() {
> >   S s;
> >   [=]() {
> >     __builtin_dump_struct(&s, ::printf);
> >   };
> > }
> > ```
> > (is S.i "an entity resulting from a lambda capture"?)
> The only things we pass to the formatting function are:
> * The small number of builtin types for which we have known formatting specifiers.
> * Pointers to `void`, using `%p`.
> * Pointers to `[const] char`, using `%s`.
> * Pointers to user-defined types, using `*%p`.
> All of these can be passed through a `...` without problems.
> 
> We can potentially introduce the following sources of UB:
> * Indirection through a struct pointer that doesn't point to a struct object.
> * Read of an inactive union member.
> * Read of an uninitialized member.
> * Printing something other than a pointer to a nul-terminated byte sequence with `%s`.
> I'm not sure how much we should be worrying about those. It'd be nice to avoid them, though I don't think we can fix the fourth issue without giving up the `%s` formatting entirely.
> 
> Reference types aren't on our list of types with known formatting specifiers, so in your first example we'd call `printf("%s%s = *%p", "int &", "i", &p->i)`, producing output like `int & i = *0x1234abcd`. We could special-case that to also dereference the reference and include the value (`int & i = *0x1234abcd (5)`, but for now I'm treating it like any other type for which we don't have a special formatting rule. I've added test coverage for this case.
> 
> In your second example, we rewrite into `::printf("%s%s = %s", "int", "i", (&s)->i)`, which is fine. The `parmN` restriction is about calls to `va_start`:
> ```
> void f(int a, ...) {
>   [&] {
>     va_list va;
>     va_start(va, a); // error, `a` is captured here
>     // ...
>   };
> }
> ```
Thank you for the explanation, it sounds like we're doing about as good as we can get here in terms of safety.

FWIW, I like the idea of outputting the value for a reference type, but am fine punting on that for now (it could get awkward to format when the referenced type is a structure, for example).


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D124221/new/

https://reviews.llvm.org/D124221



More information about the cfe-commits mailing list