[PATCH] D95514: [clang][cli] Teach CompilerInvocation to allocate strings on its own

Duncan P. N. Exon Smith via Phabricator via cfe-commits cfe-commits at lists.llvm.org
Wed Jan 27 11:52:09 PST 2021


dexonsmith added a comment.

In D95514#2525517 <https://reviews.llvm.org/D95514#2525517>, @jansvoboda11 wrote:

> In D95514#2525255 <https://reviews.llvm.org/D95514#2525255>, @dexonsmith wrote:
>
>> Can you include the use in the patch?
>> I also wonder what command-line option is forcing this.
>
> The use-case that requires it is here: https://github.com/llvm/llvm-project/blob/f3449ed6073cac58efd9b62d0eb285affa650238/clang/lib/Frontend/CompilerInvocation.cpp#L734.
> Clang accepts analyzer arguments in form of `-analyzer-config key=value`. Reference to the `key` sub-string is stored as an index in a `StringMap` (`AnalyzerOptions::Config`) and values are held by `std::string`. More entries get stored into the map in `parseAnalyzerConfigs`.
>
>> String arguments could reuse the original char*.
>
> Interesting idea. We could avoid allocation (and the need for owning strings) by grabbing the map key substring, scanning past its end until we hit `\0` and putting the whole thing into generated arguments. Keys in the `StringMap` that are not followed by `=value` most likely only come from `parseAnalyzerConfigs`, but we don't need to generate those.
> This could work, but I'm worried it's somewhat brittle.

I think (?) running the tests with ASan on would catch any bugs, and there are bots that do that. Might be worth double-checking though.

You'd also want to `assert(!Arg.end()[0] && "reused command-line arg not null-terminated?")`...

>> It seems better to drop frivolous `=` than carrying this around.
>
> Cool, this could save us allocations in some cases. Not sure dropping `=` between key-value pairs would result in good user experience though.

Well, the compiler developers are the users, since `-cc1` isn't for end-users. I acknowledge it's a tradeoff. IMO, without actually seeing it in practice, it seems a bit nicer to design this away entirely. But maybe having a `=` is more valuable than I think.

Here's another option. Assume we have a broken-up version of `UniqueStringSaver` that let's us pre-load the existing strings and move away the new ones for later storage, something like:

  BumpPtrAllocator Storage;
  DenseSet<StringRef> ExistingStrings;
  StringSaver NewStrings(Storage);
  
  auto saveString = [&](const Twine &T) {
    SmallVector<char, 128> Storage;
    StringRef S = T.toStringRef(Storage);
    auto I = ExistingStrings.find(S);
    if (I != ExistingStrings.end())
      return *I;
    S = NewStrings.save(S);
    ExistingStrings.insert(S);
    return S;
  };
  
  Strings.ExistingStrings.reserve(CommandLineArgs.size());
  for (StringRef Arg : CommandLineArgs)
    Strings.ExistingStrings.insert(Arg);

For string options:

- If `-cc1` accepts only `-opt string`, no need for allocation.
- If `-cc1` accepts only `-opt=string`, the fully-spelled option is on the original command-line. `ExistingStrings.find()` should turn it up when calling `Strings.save()`.
- If `-cc1` accepts both `-opt=string` and `-opt string` canonicalize to `-opt string` when generating. No need for allocation.

For numeric / enum options we'll need to allocate, but the allocation doesn't have to live past running the parser.

The only problem would be if `string` gets canonicalized / rewritten somehow during parse + generate, or if an enum value is additionally saved as a string value.

>> I’m not sure having the AllocateString API totally makes sense, vs moving in a string pool that was independently used during generation, but maybe in context?
>
> What exactly do you mean here?

The intent in this patch seemed to be to call `CompilerInvocation::AllocateString` when generating arguments. My intuition would be to continue using the StringMap we have now (or `StringSaver`, or whatever), and then `std::move()` that into a `CompilerInvocation` that's parsed from it. In which case there's no need for `CompilerInvocation::AllocateString`.

In fact, it might be even better if it was type-erased, just a payload that had to be kept alive...

  class CompilerInvocation {
    class RoundTrippedCommandLineStorage {
      std::shared_ptr<void> Storage;
    public:
      void store(BumpPtrAllocator &&Strings) {
        assert(!Storage);
        Storage = std::make_shared<BumpPtrAllocator>(std::move(Strings));
      }
    } RoundTrip;
  };



>> Consider that some callers may parse the full command-line and then drop everything but one options class. That pattern wouldn’t work anymore since the StringRefs won’t be valid.
>
> Ah, that's right. So if we don't go with the allocation-less approach, we'd need to take `StringAllocator` from the client that also ensures proper lifetimes.

Maybe it wouldn't be terrible to have a `RoundTrip` arg on each options class. When you assign to the CompilerInvocation, it copies the shared ptr into each of the options classes, so that the pool stays alive as long as one of them has it.



================
Comment at: clang/include/clang/Frontend/CompilerInvocation.h:144
+  /// Container holding strings allocated during round-trip.
+  mutable std::forward_list<std::string> AllocatedStrings;
+
----------------
jansvoboda11 wrote:
> dexonsmith wrote:
> > I suggest using llvm::StringPool instead, which also uniques the strings and is used elsewhere for this kind of thing. Probably in an Optional to make it more clear it’s not always used.
> > 
> > Maybe the name RoundTrippedStrings? 
> From the documentation of `StringPool`, it seems like we would need to store `PooledStringPtr`s in order to keep them alive. How would that work with the `const char *` API we have?
Oops; I meant `StringSet<>` / `StringMap<NoneType>` / etc., which is what I've always used in the past. This is essentially a `StringMap<void>`; I just always call the variable `StringPool`, hence my error. It has the properties you need (stable, kept alive indefinitely, null-terminated).

Looking around, there are a couple of options designed for this:
- `StringSaver`, which is an optimized version of your list (uses a bumpptrallocator, no uniquing).
- `UniqueStringSaver`, which is essentially equivalent to `StringSet<>`; might be slightly more optimized, and if it's not, we should fix it.

I suggest using one of those since they exactly match what you need.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D95514/new/

https://reviews.llvm.org/D95514



More information about the cfe-commits mailing list