[PATCH] D74094: Reapply: [IRGen] Emit lifetime intrinsics around temporary aggregate argument allocas

Nick Desaulniers via Phabricator via cfe-commits cfe-commits at lists.llvm.org
Mon Sep 18 15:41:49 PDT 2023


nickdesaulniers added a comment.

(Sorry for the wall of text)

In D74094#4646297 <https://reviews.llvm.org/D74094#4646297>, @tellenbach wrote:

> No real comment on the issue itself but on the example as a former Eigen maintainer (sorry for the noise if that's all obvious for you):

Woah! Deus Ex Machina?

>   auto round (Tensor m) {
>       return (m + 0.5f).cast<int>().cast<float>();
>   }
>
> does not return a Tensor but an expression encoding the computation, which is evaluated during the assignment `const Tensor t3 = round(a.log().exp());` using an overloaded `operator=`. With "evaluation" I mean evaluation in the sense of performing the intended mathematical computation.
>
> Many things can go wrong when using `auto` in such cases, of which two seem to be relevant here:
>
> 1. Eigen can (this is an implementation detail(!)) decide to evaluate subexpressions into temporaries. The returned expression would then reference the result of such an evaluation beyond its lifetime.
> 2. If 1. does not happen, the unevaluated expression would reference its arguments. The immediate `0.5f` is copied since that's cheap, but the tensor would be referenced. Your example function `round` takes its parameter by-value and the returned expression would reference it. I'm unsure if the lifetime would be extended in this case (again, the exact details of C++ lifetime rules are not my area of expertise). I think if the lifetime is extended, ASAN complaining after applying this patch is wrong; if the lifetime is not extended, ASAN should complain and the example is wrong.
>
> Btw, these issues are so common that Eigen documents the recommendation to avoid using `auto` with Eigen types altogether: https://eigen.tuxfamily.org/dox/TopicPitfalls.html#title3

Thanks for the info; https://eigen.tuxfamily.org/dox/TopicPitfalls.html#title3 is quite helpful.  That helps explain why the return value was deduced to a `const Eigen::TensorConversionOp<....`. Sounds like this is some kind of lazy evaluation?
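
To make sure I understand the hazard, here's a minimal expression-template sketch (toy types of my own, not Eigen's actual implementation) of how `auto` ends up holding an unevaluated expression that references its operands:

  #include <array>

  struct Tensor { std::array<float, 4> v; };

  // Unevaluated "m + s": stores a reference to the Tensor operand.
  struct AddScalarExpr {
    const Tensor &m; // dangles once the referenced Tensor is destroyed
    float s;
    Tensor eval() const {
      Tensor r = m;
      for (float &x : r.v) x += s;
      return r;
    }
  };

  AddScalarExpr operator+(const Tensor &m, float s) { return {m, s}; }

  auto bad_round(Tensor m) {
    return m + 0.5f;          // returns an AddScalarExpr referencing the
                              // by-value parameter `m`, which is about to die
  }

  Tensor good_round(Tensor m) {
    return (m + 0.5f).eval(); // force evaluation while `m` is still alive
  }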

Of note, I modified the reduced test case to use a static function with an `auto` return type, but the code as originally written used a lambda!

  auto round = [](Tensor m) {
    return (m + 0.5f).cast<int>().cast<float>();
  };

So the original author was well-intentioned; generally you want to use `auto` rather than trying to spell out the full type of a lambda.  But the author may not have been aware of the advice or hazards described in https://eigen.tuxfamily.org/dox/TopicPitfalls.html#title3.  The fix we've committed is to spell out the lambda's return type with `->`:

  -auto round = [](Tensor m) {
  +auto round = [](Tensor m) -> Tensor {
     return (m + 0.5f).cast<int>().cast<float>();
   };

Which I think is in line with the advice in https://eigen.tuxfamily.org/dox/TopicPitfalls.html#title3 (i.e. force evaluation of the expression so that we don't return some sort of partially evaluated calculation containing references to temporaries).  So I think the code in question (that we reverted this over) was just wrong.

One thing that's problematic is that neither clang nor gcc does a great job with `-Wreturn-stack-address` (clang) or `-Wreturn-local-addr` (GCC): https://godbolt.org/z/6oq5nKnY7 (Clang only spots obviously incorrect examples; GCC's behavior depends on optimization level).
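
As a rough illustration of the gap (my characterization, not the exact godbolt cases above): the direct case is flagged, but a by-value aggregate that merely wraps a pointer to a local typically is not, even though it has the same dangling problem:

  struct View { const int *p; };

  const int *direct() {
    int local = 42;
    return &local;        // clang diagnoses this with -Wreturn-stack-address
  }

  View indirect() {
    int local = 42;
    return View{&local};  // same bug, but typically goes undiagnosed
  }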

I wonder if there's anything more we can do to help spot these at compile time during semantic analysis.  It seems like it could be especially beneficial for more complex types; the Eigen examples come to mind (could we catch those, for example?).

---

In D74094#4646954 <https://reviews.llvm.org/D74094#4646954>, @rjmccall wrote:

> Okay, thanks, I can see how that works, and I definitely would never have figured that out from just looking at this code.  The formal lifetime of a parameter of type `Tensor` is definitely in the land of implementation-defined behavior, so this code seems to be at best non-portable.

Right, so that was my first question here; it sounds like the optimization is correct wrt. the example that broke, though I still could not tell you precisely why.  Perhaps: "`round` returns an object, but one that contains references to temporary variables."

> Nick, maybe we can take a new look at this patch from that perspective.  You've been trying to use very tight lifetime bounds for these objects that only cover the duration of call, which essentially means you're defining the lifetime of parameter objects to just be the call rather than the full-expression (at least, when the parameter type doesn't have a destructor).  In the abstract, that's probably the right rule for Clang to adopt, because I think it matches programmer intuition (it's always wrong to return the address of a local, and that includes by-value parameters), and it means we'd have a consistent rule for all types and targets.  It's also a fairly aggressive rule that's likely to uncover a fair amount of code like this that's assuming longer lifetimes.  I think it's reasonable to say that these examples are all program bugs that should be fixed, but it might be better to pursue that as a *refinement* on top of an initially more conservative rule.  I suspect that that conservative rule will also give us a large fraction of the optimization benefit that we can expect from this change, because at least parameter objects from calls in different full-expressions will be able to share memory.  So we can start by giving these objects full-expression lifetime, chase down any program bugs that that uncovers (which will all *unquestionably* be program bugs under the standard), and then gradually work on landing the more aggressive rule (perhaps even for non-trivial types).

Let me see if I can picture how the lifetimes are expected to work, for my own sake.  Let's say we have functions `foo`, `bar`, and `baz` that all take a `struct widget` by value and return a `struct widget`.  Then let's say we have code like:

  struct widget { long x, y, z, w; };
  
  widget foo(widget);
  widget bar(widget);
  widget baz(widget);
  
  void quux () {
      widget w = foo(bar(baz({ .x = 42, })));
  }

With b7f4915644844fb9f32e8763922a070f5fe4fd29 <https://reviews.llvm.org/rGb7f4915644844fb9f32e8763922a070f5fe4fd29> reverted, we'd get:

  define dso_local void @_Z4quuxv() local_unnamed_addr #0 {
  entry:
    %w = alloca %struct.widget, align 8
    %agg.tmp = alloca %struct.widget, align 8
    %agg.tmp1 = alloca %struct.widget, align 8
    %agg.tmp2 = alloca %struct.widget, align 8
    call void @llvm.lifetime.start.p0(i64 32, ptr nonnull %w) #4
    call void @llvm.lifetime.start.p0(i64 32, ptr nonnull %agg.tmp) #4
    call void @llvm.lifetime.start.p0(i64 32, ptr nonnull %agg.tmp1) #4
    call void @llvm.lifetime.start.p0(i64 32, ptr nonnull %agg.tmp2) #4
    call void @llvm.memset.p0.i64(ptr noundef nonnull align 8 dereferenceable(32) %agg.tmp2, i8 0, i64 32, i1 false)
    store i64 42, ptr %agg.tmp2, align 8, !tbaa !5
    call void @_Z3baz6widget(ptr nonnull sret(%struct.widget) align 8 %agg.tmp1, ptr noundef nonnull byval(%struct.widget) align 8 %agg.tmp2)
    call void @llvm.lifetime.end.p0(i64 32, ptr nonnull %agg.tmp2) #4
    call void @_Z3bar6widget(ptr nonnull sret(%struct.widget) align 8 %agg.tmp, ptr noundef nonnull byval(%struct.widget) align 8 %agg.tmp1)
    call void @llvm.lifetime.end.p0(i64 32, ptr nonnull %agg.tmp1) #4
    call void @_Z3foo6widget(ptr nonnull sret(%struct.widget) align 8 %w, ptr noundef nonnull byval(%struct.widget) align 8 %agg.tmp)
    call void @llvm.lifetime.end.p0(i64 32, ptr nonnull %agg.tmp) #4
    call void @llvm.lifetime.end.p0(i64 32, ptr nonnull %w) #4
    ret void
  }

which //is// very aggressive; each temporary's lifetime is ended immediately after the call that consumes it, so it is dead before the next non-intrinsic `call`. This looks like it would be optimal.  IIUC, @rjmccall's proposal is instead to do something along the lines of:

  define dso_local void @_Z4quuxv() local_unnamed_addr #0 {
  entry:
    %w = alloca %struct.widget, align 8
    %agg.tmp = alloca %struct.widget, align 8
    %agg.tmp1 = alloca %struct.widget, align 8
    %agg.tmp2 = alloca %struct.widget, align 8
    call void @llvm.lifetime.start.p0(i64 32, ptr nonnull %w) #4
    call void @llvm.lifetime.start.p0(i64 32, ptr nonnull %agg.tmp) #4
    call void @llvm.lifetime.start.p0(i64 32, ptr nonnull %agg.tmp1) #4
    call void @llvm.lifetime.start.p0(i64 32, ptr nonnull %agg.tmp2) #4
    call void @llvm.memset.p0.i64(ptr noundef nonnull align 8 dereferenceable(32) %agg.tmp2, i8 0, i64 32, i1 false)
    store i64 42, ptr %agg.tmp2, align 8, !tbaa !5
    call void @_Z3baz6widget(ptr nonnull sret(%struct.widget) align 8 %agg.tmp1, ptr noundef nonnull byval(%struct.widget) align 8 %agg.tmp2)
    call void @_Z3bar6widget(ptr nonnull sret(%struct.widget) align 8 %agg.tmp, ptr noundef nonnull byval(%struct.widget) align 8 %agg.tmp1)
    call void @_Z3foo6widget(ptr nonnull sret(%struct.widget) align 8 %w, ptr noundef nonnull byval(%struct.widget) align 8 %agg.tmp)
    call void @llvm.lifetime.end.p0(i64 32, ptr nonnull %agg.tmp2) #4
    call void @llvm.lifetime.end.p0(i64 32, ptr nonnull %agg.tmp1) #4
    call void @llvm.lifetime.end.p0(i64 32, ptr nonnull %agg.tmp) #4
    call void @llvm.lifetime.end.p0(i64 32, ptr nonnull %w) #4
    ret void
  }

i.e. to have a less aggressive optimization where the lifetimes extend to the end of the full expression, so temporaries within the same full expression would not be able to share stack slots.  That's better than the status quo (no lifetime markers at all) but still suboptimal, and I don't think it will solve the particular case I care about (a particular Linux kernel driver, written in C, that passes many aggregates by value).

Do I want to create a flag for that?  Not particularly; it would mean supporting BOTH codegen patterns, and I'd rather have a flag that controls whether we do the optimization at all than support two different optimizations.  I'm also curious whether you'd expect the default value of the flag to be on (more aggressive) or off (less aggressive)?  I don't want to have to add `-Xclang` (or `-mllvm`) flags to my build system to opt into secret optimizations.  If folks find that this optimization breaks their incorrect code, they can use the already existing `-Xclang -disable-lifetime-markers` to opt out (though that disables more optimizations than just the ones introduced by this change).  I like that pattern better, since using those flags is itself a red flag to be scrutinized during code review.  Then again, GCC has `-fconserve-stack` and clang does not; we could put more aggressive stack-saving optimizations under that, though IME in GCC that flag limits inline substitution to a magic per-target-architecture number of bytes.  Or perhaps hide it under `-O3`?

At the least, a release note acknowledging that "you're not crazy; clang-18 is now more precise about minimizing the lifetimes of temporaries (since the C++ spec says it's implementation-defined) in order to reduce stack usage.  ASAN is your friend.  If the code works under `-Xclang -disable-lifetime-markers` then this change is likely the culprit." would be an improvement if we were to reland this.

Personally, I feel that "we found a buggy example of code that we broke, and we fixed it; let's reland this and tell people to be careful when upgrading."  But I haven't been around long enough to know what the precedent is here for "aggressive optimizations."  Do you feel strongly that we should not just reland this, @rjmccall (or anyone else)?

---

Converting the above `widget` example to C, I noticed that we're also missing the lifetime markers on compound literals!

  struct widget { long x, y, z, w; };
  
  struct widget foo(struct widget);
  struct widget bar(struct widget);
  struct widget baz(struct widget);
  
  void quux () {
      struct widget w = foo(bar(baz((struct widget){ .x = 42, })));
  }

which lowers to:

  define dso_local void @quux() local_unnamed_addr #0 {
  entry:
    %w = alloca %struct.widget, align 8
    %agg.tmp = alloca %struct.widget, align 8
    %agg.tmp1 = alloca %struct.widget, align 8
    %.compoundliteral = alloca %struct.widget, align 8    ; <--- NO LIFETIME MARKERS :(
    call void @llvm.lifetime.start.p0(i64 32, ptr nonnull %w) #4
    call void @llvm.lifetime.start.p0(i64 32, ptr nonnull %agg.tmp) #4
    call void @llvm.lifetime.start.p0(i64 32, ptr nonnull %agg.tmp1) #4
    call void @llvm.memset.p0.i64(ptr noundef nonnull align 8 dereferenceable(32) %.compoundliteral, i8 0, i64 32, i1 false)
    store i64 42, ptr %.compoundliteral, align 8, !tbaa !5
    call void @baz(ptr nonnull sret(%struct.widget) align 8 %agg.tmp1, ptr noundef nonnull byval(%struct.widget) align 8 %.compoundliteral) #4
    call void @bar(ptr nonnull sret(%struct.widget) align 8 %agg.tmp, ptr noundef nonnull byval(%struct.widget) align 8 %agg.tmp1) #4
    call void @llvm.lifetime.end.p0(i64 32, ptr nonnull %agg.tmp1) #4
    call void @foo(ptr nonnull sret(%struct.widget) align 8 %w, ptr noundef nonnull byval(%struct.widget) align 8 %agg.tmp) #4
    call void @llvm.lifetime.end.p0(i64 32, ptr nonnull %agg.tmp) #4
    call void @llvm.lifetime.end.p0(i64 32, ptr nonnull %w) #4
    ret void
  }

---

One thing I noticed while simply printing the expression when we were about to add the lifetime cleanup: some examples would trigger on `CXXConstructExpr`, `CXXTemporaryObjectExpr`, and `CXXOperatorCallExpr`.  Those feel very C++ specific; I do wonder if we could do something different for C, but I think we should not, since as @rjmccall said, "we'd have a consistent rule for all types and targets", which IMO is nicer and simpler.
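
For reference, those node kinds show up for argument expressions roughly like the following (a toy sketch from memory; the precise AST shape depends on language mode and copy elision, so take it as illustrative only):

  struct W {
    long x, y, z, w;
    W(long, long, long, long);
  };
  W operator+(const W &, const W &);
  void take(W);

  void examples(W a, W b) {
    take(a);              // copy into the parameter: CXXConstructExpr
    take(W(1, 2, 3, 4));  // explicit temporary: CXXTemporaryObjectExpr
    take(a + b);          // overloaded operator: CXXOperatorCallExpr
  }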

---

I tried to write a verifier check to detect when a use was dominated by a call to the lifetime end intrinsic.  My initial hypothesis was that we were somehow getting this wrong in codegen, which would lead to bugs; I no longer think that's what's going wrong here.
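
Something along these lines (a rough sketch of the idea, not the actual check I wrote; as shown below, it's too naive because lifetimes are allowed to restart):

  #include "llvm/IR/Dominators.h"
  #include "llvm/IR/Function.h"
  #include "llvm/IR/InstIterator.h"
  #include "llvm/IR/IntrinsicInst.h"
  using namespace llvm;

  // Flag any use of a pointer that is dominated by a lifetime.end on that
  // same pointer (ignoring the lifetime intrinsics themselves).
  static bool hasUseAfterLifetimeEnd(Function &F) {
    DominatorTree DT(F);
    for (Instruction &I : instructions(F)) {
      auto *II = dyn_cast<IntrinsicInst>(&I);
      if (!II || II->getIntrinsicID() != Intrinsic::lifetime_end)
        continue;
      Value *Ptr = II->getArgOperand(1); // operands are (size, ptr)
      for (User *U : Ptr->users()) {
        auto *UI = dyn_cast<Instruction>(U);
        if (!UI || UI->isLifetimeStartOrEnd())
          continue;
        if (DT.dominates(II, UI)) // every path to the use passes the end
          return true;
      }
    }
    return false;
  }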

This failed on multiple examples in-tree; llvm/test/CodeGen/AArch64/stack-tagging.ll has some examples that surprised me, such as values whose lifetimes restart after being ended:

  call void @llvm.lifetime.start.p0(i64 4, ptr nonnull %x)
  call void @use8(ptr %x) #3
  call void @llvm.lifetime.end.p0(i64 4, ptr nonnull %x)
  
  call void @llvm.lifetime.start.p0(i64 4, ptr nonnull %x)
  call void @use8(ptr %x) #3
  call void @llvm.lifetime.end.p0(i64 4, ptr nonnull %x)

which makes it so that we can't simply use a single dominance check to find uses after the lifetime has ended (though I guess this is just a kind of bounds check; maybe if the use is additionally dominated by a lifetime start, and that start is itself dominated by the end, then we're "resurrected").  Perhaps that's a separate discussion orthogonal to this bug, but after reading

- https://llvm.org/docs/LangRef.html#llvm-lifetime-start-intrinsic
- https://llvm.org/docs/LangRef.html#llvm-lifetime-end-intrinsic

another question unresolved in my mind is: "why do we even need intrinsics for lifetimes? why can't we use the first use as the implicit lifetime start, and the final use as the implicit lifetime end?"  I'm certain this is a naive question that probably depends on the semantics of a specific language frontend (or even the ABI; perhaps that's why the C++ spec says "implementation defined."  I can't recall where I saw this, but for some reason I suspect the ABI used on Microsoft platforms differs here?).  I also don't understand under what circumstances something that has been marked dead can later be "brought back to life"/resurrected.  Why is that a thing?
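
My current (possibly wrong) mental model for part of the answer: once an address escapes, "last use" isn't something the optimizer can see, so without explicit markers it has to assume the slot stays live.  For example:

  void sink(char *);

  void f(bool cond) {
    if (cond) {
      char a[4096];
      sink(a); // address escapes; any later uses happen behind sink()
    } else {
      char b[4096];
      sink(b);
    }
    // With lifetime markers scoped to each branch, `a` and `b` can share one
    // stack slot; without them, the escaped addresses force the conservative
    // assumption that either could still be live here.
  }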


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D74094/new/

https://reviews.llvm.org/D74094


