[PATCH] D104870: [SimplifyCFG] Tail-merging all blocks with `unreachable` terminator

Roman Lebedev via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Thu Jul 1 12:01:00 PDT 2021


lebedev.ri marked an inline comment as done.
lebedev.ri added a comment.

Tried to update the description to talk about the motivation.



================
Comment at: llvm/test/Transforms/SimplifyCFG/tail-merge-noreturn.ll:129-130
 ; CHECK:       a2:
 ; CHECK-NEXT:    call void @assert_fail_1_alt(i32 0)
 ; CHECK-NEXT:    unreachable
 ;
----------------
nikic wrote:
> lebedev.ri wrote:
> > rnk wrote:
> > > lebedev.ri wrote:
> > > > lebedev.ri wrote:
> > > > > lebedev.ri wrote:
> > > > > > nikic wrote:
> > > > > > > lebedev.ri wrote:
> > > > > > > > aeubanks wrote:
> > > > > > > > > lebedev.ri wrote:
> > > > > > > > > > rnk wrote:
> > > > > > > > > > > lebedev.ri wrote:
> > > > > > > > > > > > rnk wrote:
> > > > > > > > > > > > > lebedev.ri wrote:
> > > > > > > > > > > > > > rnk wrote:
> > > > > > > > > > > > > > > lebedev.ri wrote:
> > > > > > > > > > > > > > > > rnk wrote:
> > > > > > > > > > > > > > > > > I expected your code to fire on this test case. Can you explain why this example isn't getting tail merged?
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > Consider this example: https://gcc.godbolt.org/z/ox16a9P1z
> > > > > > > > > > > > > > > > > ```
> > > > > > > > > > > > > > > > > [[noreturn]] void abort1();
> > > > > > > > > > > > > > > > > [[noreturn]] void abort2();
> > > > > > > > > > > > > > > > > [[noreturn]] void abort3();
> > > > > > > > > > > > > > > > > bool cond();
> > > > > > > > > > > > > > > > > void doAsserts() {
> > > > > > > > > > > > > > > > >     if (cond()) abort1();
> > > > > > > > > > > > > > > > >     if (cond()) abort2();
> > > > > > > > > > > > > > > > >     if (cond()) abort3();
> > > > > > > > > > > > > > > > > }
> > > > > > > > > > > > > > > > > ```
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > I think it is more canonical to leave these unreachable terminators in place after the calls to noreturn functions, rather than merging the unreachables together.
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > I just want to make sure your transform isn't firing, creating BBs, and then a later part of simplifycfg rolls the unreachables back up into place after the calls.
> > > > > > > > > > > > > > > > > I expected your code to fire on this test case. Can you explain why this example isn't getting tail merged?
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > It fired, we didn't sink anything, and `SimplifyCFGOpt::simplifyUnreachable()` decided to undo it.
> > > > > > > > > > > > > > > Got it, and we want to avoid that because otherwise it will make the overall pass return true to indicate that it changed something, which will make the parent pass manager re-run more passes.
> > > > > > > > > > > > > > True.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > As far as I'm aware, that only results in potentially invalidating analyses;
> > > > > > > > > > > > > > I'm not aware of it triggering additional optimization pass runs.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > IIRC that part of `SimplifyCFGOpt::simplifyUnreachable()` is a pretty important canonicalization,
> > > > > > > > > > > > > > because e.g. instcombine can't modify cfg.
> > > > > > > > > > > > > Even if it only invalidates analyses, I think this is worth addressing before landing this. Ideally this code would directly call the heuristic that "sink from common predecessors" uses, but if that isn't available, I think you could approximate it by not merging unreachable terminators when the previous non-debug instruction is a noreturn call with distinct callees. We know that is unprofitable, and that accounts for most blocks ending in unreachable. It saves compile time from IR churn too.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > @aeubanks, what are the consequences of passes indicating that they changed the IR when they actually didn't?
> > > > > > > > > > > > This is impossible to address.
> > > > > > > > > > > > 
> > > > > > > > > > > > > I think you could approximate it by not merging unreachable terminators when the previous non-debug instruction is a noreturn call with distinct callees.
> > > > > > > > > > > > 
> > > > > > > > > > > > I can not, because fixing lack of sinking in such cases is basically the very next step here.
> > > > > > > > > > > > I can not, because fixing lack of sinking in such cases is basically the very next step here.
> > > > > > > > > > > 
> > > > > > > > > > > This is the transform I'm talking about avoiding, and I don't think we plan to do this in the next step:
> > > > > > > > > > > ```
> > > > > > > > > > > bb1:
> > > > > > > > > > >   call void @abort1()
> > > > > > > > > > >   unreachable
> > > > > > > > > > > bb2:
> > > > > > > > > > >   call void @abort2()
> > > > > > > > > > >   unreachable
> > > > > > > > > > > ```
> > > > > > > > > > > ->
> > > > > > > > > > > ```
> > > > > > > > > > > bb1:
> > > > > > > > > > >   br label %common
> > > > > > > > > > > bb2:
> > > > > > > > > > >   br label %common
> > > > > > > > > > > common:
> > > > > > > > > > >   %callee = phi ... @abort1 ... @abort2
> > > > > > > > > > >   call void ... %callee()
> > > > > > > > > > >   unreachable
> > > > > > > > > > > ```
> > > > > > > > > > > Right? This would make direct calls indirect, which is less canonical. I doubt this is going to change soon.
> > > > > > > > > > > 
> > > > > > > > > > > > This is impossible to address.
> > > > > > > > > > > 
> > > > > > > > > > > I guess what you are saying is that this isn't possible to implement with the current code and data structures. We'd need to incorporate the structure of the instruction before unreachable into the map. This makes me think maybe it would be better to extend the SinkFromCommonPredecessors logic to consider blocks ending in unreachable. The current code is essentially restructuring the CFG in a way that is convenient for that function. I think transforms should avoid changing the IR before they know if transformation is really profitable, and it seems like the profitability heuristic is over there.
> > > > > > > > > > > Right? This would make direct calls indirect, which is less canonical. I doubt this is going to change soon.
> > > > > > > > > > 
> > > > > > > > > > Right. This isn't going to change.
> > > > > > > > > > However, consider
> > > > > > > > > > ```
> > > > > > > > > > bb1:
> > > > > > > > > >   call void @abort1()
> > > > > > > > > >   unreachable
> > > > > > > > > > bb2:
> > > > > > > > > >   call void @abort2()
> > > > > > > > > >   unreachable
> > > > > > > > > > bb3:
> > > > > > > > > >   call void @abort2()
> > > > > > > > > >   unreachable
> > > > > > > > > > ```
> > > > > > > > > > ->
> > > > > > > > > > ```
> > > > > > > > > > bb1:
> > > > > > > > > >   call void @abort1()
> > > > > > > > > >   unreachable
> > > > > > > > > > bb2:
> > > > > > > > > >   br label %bb2.bb3.common
> > > > > > > > > > bb2.bb3.common:
> > > > > > > > > >   call void @abort2()
> > > > > > > > > >   unreachable
> > > > > > > > > > ```
> > > > > > > > > > 
> > > > > > > > > > Also, consider:
> > > > > > > > > > ```
> > > > > > > > > > bb1:
> > > > > > > > > >   call void @abort1()
> > > > > > > > > >   unreachable
> > > > > > > > > > bb2:
> > > > > > > > > >   call void @abort1()
> > > > > > > > > >   br label %bb3
> > > > > > > > > > bb3:
> > > > > > > > > >   unreachable
> > > > > > > > > > ```
> > > > > > > > > > 
> > > > > > > > > > My point being, we can't realistically say that we will/won't succeed in sinking stuff.
> > > > > > > > > there's no correctness issue with saying that we modified the IR when we actually didn't; it'll just invalidate analyses, causing more work when they are recomputed by later passes
> > > > > > > > > 
> > > > > > > > > might be worth putting this through http://llvm-compile-time-tracker.com/
> > > > > > > > Since you asked, sure: https://llvm-compile-time-tracker.com/compare.php?from=1f169a774cb865659cefe085e70a56a884e3711e&to=fc54bb9a8ef85bd76dd9e934b2546f4beadc5b5e&stat=instructions
> > > > > > > > I'm not sure what this tells us here.
> > > > > > > > Since the instruction stat correlates with the size changes, I guess we could say that it led to more inlining,
> > > > > > > > and more IR to chew through, which is pretty much the expected outcome.
> > > > > > > This shows a 10% increase in code size on mafft with LTO, and a few others also increase by multiple percentage points. Did you rerun @rnk's test on clang Release+Assert code size with this patch?
> > > > > > > 
> > > > > > > It looks like large code size increases are still the blocker for this patch, as they were back then.
> > > > > > > This shows a 10% increase in code size on mafft with LTO and a few others also increase by multiple percent points.
> > > > > > 
> > > > > > Yep.
> > > > > > 
> > > > > > > Did you rerun @rnk's test on clang Release+Assert code size with this patch?
> > > > > > 
> > > > > > I have not, because there is no reason to expect the outcome to be different.
> > > > > > (It will be somewhat different, because the approach is somewhat different.)
> > > > > > 
> > > > > > > It looks like large code size increases are still the blocker for this patch, as they were back then.
> > > > > > 
> > > > > > I'm not quite sure how we arrive at this conclusion.
> > > > > > 
> > > > > > Let me make a comparison: when one paints something,
> > > > > > it is expected not only that the thing gets colored,
> > > > > > but also that some paint gets used up.
> > > > > > 
> > > > > > What I'm saying is that the effect this has is not unexpected; on the contrary, it is expected.
> > > > > > We successfully decrease the amount of IR bloat from assertion blocks,
> > > > > > decreasing the size of the functions they are in,
> > > > > > and naturally that makes some of them more eligible for inlining,
> > > > > > which happens, and increases code size.
> > > > > > 
> > > > > > I would also like to call back to the discussion in D101468,
> > > > > > where we had very much the same discussion; there I actually argued that it was bad,
> > > > > > but @nikic argued that said change was good since we no longer overestimate the inlining cost,
> > > > > > and that if that leads to an overestimation, then the problem is in the inliner.
> > > > > > I'm not sure how in this patch the views changed to diametrically opposite ones :)
> > > > > Forgot to mention: `"There are three kinds of lies: lies, damned lies, and statistics."`
> > > > > The n% increase in code size is a pretty meaningless number,
> > > > > and I'm sad to see it being used in such a harsh blocking manner.
> > > > > What we should at least do is look at how it compares with assert-less code.
> > > > > 
> > > > > I don't yet have clang numbers, but here are some for RawSpeed:
> > > > > ```
> > > > > $ stat --printf="%s %n\n"  build-release-*/src/utilities/rsbench/rsbench | sort
> > > > > 17188264 build-release-new/src/utilities/rsbench/rsbench
> > > > > 17234640 build-release-old/src/utilities/rsbench/rsbench
> > > > > 17464336 build-release-with-asserts-new/src/utilities/rsbench/rsbench
> > > > > 17508840 build-release-with-asserts-old/src/utilities/rsbench/rsbench
> > > > > ```
> > > > > I.e. `-DNDEBUG`->`-UNDEBUG` is a +1.6% increase,
> > > > > while `old`->`new` (i.e. this patch) is a 0.25% decrease.
> > > > > 
> > > > > Let me get these numbers for clang...
> > > > And here are the clang numbers.
> > > > ```
> > > > $ stat --printf="%s %n\n" build-release-*/bin/clang-13 | sort
> > > > 103032240 build-release-old/bin/clang-13
> > > > 103046136 build-release-new/bin/clang-13
> > > > 123732704 build-release-with-asserts-old/bin/clang-13
> > > > 123882984 build-release-with-asserts-new/bin/clang-13
> > > > ```
> > > > I.e. `-DNDEBUG`->`-UNDEBUG` is a +20% increase,
> > > > while old->new (i.e. this patch) causes a +0.1% increase for the assert-ful build, and +0.01% for the assert-less one.
> > > > 
> > > > But what this really tells us is that the numbers will vary depending on the underlying libc implementation.
> > > > The thing is, glibc's `__assert_fail()` has 4 arguments (the stringified assertion, filename, line, function),
> > > > and in the worst-case scenario we'll need a PHI for each one of them, yet currently the profitability check
> > > > only allows a single PHI.
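> > > >
> > > > Roughly, such a worst case looks like this (a hand-written sketch, not output from this patch; the globals, line numbers and names are placeholders):
> > > > ```
> > > > ; assertions from two unrelated source locations: every argument differs,
> > > > ; so sinking the calls into a common block would need 4 PHIs
> > > > bb1:
> > > >   call void @__assert_fail(i8* @.msg1, i8* @.file1, i32 10, i8* @.fn1)
> > > >   unreachable
> > > > bb2:
> > > >   call void @__assert_fail(i8* @.msg2, i8* @.file2, i32 99, i8* @.fn2)
> > > >   unreachable
> > > > ```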
> > > > 
> > > > So to reproduce @rnk's numbers, he'd have to redo the test on whatever platform was used originally.
> > > > 
> > > > Another way to spell this: the regression will appear later, when the profitability check is tuned :)
> > > My reading of the llvm compile time tracker results is that this patch may result in a modest (~1% or less) compile time increase. That could be in the noise. The cost may mostly come from the analysis invalidation, which could be avoided if this were implemented in SinkCodeFromCommonPredecessors. I don't consider this a hard blocker if that is onerous.
> > > 
> > > > The n% increase in code size is a pretty meaningless number,
> > > > and i'm sad to see it being used in such a harsh blocking manner.
> > > 
> > > I think it is more constructive to think of the engagement here as early feedback. People will notice code size increases and provide feedback, it's just a question of when. Reviewers are trying to be helpful.
> > > 
> > > W.r.t. `__assert_fail`, consider that CodeGen tail duplication may ultimately re-duplicate all the calls to `__assert_fail`. If that's the case, we shouldn't do this transform: it would throw away source location information for no gain. The more phis we create, the more likely it is to trigger tail duplication in the backend.
> > > 
> > > However, there are many common calls to `__assert_fail` from things like the llvm::cast template. These are typically inlined and can be tail merged with one phi for the message; the file, line, and function should all be the same.
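> > >
> > > Concretely, that profitable case looks something like this (a hand-written sketch, not actual pass output; the globals and line number are placeholders):
> > > ```
> > > bb1:
> > >   call void @__assert_fail(i8* @.msg.a, i8* @.file, i32 42, i8* @.func)
> > >   unreachable
> > > bb2:
> > >   call void @__assert_fail(i8* @.msg.b, i8* @.file, i32 42, i8* @.func)
> > >   unreachable
> > > ```
> > > ->
> > > ```
> > > bb1:
> > >   br label %common
> > > bb2:
> > >   br label %common
> > > common:
> > >   ; only the message differs, so one phi is enough
> > >   %msg = phi i8* [ @.msg.a, %bb1 ], [ @.msg.b, %bb2 ]
> > >   call void @__assert_fail(i8* %msg, i8* @.file, i32 42, i8* @.func)
> > >   unreachable
> > > ```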
> > > 
> > > Anyway, are you set on this approach, or would you consider the proposed alternative?
> > (Note that the compile-time numbers aren't final, because this still lacks
> > further relaxation of the profitability checks in the sinking logic.)
> > 
> > > My reading of the llvm compile time tracker results is that this patch may result in a modest (~1% or less) compile time increase. That could be in the noise.
> > 
> > FWIW I do agree that this will obviously affect the compile time.
> > 
> > > The cost may mostly come from the analysis invalidation, 
> > 
> > The *avoidable* portion of the cost.
> > The cost that comes from succeeding in the transformation,
> > and thereby enabling whatever transformation comes next, will remain.
> > 
> > I should mention that I believe that, for any non-trivial function,
> > simplifycfg will likely already report that it changed things,
> > and for trivial ones it shouldn't be too costly to recompute the analyses.
> > This may be a faulty view, but that is what I think.
> > 
> > > > The n% increase in code size is a pretty meaningless number,
> > > > and i'm sad to see it being used in such a harsh blocking manner.
> > > I think it is more constructive to think of the engagement here as early feedback. People will notice code size > increases and provide feedback, it's just a question of when. Reviewers are trying to be helpful.
> > Right, I agree, though it didn't quite read as such to me;
> > but then again, tone doesn't always round-trip perfectly through translation.
> > 
> > > which could be avoided if this were implemented in SinkCodeFromCommonPredecessors. I don't consider this a hard blocker if that is onerous.
> > > <...>
> > > Anyway, are you set on this approach, or would you consider the proposed alternative?
> > 
> > Hmm, wait, I lost my train of thought. What would be implemented where? The profitability check before tail-merging?
> > Or not doing the tail-merging eagerly/early, but instead, when visiting a function-terminating block,
> > scanning the function for all the blocks with the same terminator and only tail-merging iff we can actually sink?
> > The latter sounds like it would result in some quadratic behavior.
> > What i'm saying is that the effect this has is not unexpected, on the contrary, it is expected.
> > We successfully decrease the amount of IR bloat by assertion blocks,
> > decreasing the size of the functions they are in,
> > and naturally that makes some of them more eligible for inlining.
> > which happens, and increases codesize.
> 
> Just to double check: Has someone confirmed that the code size increases we see are indeed caused (or caused primarily) by the inlining interaction?
> 
> > I would like to also call-back to the disscussion in D101468,
> > where we had very much the same disscussion, and actually i argued that it was bad,
> > but @nikic argued that said change is good since we no longer overestimate the inlining cost,
> > and i if that leads to an overestimation, then the problem is in inliner.
> > I'm not sure how in this patch the views changed to diametrically opposite ones :)
> 
> My view here hasn't really changed //in principle//, but the practical aspects here are quite different. D101468 had some code size wins, some losses, and an overall 0.1% regression. Here we see many regressions in the 1-10% range. D101468 was all of "the right thing to do", had a clear motivation for vectorization and had fairly limited code size impact.
> 
> For this patch, it seems like "the right thing to do" on an abstract level (in the sense that we do the same thing for other terminators), but it also has a non-trivial code size impact and the overall motivation isn't clear to me. Or at least, the patch summary doesn't say what the larger motivation here is. I would have guessed that the motivation is to reduce code size by tail merging and sinking, but if in practice the reverse happens due to inlining interaction, then I'm not sure that really makes sense. Maybe all I'm really looking for is clarification on what the motivation / bigger picture here is?
> 
> > My reading of the llvm compile time tracker results is that this patch may result in a modest (~1% or less) compile time increase. That could be in the noise. 
> 
> @rnk: The way to read the numbers is: Anything colored is likely not noise. For most benchmarks, the noise floor is <0.05%.
> > What i'm saying is that the effect this has is not unexpected, on the contrary, it is expected.
> > We successfully decrease the amount of IR bloat by assertion blocks,
> > decreasing the size of the functions they are in,
> > and naturally that makes some of them more eligible for inlining.
> > which happens, and increases codesize.

> Just to double check: Has someone confirmed that the code size increases we see are indeed caused (or caused primarily) by the inlining interaction?

I have just compared the stats for the vanilla test-suite as a whole, and for 7zip specifically:
while there is an increase in assembly instruction count and an increase in tail duplication,
there is also an increase in the IR instruction count and a decrease in the function/block count
at the end of the middle-end pipeline, and finally a higher `inline.NumInlined`.
So I'm going to go with "yes".




Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D104870/new/

https://reviews.llvm.org/D104870


