[llvm] [TailDup] Allow large number of predecessors/successors without phis. (PR #116072)

Florian Hahn via llvm-commits llvm-commits at lists.llvm.org
Wed Jan 22 14:46:10 PST 2025


fhahn wrote:

> > > Are these inputs computed gotos or jump tables? If they are computed gotos, I think we can land #114990.
> > > In fact, I don’t see any difference between computed gotos and jump tables—both can include or exclude many PHIs. To me, the only difference is that a jump table has one extra jump at the source level compared to a computed goto. So, I believe that when users use computed gotos, they expect longer compile times and larger code size.
> > 
> > 
> > Ah, I missed the other PR, thanks! For this particular case it is a computed GOTO, but there are other computed GOTOs as well, and completely removing the cutoff increases the number of instructions by 10% without any gain.
> 
> I think this depends on the specifics of the CPU's branch prediction, which is why we sometimes cannot observe performance improvements.
> 
> > In general, I think we should try to avoid cut-offs if there are cases we can handle with reasonable compile times. I am not sure if completely ignoring the cutoff for computed GOTOs is the best way forward, as I don't see a fundamental difference between coming from jump tables or computed GOTOs.
> 
> For me, we can view this transformation as **restoring the CFG of the code written by the user**. This is the only difference I can see between jump tables (switch) and computed GOTOs. What I'm saying is that when users use computed GOTOs, they accept longer compile times and larger code size.
> 
> > Ideally we would allow cases we can reasonably handle (e.g. because there are no phis to add) and/or address the extra complexity of adding the additional edges to the CFG.
> 
> For me, it makes sense to apply this to jump tables under some limit as well.
> 
> > For the test-case from #106846 (http://www.jikos.cz/~mikulas/testcases/clang/computed-goto.c), tail duplication is applied with the current patch as well, but only at the inner level, I think. I don't have a suitable X86 machine to test whether it restores the original performance.
> 
> IMO, the current patch only addresses some specific scenarios. In fact, I believe most of the compile time is spent on the PHI nodes added after duplication. As we see it, there is no fundamental difference between jump tables and computed GOTOs, so I find it odd that this patch still prevents duplication for some computed GOTOs.

IIUC, https://github.com/llvm/llvm-project/pull/114990 in its current version only allows duplication for computed GOTOs? With this patch, there are no extra restrictions on the kinds of blocks we can duplicate; we just skip/relax the aggressive cut-offs when doing so won't hurt compile time (much).
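
For reference, here is a rough sketch of the two input shapes under discussion (not taken from the PR or its test cases; the function and opcode names are made up, the real case is the interpreter loop in the computed-goto.c test linked above):

```c
enum { OP_ADD, OP_SUB, OP_HALT };

/* Switch in a loop: typically lowered to a jump table with a single
   shared dispatch block at the top of the loop. */
int run_switch(const unsigned char *code) {
  int acc = 0;
  for (;;) {
    switch (*code++) {
    case OP_ADD: acc += *code++; break;
    case OP_SUB: acc -= *code++; break;
    default:     return acc;      /* OP_HALT */
    }
  }
}

/* Computed goto (GNU C): the user writes an indirect branch at the end
   of every handler, i.e. the dispatch is already duplicated in the
   source-level CFG. Earlier passes tend to merge these identical
   indirect branches into one block, and tail duplication is what
   re-splits them so each handler gets its own branch again. */
int run_goto(const unsigned char *code) {
  static void *handlers[] = { &&do_add, &&do_sub, &&do_halt };
  int acc = 0;
  goto *handlers[*code++];
do_add:
  acc += *code++;
  goto *handlers[*code++];
do_sub:
  acc -= *code++;
  goto *handlers[*code++];
do_halt:
  return acc;
}
```

In both shapes the block to duplicate ends in an indirect branch with many successors, which is exactly what the existing cut-offs guard against.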

It would be great if we could get this resolved one way or another for the Clang 20 release, as it currently causes a 2-3% performance loss for Python workloads on ARM64 :) I rebased and tested https://github.com/llvm/llvm-project/pull/114990 and unfortunately it doesn't yield the same perf gain (only in the noise, < 0.5%).
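
To make the "without phis" condition in the title a bit more concrete, here is a back-of-the-envelope sketch of how I think about the cost (my own illustration; `duplication_cost` is a hypothetical helper, not the actual heuristic in TailDuplicator):

```c
/* Rough cost sketch: duplicating a block with N predecessors and M
   successors replaces the original N+M edges with roughly N*M edges,
   and every value merged in the successors then needs a PHI operand
   per new edge. With no PHIs to update, only the edges are added,
   which is the cheap case this patch lets through. */
static unsigned long duplication_cost(unsigned long num_preds,
                                      unsigned long num_succs,
                                      unsigned long num_merged_values) {
  unsigned long new_edges = num_preds * num_succs;
  unsigned long new_phi_operands = num_merged_values * new_edges;
  return new_edges + new_phi_operands;
}
```

For a dispatch block with hundreds of predecessors and successors and values merged across it, that product is what blows up compile time; without the merged values it stays manageable.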

https://github.com/llvm/llvm-project/pull/116072

