[PATCH] D127115: [RFC][DAGCombine] Make sure combined nodes are added back to the worklist in topological order.

Amaury SECHET via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Mon Jun 6 13:10:10 PDT 2022


deadalnix added a comment.

Hi @fhahn ,

First, thanks for taking the time to look into this.

This is something we have wanted to do for quite some time; the discussion happened many months ago, but I was pulled away onto other things and can only make progress on this front now. Let me try to give you the main elements that motivate this change.

First, the current traversal order is implementation defined and, in practice, a bit all over the place (the specific order depends on implementation details of the combiner as well as on previous passes over the DAG). This causes a couple of problems:

- Optimizations requiring more than one combine are unreliable. You can get different results when combining two identical DAGs, depending on how the DAG was obtained in the first place, because that changes the order in which nodes are processed.
- Deep patterns end up with a combinatorial explosion of cases to match on their arguments, because there is no guarantee that those arguments have already been combined. This explosion makes certain combines impractical. Shallow patterns get away with it because a combined node adds its users back to the worklist, but that causes many nodes to make extra, unnecessary round trips through the combiner, which is wasteful (and it is also why this approach does not scale to deeper patterns: the performance impact would be catastrophic). See the sketch after this list.
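To make the contrast concrete, here is a minimal sketch of what "process in topological order" means; this is my own illustration, not the actual DAGCombiner code, and Node/appendTopological are made-up names:

    // Minimal illustration, not DAGCombiner code: a post-order DFS over the
    // operand edges yields a topological order for a DAG, so operands are
    // always visited (and thus already combined) before their users.
    #include <unordered_set>
    #include <vector>

    struct Node {
      std::vector<Node *> Operands; // nodes this one reads from
      std::vector<Node *> Users;    // nodes reading this one's result
    };

    void appendTopological(Node *N, std::unordered_set<Node *> &Visited,
                           std::vector<Node *> &Worklist) {
      if (!Visited.insert(N).second)
        return;
      for (Node *Op : N->Operands)
        appendTopological(Op, Visited, Worklist);
      Worklist.push_back(N); // every operand is queued before N itself
    }
    // Today's scheme instead re-queues N->Users whenever N is combined,
    // which is what causes the extra round trips described above.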

It must be noted that InstCombine (which does the same job on LLVM IR) processes instructions topologically for this very reason. In addition, all the other passes over the DAG, such as legalization, also process it topologically. DAGCombiner is the exception here.

The main problem we face in making this change happen is that we have optimized ourselves into a local minimum over the years, and it is hard to get out of it. @RKSimon gave a pretty good explanation of why this is so disruptive, and I copied it into the diff's description. Note that all of the optimizations that break here do so because they depend on the specific order the nodes are processed in to trigger - if they didn't, they wouldn't break. While they trigger for the test cases in the test suite, there is probably very similar code out in the wild for which they don't.

So what do we expect to get out of this change?

- Hopefully faster DAGCombine, because fewer round trips are required for dependent combines.
- Reliable and predictable combines: if the combiner handles a given test case, it will also handle similar code in the wild.
- The ability to work with deep patterns.

Why doesn't this show up well in the current test suite?

- What @RKSimon wrote.
- There are very few deep patterns in the combiner at the moment. Some of them are known to be needed, but cannot be added in a reasonable way at the moment.
- A fair amount of work has been poured into handling known cases where nodes can be processed in different orders, but many more exist in the wild; they just aren't in the test suite. I'm particularly affected by this one, because I implemented several of these fixes before realizing this wasn't a tractable way to handle the problem and that a more holistic approach was required. Nevertheless, we've been doing it for years, so we've covered most of the obvious holes that would show up in the test suite.

But this is quite abstract, so let's use a concrete example: https://godbolt.org/z/z6T7b4MMo . It does a large integer addition on a 32-bit machine. The expected codegen is, quite trivially, something like add/adc/adc/adc, but instead we get something really bad. This is because you effectively end up with two carry chains and cannot recombine them, as that requires matching deep patterns, which de facto don't work because we don't traverse the DAG in a predictable order. This also cannot be addressed earlier in the pipeline, because the problem only reveals itself post-legalization. It means that LLVM is condemned to do a bad job of codegen for large integer arithmetic.
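For reference, the kind of source behind that link looks roughly like the following (my reconstruction for illustration; the exact code on the link may differ), compiled for a 32-bit target:

    // Hypothetical reconstruction of the godbolt example: a 128-bit addition
    // that a 32-bit target has to legalize into four 32-bit additions chained
    // through the carry flag. _BitInt is a Clang/C23 extension, used here
    // only to get a 128-bit integer type on a 32-bit target.
    using u128 = unsigned _BitInt(128);

    u128 add128(u128 a, u128 b) {
      // Ideal lowering on a 32-bit target: add/adc/adc/adc. Without a
      // predictable combine order, the two carry chains produced by
      // legalization are never recombined and the generated code is far worse.
      return a + b;
    }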

I hope this sheds some light on things.



================
Comment at: llvm/test/CodeGen/AArch64/sve-fixed-length-ext-loads.ll:1
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
 ; RUN: llc -aarch64-sve-vector-bits-min=128  < %s | FileCheck %s -D#VBYTES=16  -check-prefix=NO_SVE
----------------
fhahn wrote:
> auto-generating the checks here makes it hard to see what the real difference in codegen is.
I created D127118 for that purpose.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D127115/new/

https://reviews.llvm.org/D127115


