[PATCH] D47407: [LoopInstSimplify] Re-implement the core logic of loop-instsimplify to be both simpler and substantially more efficient.

Wed May 30 09:57:37 PDT 2018

chandlerc added inline comments.

================
Comment at: llvm/trunk/lib/Transforms/Scalar/LoopInstSimplify.cpp:82
+    for (BasicBlock *BB : RPOT) {
+      for (Instruction &I : *BB) {
+        if (auto *PI = dyn_cast<PHINode>(&I))
----------------
mzolotukhin wrote:
> I wonder if it would be more efficient to iterate through `ToSimplify` instead of all instructions in all blocks. What do you think?
I think it's trickier than that, but let me know if you see a way that seems more promising.

We need to visit defs before uses to reliably converge on the simplified result without revisiting instructions even more times. And we don't know all of the instructions we will need to visit, we only know the first ones. Each time we simplify an instruction we (potentially) grow the `ToSimplify` set.

So to only look at the instructions in the `ToSimplify` set, I think we'd need a system like the following:

- First build a mapping from every instruction in the loop body to an integer so that they can be cheaply sorted in RPO. We could do this in the first iteration.
- Build a sorted worklist in addition to a `ToSimplify` set, and every time we add an instruction to the `ToSimplify` set, insert it into the worklist in the correct (based on sort) position.
- Walk that worklist rather than all the instructions in the loop body.

To build a worklist like this which has good algorithmic properties, what you'd probably want is to make `ToSimplify` actually a std::set or some other ordered search tree with cheap in-order insertion because we'll constantly be inserting into all kinds of different positions.

A std::set (or similar) tree data structure combined with the std::less needing to do two hash table lookups to get the sort key seemed like it would end up having a really frustratingly high overhead, and that's why I didn't immediately jump to this solution.

Another approach which doesn't have any better *algorithmic* properties in the worst case than the current one but might have hilariously better practical properties would be as follows:

- Build a vector of instruction pointers in RPO order during the initial traversal, and a map from instruction to vector index.
- Instead of using a ToSimplify set, use a sparse bitvector where the bit represents that the instruction at this index is in the set.
- Walk the sparse bitvector in order in all subsequent iterations, and then use the index of the set bits to find the instruction at that position in the RPO.

Technically, in the worst case, this is the same as the current approach -- for a loop with N instructions that we iterate on M times we do O(N*M) work. But with this approach, that work involves looking a bit in a fairly cache-optimized data structure rather than walking to a linked list node and testing it in a hash table. And as the iteration becomes sparse, we actually get algorithmic improvements that would likely help in the average case.

The down side is that it is much more complex and requires a decent amount of memory, which is why I didn't implement it at first.

I'm happy to pursue either of these if you and others thing it is worthwhile. I'm not sure we'll ever have a test case where this shows up as a practical problem, but that doesn't mean one won't show up. =D I mostly had trouble making the trade-offs here, and this patch seemed a strict improvement.

I'll admit I'm more reluctant about the first approach as I suspect its algorithmic scaling will be completely undermined by the practical cost of maintaining the sorted data structure.

But the second approach is super appealing. I didn't think super carefully about it when I wrote this the first time to convince myself it would be effective, but I'm happy to go do this work if you think its worthwhile based on my description.

Repository:
  rL LLVM

https://reviews.llvm.org/D47407