[PATCH] D39352: [SimplifyCFG] Don't do if-conversion if there is a long dependence chain

Chandler Carruth via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Wed Dec 27 22:13:07 PST 2017


chandlerc added a comment.

I don't think this patch's approach is the right one, and we should probably pursue Eli's suggestion instead. It also causes very significant regressions in benchmarks, so I think we should revert it for now.

The core problem I see here is that we're pretty radically changing the canonical form of the IR. This is going to be really disruptive. Doing this so early means that we can't recognize common, direct patterns in terms of select. While it would perhaps be better to teach everything to generically reason about either a select or a phi around control flow, that would be a really massive undertaking.
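For concreteness, the two canonical shapes in question look roughly like this (a hand-written sketch, not IR produced by this patch):

```llvm
; Shape 1: flat select form -- what SimplifyCFG's if-conversion
; produces, and what most mid-level combines currently pattern-match.
define i32 @max_select(i32 %a, i32 %b) {
  %cmp = icmp sgt i32 %a, %b
  %r = select i1 %cmp, i32 %a, i32 %b
  ret i32 %r
}

; Shape 2: branch-plus-phi form -- what remains if if-conversion is
; suppressed. Passes that only understand selects miss this pattern.
define i32 @max_phi(i32 %a, i32 %b) {
entry:
  %cmp = icmp sgt i32 %a, %b
  br i1 %cmp, label %then, label %merge
then:
  br label %merge
merge:
  %r = phi i32 [ %a, %then ], [ %b, %entry ]
  ret i32 %r
}
```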

And this isn't just a theoretical problem. This patch regresses performance on simple code patterns like counting things in sets by a huge amount, because there we were previously able to use dedicated hardware such as the `adc` instruction:
https://reviews.llvm.org/P8055, or inline:

  #include <bitset>
  #include <chrono>
  #include <iostream>
  #include <random>
  #include <vector>
  
  int main() {
    std::bitset<256> set;
    volatile unsigned seed = 0;
    std::mt19937 rng(seed);
    for (int i = 0; i < 256; ++i)
      set[i] = rng() < (rng.max() / 2) ? false : true;
  
    volatile unsigned size = 4096;
    std::vector<unsigned char> data;
    for (int i = 0; i < size; ++i)
      data.push_back(rng() & 0xFF);
  
    volatile int iterations = 100000;
    long int true_count = 0, all_count = 0;
  
    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < iterations; ++i)
      for (auto c : data) {
        if (set[c])
          ++true_count;
        ++all_count;
      }
    auto stop = std::chrono::high_resolution_clock::now();
  
    std::cout << "Did " << iterations << " iterations over " << size
              << " queries: " << true_count << " hits of " << all_count << "\n";
    std::cout << "Took "
              << std::chrono::duration<double, std::nano>(stop - start).count() /
                     iterations
              << " nanoseconds per iteration.\n";
  }

This benchmark runs over 2x slower for me on both AMD and Intel hardware. That seems like a really huge regression. And I have extracted this out of actual benchmark code we have internally. I think we should probably revert this patch until there is a clear consensus and/or more comprehensive performance data supporting a particular direction here.

Another way of looking at the problem with this approach is that it seems to be trying to solve an issue pretty specifically with `cmov`. We actually have a dedicated pass that tries *specifically* to reason about the length of dependency chains and the proportional cost of a `cmov` versus a conditional branch. That seems like a much more appropriate place to handle these things, and that is exactly what Eli pointed you at. I've done some work there as well, and I really think that is a good approach. If other architectures need similar logic, we could try to factor it out and share it between targets, but I think the ideas around what constitutes a long dependency will be somewhat architecture-specific. For example, I could easily imagine a predicated architecture where the patterns in this patch don't form long dependency chains.


Repository:
  rL LLVM

https://reviews.llvm.org/D39352
