[llvm-dev] The most efficient way to implement an integer based power function pow in LLVM

Francois Fayard via llvm-dev llvm-dev at lists.llvm.org
Fri Jan 13 00:32:33 PST 2017


Bare in mind that writing a very efficient x^n is not that straightforward. Even if it’s a corner cases, the example of x^7 is interesting. If you apply the divide and rule algorithm, you get the following algorithm:

double pow_7(double x) {
  const double x2 = x * x;
  const double x3 = x * x2;
  const double x6 = x3 * x3;
  const double x7 = x * x6;

  return x7;
}

But you can also write

double pow_7(double x) {
  const double x2 = x * x;
  const double x3 = x * x2;
  const double x4 = x2 * x2;
  const double x7 = x3 * x4;

  return x7;
}

which is a different algorithm using the same number of application. But, because of instruction level parallelism, they don’t get the same performance. In the first implementation every single instruction has to wait for the previous instruction to start. In the second one x3 and x4 could be computed using instruction level parallelism as they don’t depend upon each other.

The performance difference in between the two are (compiled with -O3 and -mavx2):
- clang 3.9.1: The first one is 25% slower than the second one
- intel 17.0.1: The first one is 05% slower than the second one
- gcc 6.2.0: The first one is 3% slower than the first one (but both are 2 times slower than clang and intel)

Attached is the code asserting my claim.

François Fayard

=====

#include <chrono>
#include <iostream>

int main() {
  const int n = 1000000000;
  const int length = 10;
  double* x = new double[length];
  for (int i = 0; i < length; ++i) {
    x[i] = 1.0;
  }

  asm volatile("" : : "g"(x) : "memory");

  auto begin = std::chrono::high_resolution_clock::now();
  for (int k = 0; k < n; ++k) {
    for (int i = 0; i < length; ++i) {
      const double x1 = x[i];
      const double x2 = x1 * x1;
      const double x3 = x1 * x2;
      const double x4 = x2 * x2;
      const double x7 = x3 * x4;
      x[i] = x7;
    }
  }
  auto end = std::chrono::high_resolution_clock::now();
  double time_fast =
      1.0e-9 *
      std::chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count();

  begin = std::chrono::high_resolution_clock::now();
  for (int k = 0; k < n; ++k) {
    for (int i = 0; i < length; ++i) {
      const double x1 = x[i];
      const double x2 = x1 * x1;
      const double x3 = x1 * x2;
      const double x6 = x3 * x3;
      const double x7 = x1 * x6;
      x[i] = x7;
    }
  }
  end = std::chrono::high_resolution_clock::now();
  double time_divide_and_rule =
      1.0e-9 *
      std::chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count();

  std::cout << "Optimized x^7: "
            << 1.0e9 * time_fast / (static_cast<double>(n) * length) << " ns"
            << std::endl;
  std::cout << "  Classic x^7: "
            << 1.0e9 * time_divide_and_rule / (static_cast<double>(n) * length)
            << " ns" << std::endl;

  delete[] x;

  return 0;
}

=====

> On Jan 9, 2017, at 6:43 PM, Wei Ding via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> 
> Hi,
> 
> I want an efficient way to implement function pow in LLVM instead of invoking pow() math built-in. For algorithm part, I am clear for the logic. But I am not quite sure for which parts of LLVM should I replace built-in pow with another efficient pow implementation. Any comments and feedback are appreciated. Thanks!




More information about the llvm-dev mailing list