# [llvm-dev] The most efficient way to implement an integer based power function pow in LLVM

mats petersson via llvm-dev llvm-dev at lists.llvm.org
Fri Jan 13 03:39:52 PST 2017

```Is the original question here not about pow(int, int), rather than
pow(float, int)? In which case the problem about rounding or precision
becomes less of an issue (of course, for larger values, there will be
overflows instead - but that's still a problem if you do `int a = pow(x, y)
where x**y is out of range for the int value).

In my experience, it's not at all unlikely to find code like `int a =
pow(2, 3);` in code, often with the resulting questions about "why is my
result 7, not 8", because even with 0.5 ulp, it's not (always) EXACTLY the
integer value that comes out of the function...

--
Mats

On 13 January 2017 at 08:32, Francois Fayard via llvm-dev <
llvm-dev at lists.llvm.org> wrote:

> Bare in mind that writing a very efficient x^n is not that
> straightforward. Even if it’s a corner cases, the example of x^7 is
> interesting. If you apply the divide and rule algorithm, you get the
> following algorithm:
>
> double pow_7(double x) {
>   const double x2 = x * x;
>   const double x3 = x * x2;
>   const double x6 = x3 * x3;
>   const double x7 = x * x6;
>
>   return x7;
> }
>
> But you can also write
>
> double pow_7(double x) {
>   const double x2 = x * x;
>   const double x3 = x * x2;
>   const double x4 = x2 * x2;
>   const double x7 = x3 * x4;
>
>   return x7;
> }
>
> which is a different algorithm using the same number of application. But,
> because of instruction level parallelism, they don’t get the same
> performance. In the first implementation every single instruction has to
> wait for the previous instruction to start. In the second one x3 and x4
> could be computed using instruction level parallelism as they don’t depend
> upon each other.
>
> The performance difference in between the two are (compiled with -O3 and
> -mavx2):
> - clang 3.9.1: The first one is 25% slower than the second one
> - intel 17.0.1: The first one is 05% slower than the second one
> - gcc 6.2.0: The first one is 3% slower than the first one (but both are 2
> times slower than clang and intel)
>
> Attached is the code asserting my claim.
>
> François Fayard
>
> =====
>
> #include <chrono>
> #include <iostream>
>
> int main() {
>   const int n = 1000000000;
>   const int length = 10;
>   double* x = new double[length];
>   for (int i = 0; i < length; ++i) {
>     x[i] = 1.0;
>   }
>
>   asm volatile("" : : "g"(x) : "memory");
>
>   auto begin = std::chrono::high_resolution_clock::now();
>   for (int k = 0; k < n; ++k) {
>     for (int i = 0; i < length; ++i) {
>       const double x1 = x[i];
>       const double x2 = x1 * x1;
>       const double x3 = x1 * x2;
>       const double x4 = x2 * x2;
>       const double x7 = x3 * x4;
>       x[i] = x7;
>     }
>   }
>   auto end = std::chrono::high_resolution_clock::now();
>   double time_fast =
>       1.0e-9 *
>       std::chrono::duration_cast<std::chrono::nanoseconds>(end -
> begin).count();
>
>   begin = std::chrono::high_resolution_clock::now();
>   for (int k = 0; k < n; ++k) {
>     for (int i = 0; i < length; ++i) {
>       const double x1 = x[i];
>       const double x2 = x1 * x1;
>       const double x3 = x1 * x2;
>       const double x6 = x3 * x3;
>       const double x7 = x1 * x6;
>       x[i] = x7;
>     }
>   }
>   end = std::chrono::high_resolution_clock::now();
>   double time_divide_and_rule =
>       1.0e-9 *
>       std::chrono::duration_cast<std::chrono::nanoseconds>(end -
> begin).count();
>
>   std::cout << "Optimized x^7: "
>             << 1.0e9 * time_fast / (static_cast<double>(n) * length) << "
> ns"
>             << std::endl;
>   std::cout << "  Classic x^7: "
>             << 1.0e9 * time_divide_and_rule / (static_cast<double>(n) *
> length)
>             << " ns" << std::endl;
>
>   delete[] x;
>
>   return 0;
> }
>
> =====
>
> > On Jan 9, 2017, at 6:43 PM, Wei Ding via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
> >
> > Hi,
> >
> > I want an efficient way to implement function pow in LLVM instead of
> invoking pow() math built-in. For algorithm part, I am clear for the logic.
> But I am not quite sure for which parts of LLVM should I replace built-in
> pow with another efficient pow implementation. Any comments and feedback
> are appreciated. Thanks!
>
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170113/51caf9be/attachment.html>
```