[llvm-dev] Should llvm optimize 1.0 / x ?

Tue Sep 1 00:05:11 PDT 2020

On 9/1/20 1:44 AM, Alexandre Bique via llvm-dev wrote:
> Hi Quentin,
>
> You are correct, I could manage to get clang to use vrcpps, but not in
> a satisfying way:
>
> clang++ -O3 -march=native -mtune=native \
> -Rpass=loop-vectorize -Rpass-missed=loop-vectorize \
> -Rpass-analysis=loop-vectorize \
> -ffast-math -ffp-model=fast -ffp-exception-behavior=ignore -ffp-contract=fast \
> -c -o vec.o vec.cc
>
> 0000000000000140 <_Z4fct4Dv4_f>:
>   140: c5 f8 53 c8          vrcpps %xmm0,%xmm1
>   144: c4 e2 79 18 15 00 00 vbroadcastss 0x0(%rip),%xmm2        # 14d <_Z4fct4Dv4_f+0xd>
>   14b: 00 00
>   14d: c4 e2 71 ac c2        vfnmadd213ps %xmm2,%xmm1,%xmm0
>   152: c4 e2 71 98 c1        vfmadd132ps %xmm1,%xmm1,%xmm0
>   157: c3                    retq
>   158: 0f 1f 84 00 00 00 00 nopl   0x0(%rax,%rax,1)
>   15f: 00
>
> 0000000000000160 <_Z4fct5Dv4_f>:
>   160: c5 f8 53 c0          vrcpps %xmm0,%xmm0
>   164: c3                    retq
>
> As you can see, fct4 is not equivalent to fct5.

Perhaps it's better ;)

It looks like the compiler has generated one Newton-Raphson iteration
after the estimate to increase the precision of the answer. The
reciprocal estimate is, after all, only an estimate, and for many
applications it is not sufficient on its own.
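
To make the refinement step concrete, here is a minimal scalar sketch of
what the two FMA instructions in fct4 appear to compute: starting from an
estimate y0 of 1/x, it forms e = 1 - x*y0 and returns y0 + e*y0. The
helper names (coarse_recip, refine_recip, rel_error) are made up for
illustration, and coarse_recip is only a portable stand-in that truncates
the mantissa of 1/x; it does not reproduce the actual vrcpps result.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for a hardware reciprocal estimate: computes 1/x and
// clears the low 12 mantissa bits, leaving only a coarse approximation.
// Assumption: this merely mimics a low-precision estimate, not vrcpps itself.
inline float coarse_recip(float x) {
    assert(std::isfinite(x) && x != 0.0f);
    float y = 1.0f / x;
    std::uint32_t bits;
    std::memcpy(&bits, &y, sizeof bits);
    bits &= 0xFFFFF000u;                 // drop the low 12 mantissa bits
    std::memcpy(&y, &bits, sizeof bits);
    return y;
}

// One Newton-Raphson step for f(y) = 1/y - x:  y1 = y0 + y0 * (1 - x * y0).
inline float refine_recip(float x, float y0) {
    float e = std::fma(-x, y0, 1.0f);    // e = 1 - x*y0   (cf. vfnmadd213ps)
    return std::fma(e, y0, y0);          // y0 + e*y0      (cf. vfmadd132ps)
}

// Relative error of y as an approximation of 1/x, i.e. |x*y - 1|.
inline float rel_error(float x, float y) {
    return std::fabs(std::fma(x, y, -1.0f));
}
```

Calling refine_recip(x, coarse_recip(x)) should roughly square the
relative error of the estimate, which is why a single step is usually
enough to get close to full single precision. On x86 one could presumably
substitute _mm_rcp_ps from <xmmintrin.h> for the stand-in estimate.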

This behavior is generally adjustable. Try using -mrecip=vec-divf:0 (or
-mrecip=all:0) to set the number of Newton refinement steps to zero.

  -Hal

>
> Regards,
> Alexandre Bique
>
> On Tue, Sep 1, 2020 at 12:59 AM Quentin Colombet <qcolombet at apple.com> wrote:
>> Hi Alexandre,
>>
>> Have you tried to compile this with fast-math enabled (`-ffast-math` https://clang.llvm.org/docs/UsersManual.html#controlling-floating-point-behavior)?
>>
>> I would expect LLVM to require the `arcp` flag to perform this optimization (https://www.llvm.org/docs/LangRef.html#fast-math-flags).
>>
>> Cheers,
>> -Quentin
>>
>>
>>> On Aug 31, 2020, at 2:21 PM, Alexandre Bique via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>>>
>>> Hi,
>>>
>>> Here is a small C++ program:
>>>
>>> vec.cc:
>>>
>>> #include <cmath>
>>>
>>> using v4f32 = float __attribute__((__vector_size__(16)));
>>>
>>> v4f32 fct1(v4f32 x)
>>> {
>>>   return 1.0 / x;
>>> }
>>>
>>> v4f32 fct2(v4f32 x)
>>> {
>>>   return __builtin_ia32_rcpps(x);
>>> }
>>>
>>> Which is compiled to:
>>>
>>> vec.o:     file format elf64-x86-64
>>>
>>>
>>> Disassembly of section .text:
>>>
>>> 0000000000000000 <_Z4fct1Dv4_f>:
>>>    0: c4 e2 79 18 0d 00 00 vbroadcastss 0x0(%rip),%xmm1        # 9 <_Z4fct1Dv4_f+0x9>
>>>    7: 00 00
>>>    9: c5 f0 5e c0          vdivps %xmm0,%xmm1,%xmm0
>>>    d: c3                    retq
>>>    e: 66 90                xchg   %ax,%ax
>>>
>>> 0000000000000010 <_Z4fct2Dv4_f>:
>>>   10: c5 f8 53 c0          vrcpps %xmm0,%xmm0
>>>   14: c3                    retq
>>>
>>>
>>> As you can see, 1.0 / x is not turned into vrcpps. Is it because of
>>> precision or a missing optimization?
>>>
>>> Regards,
>>> --
>>> Alexandre Bique
>>> _______________________________________________
>>> LLVM Developers mailing list
>>> llvm-dev at lists.llvm.org
>>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory