[llvm-dev] [cfe-dev] FE_INEXACT being set for an exact conversion from float to unsigned long long

Thu Apr 20 19:23:08 PDT 2017

> On 21 Apr 2017, at 12:30 PM, Kaylor, Andrew <andrew.kaylor at intel.com> wrote:
> 
> I think it’s generally true that whenever branches can reliably be predicted branching is faster than a cmov that involves speculative execution, and I would guess that your assessment regarding looping on input values is probably correct.
>  
> I believe the code that actually creates most of the transformation you’re interested in here is in SelectionDAGLegalize::ExpandNode() in LegalizeDAG.cpp.  The X86 backend sets a table entry indicating that FP_TO_UINT should be expanded for these value types, but the actual expansion is in target-independent code.  This is what it looks like in the version I last fetched:
>  
>   case ISD::FP_TO_UINT: {
>     SDValue True, False;
>     EVT VT =  Node->getOperand(0).getValueType();
>     EVT NVT = Node->getValueType(0);
>     APFloat apf(DAG.EVTToAPFloatSemantics(VT),
>                 APInt::getNullValue(VT.getSizeInBits()));
>     APInt x = APInt::getSignBit(NVT.getSizeInBits());
>     (void)apf.convertFromAPInt(x, false, APFloat::rmNearestTiesToEven);
>     Tmp1 = DAG.getConstantFP(apf, dl, VT);
>     Tmp2 = DAG.getSetCC(dl, getSetCCResultType(VT),
>                         Node->getOperand(0),
>                         Tmp1, ISD::SETLT);
>     True = DAG.getNode(ISD::FP_TO_SINT, dl, NVT, Node->getOperand(0));
>     // TODO: Should any fast-math-flags be set for the FSUB?
>     False = DAG.getNode(ISD::FP_TO_SINT, dl, NVT,
>                         DAG.getNode(ISD::FSUB, dl, VT,
>                                     Node->getOperand(0), Tmp1));
>     False = DAG.getNode(ISD::XOR, dl, NVT, False,
>                         DAG.getConstant(x, dl, NVT));
>     Tmp1 = DAG.getSelect(dl, NVT, Tmp2, True, False);
>     Results.push_back(Tmp1);
>     break;
>   }
>  
> The tricky bit here is that this code is asking for a Select and then something else will decide whether that select should be implemented as a branch or a cmov.

Good. I had found ISD::FP_TO_UINT but had not found the target-independent code, as I was digging in llvm/lib/Target/X86. I had in fact just started looking at the target-independent code after realising the expansion was likely not target specific. This issue could potentially affect any hard-float target with IEEE-754 accrued exceptions and conditional moves, as the unconditional FSUB will set INEXACT whenever the subtraction has to round, even when the select discards its result.
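
To make the effect concrete, here is a minimal C++ sketch of the two lowering styles. It is not the LLVM code itself. The function names (fp_to_uint_select_like, fp_to_uint_branchy, inexact_after) are made up for illustration, and it assumes a target where the <cfenv> exception flags are tracked. The volatile qualifiers are only there to discourage the compiler from folding or sinking the subtraction, since optimisers do not always honour the floating-point environment. Because an out-of-range float-to-integer conversion is undefined in C++, only the subtraction is speculated here; the conversions themselves stay guarded.

```cpp
#include <cassert>
#include <cfenv>
#include <cstdint>

// 2^63 is exactly representable as a float.
static const float kTwo63 = 9223372036854775808.0f;
static const std::uint64_t kSignBit = 0x8000000000000000ULL;

// Select-style sketch: the subtraction runs for every input, as the
// unconditional FSUB does, even when its result is not used.
std::uint64_t fp_to_uint_select_like(float x) {
    volatile float in = x;
    volatile float diff = in - kTwo63;  // always executed; rounds for small x
    float v = in;
    if (v < kTwo63)
        return static_cast<std::uint64_t>(static_cast<std::int64_t>(v));
    float d = diff;
    return static_cast<std::uint64_t>(static_cast<std::int64_t>(d)) ^ kSignBit;
}

// Branch-style sketch: the subtraction only runs for inputs >= 2^63.
std::uint64_t fp_to_uint_branchy(float x) {
    volatile float in = x;
    float v = in;
    if (v < kTwo63)
        return static_cast<std::uint64_t>(static_cast<std::int64_t>(v));
    volatile float diff = v - kTwo63;   // exact for float inputs in [2^63, 2^64)
    float d = diff;
    return static_cast<std::uint64_t>(static_cast<std::int64_t>(d)) ^ kSignBit;
}

// Clears the accrued flags, runs one conversion, and reports whether
// FE_INEXACT was raised. The converted value is stored through `out`.
bool inexact_after(std::uint64_t (*convert)(float), float x, std::uint64_t *out) {
    std::feclearexcept(FE_ALL_EXCEPT);
    *out = convert(x);
    return std::fetestexcept(FE_INEXACT) != 0;
}
```

For an input such as 1.0f both functions return 1, but only the select-style version should leave FE_INEXACT set, because 1.0f - 2^63 cannot be represented exactly in a float.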

I can see comments in LowerSELECT in lib/Target/X86/X86ISelLowering.cpp regarding the selection of branch or cmov, and I wonder whether the DAG can be matched there or whether the fix belongs in target-independent code.

It seems like a SELECT node with a sufficiently large number of child nodes should use a branch instead of a conditional move. I wonder about the cost model for predicate logic and cmov. Modern branch predictors are actually pretty good, so if LLVM X86 is using predication when the cost of a branch is lower, it could result in a loss of performance. I’m now curious about the more general possibility of controlling whether SELECT is lowered to branches or to predication using cmov. Can this be controlled? Anecdotally, the RISC-V CPU architects recommend branches over predicate logic, as in their case (Rocket) a branch mispredict costs only 3 cycles.

BTW - semi off-topic. The RISC-V interpreter I am working on seems to be a pathological test case for the LLVM/Clang optimiser (-O3) compared with GCC (-O3), with LLVM/Clang producing code that takes nearly twice as long to run as GCC’s. I don’t know exactly what I’ve done for this to happen; too many switch statements, I suspect. Branchy code versus predication perhaps? Branchiness might also explain GCC’s lead on SciMark Monte Carlo, assuming Monte Carlo is branchy. Now I am guessing, although after some googling I see that clang generates x86_64 asm that prefers predication where gcc uses branches. Note that this CPU simulator test requires the RISC-V GCC toolchain to be installed.

Here are step-by-step instructions for anyone interested in a pathological optimiser test case for Clang:

- https://github.com/riscv/riscv-gnu-toolchain/
- https://github.com/michaeljclark/riscv-meta/

$ git clone https://github.com/riscv/riscv-gnu-toolchain.git
$ git clone https://github.com/michaeljclark/riscv-meta.git
$ cd riscv-gnu-toolchain
$ export RISCV=/opt/riscv-gnu-toolchain
$ ./configure --prefix=$RISCV
$ make
$ cd ..
$ cd riscv-meta
$ git submodule update --init --recursive
$ export RISCV=/opt/riscv-gnu-toolchain
$ make -j4 CXX=g++ V=1
$ make test-build
$ time ./build/linux_x86_64/bin/rv-sim build/riscv64-unknown-elf/bin/test-sha512
ebdd6f20865ff41e3613b633b93c9b89c15d58fd9d64497f5b22554a7fe33757357cfa622f6fb4f40beadc02d18539ecd79e2da126b662839d296c41acbc2

real	0m28.280s
user	0m28.280s
sys	0m0.000s

$ make clean
$ make -j4 CXX=clang++-3.9 V=1
$ make test-build
$ time ./build/linux_x86_64/bin/rv-sim build/riscv64-unknown-elf/bin/test-sha512
ebdd6f20865ff41e3613b633b93c9b89c15d58fd9d64497f5b22554a7fe33757357cfa622f6fb4f40beadc02d18539ecd79e2da126b662839d296c41acbc2

real	0m52.533s
user	0m52.532s
sys	0m0.000s

$ g++ --version
g++ (Debian 6.3.0-6) 6.3.0 20170205
Copyright (C) 2016 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ clang++-3.9 --version
clang version 3.9.0-6 (tags/RELEASE_390/final)
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

There is also a RISC-V -> x86_64 JIT engine (the x86_64 JIT currently covers the RISC-V integer ISA; hard float coming soon…):

$ time ./build/linux_x86_64/bin/rv-jit build/riscv64-unknown-elf/bin/test-sha512
ebdd6f20865ff41e3613b633b93c9b89c15d58fd9d64497f5b22554a7fe33757357cfa622f6fb4f40beadc02d18539ecd79e2da126b662839d296c41acbc2

real	0m0.838s
user	0m0.840s
sys	0m0.000s

For typical native code, Clang and GCC produce binaries that perform the same:

$ clang -O3 src/test/test-sha512.c -o test-sha512
$ time ./test-sha512 
ebdd6f20865ff41e3613b633b93c9b89c15d58fd9d64497f5b22554a7fe33757357cfa622f6fb4f40beadc02d18539ecd79e2da126b662839d296c41acbc2

real	0m0.285s
user	0m0.280s
sys	0m0.004s

$ gcc -O3 src/test/test-sha512.c -o test-sha512 
$ time ./test-sha512 
ebdd6f20865ff41e3613b633b93c9b89c15d58fd9d64497f5b22554a7fe33757357cfa622f6fb4f40beadc02d18539ecd79e2da126b662839d296c41acbc2

real	0m0.285s
user	0m0.284s
sys	0m0.000s

Michael.