[llvm-dev] [cfe-dev] FE_INEXACT being set for an exact conversion from float to unsigned long long

Fri Apr 21 00:30:03 PDT 2017

> On 21 Apr 2017, at 2:23 PM, Michael Clark <michaeljclark at mac.com> wrote:
> 
> 
>> On 21 Apr 2017, at 12:30 PM, Kaylor, Andrew <andrew.kaylor at intel.com> wrote:
>> 
>> I think it’s generally true that whenever branches can reliably be predicted branching is faster than a cmov that involves speculative execution, and I would guess that your assessment regarding looping on input values is probably correct.

Yes, it’s based on the assumption that val <= LLONG_MAX, i.e. that the branch is predicted correctly. That may not always be the case, but defeating the branch predictor would take an unpredictable mix of values <= LLONG_MAX and > LLONG_MAX. I was curious and microbenchmarked it:

- https://godbolt.org/g/ytgk7l

Best of 10 runs on a MacBook Pro (Ivy Bridge, Intel Core i7-3740QM):

$ time fcvt-branch

real	0m0.208s
user	0m0.201s
sys	0m0.002s

$ time fcvt-cmov

real	0m0.241s
user	0m0.235s
sys	0m0.002s
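
For readers who do not want to open the link, here is a rough C sketch of the kind of kernel such a microbenchmark would time. It is not the code behind the godbolt link; the names f2u_branch, fill_inputs and sum_converted are made up for illustration, and it assumes inputs in the range 0 <= x < 2^64. Note that a compiler is free to turn the if into a cmov (or the reverse), so pinning one form down for a real measurement would probably need inline assembly or a look at the generated code.

```c
#include <stddef.h>
#include <stdint.h>

/* 2^63 as a float; values at or above this do not fit in a signed 64-bit integer. */
#define TWO63 0x1p63f

/* Branch form (hypothetical helper): only one conversion path runs per input.
   Assumes 0 <= x < 2^64; anything else is undefined behaviour in C. */
static uint64_t f2u_branch(float x)
{
    if (x < TWO63)
        return (uint64_t)(int64_t)x;                 /* common case */
    return (uint64_t)(int64_t)(x - TWO63) ^ (UINT64_C(1) << 63);
}

/* Fill v[0..n) with inputs. With mixed == 0 every value is below 2^63, so the
   branch above is trivially predictable. With mixed != 0 a small LCG picks the
   side of 2^63 pseudo-randomly, which is the case that could defeat the
   predictor. */
static void fill_inputs(float *v, size_t n, int mixed)
{
    uint32_t s = 12345u;
    for (size_t i = 0; i < n; i++) {
        s = s * 1664525u + 1013904223u;              /* 32-bit LCG step */
        float small = (float)(s >> 9);               /* < 2^23, exact in float */
        /* 2^63 + small * 2^40 stays below 2^64 and is exactly representable */
        v[i] = (mixed && (s >> 31)) ? TWO63 + small * 0x1p40f : small;
    }
}

/* Kernel to time: sum of the converted values (wraps modulo 2^64). */
static uint64_t sum_converted(const float *v, size_t n)
{
    uint64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += f2u_branch(v[i]);
    return acc;
}
```

Calling fill_inputs once with mixed set to 0 and once with it set to 1, then timing sum_converted over a large array in each case, would show how much the result depends on the branch being predictable.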

>>  I believe the code that actually creates most of the transformation you’re interested in here is in SelectionDAGLegalize::ExpandNode() in LegalizeDAG.cpp.  The X86 backend sets a table entry indicating that FP_TO_UINT should be expanded for these value types, but the actual expansion is in target-independent code.  This is what it looks like in the version I last fetched:
>>  
>>   case ISD::FP_TO_UINT: {
>>     SDValue True, False;
>>     EVT VT =  Node->getOperand(0).getValueType();
>>     EVT NVT = Node->getValueType(0);
>>     APFloat apf(DAG.EVTToAPFloatSemantics(VT),
>>                 APInt::getNullValue(VT.getSizeInBits()));
>>     APInt x = APInt::getSignBit(NVT.getSizeInBits());
>>     (void)apf.convertFromAPInt(x, false, APFloat::rmNearestTiesToEven);
>>     Tmp1 = DAG.getConstantFP(apf, dl, VT);
>>     Tmp2 = DAG.getSetCC(dl, getSetCCResultType(VT),
>>                         Node->getOperand(0),
>>                         Tmp1, ISD::SETLT);
>>     True = DAG.getNode(ISD::FP_TO_SINT, dl, NVT, Node->getOperand(0));
>>     // TODO: Should any fast-math-flags be set for the FSUB?
>>     False = DAG.getNode(ISD::FP_TO_SINT, dl, NVT,
>>                         DAG.getNode(ISD::FSUB, dl, VT,
>>                                     Node->getOperand(0), Tmp1));
>>     False = DAG.getNode(ISD::XOR, dl, NVT, False,
>>                         DAG.getConstant(x, dl, NVT));
>>     Tmp1 = DAG.getSelect(dl, NVT, Tmp2, True, False);
>>     Results.push_back(Tmp1);
>>     break;
>>   }
>>  
>> The tricky bit here is that this code is asking for a Select and then something else will decide whether that select should be implemented as a branch or a cmov.
> 
> Good. I had found ISD::FP_TO_UINT but not the target-independent code, as I was digging in llvm/lib/Target/X86. I had in fact just started looking at the target-independent code after realising the issue was likely not target-specific. It could potentially affect any hard-float target with IEEE-754 accrued exceptions and conditional moves, as the unconditional FSUB will set INEXACT.
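
To make the problem concrete, here is a minimal C rendering of what the expansion quoted above computes when the SELECT becomes a conditional move, so that both sides are evaluated. It is only a sketch: fsub_bias and f2u_select are hypothetical names, inputs are assumed to satisfy 0 <= x < 2^64, and the True conversion is guarded purely because an out-of-range float-to-int64 cast is undefined behaviour in C (the DAG just discards the unselected value). For x = 1.0f the exact difference 1 - 2^63 is not representable in a float, so the subtraction rounds to -2^63. On IEEE-754 hardware that rounding is what raises INEXACT, even though the selected result is exact.

```c
#include <stdint.h>

#define TWO63 0x1p63f   /* 2^63, exactly representable as a float */

/* The FSUB from the expansion, executed unconditionally. The volatile store
   forces the result to be rounded to float precision. */
static float fsub_bias(float x)
{
    volatile float d = x - TWO63;
    return d;
}

/* Select form of FP_TO_UINT for 0 <= x < 2^64 (hypothetical helper). */
static uint64_t f2u_select(float x)
{
    float    d     = fsub_bias(x);                  /* runs even when x < 2^63 */
    int      small = x < TWO63;                     /* SETLT */
    int64_t  t     = small ? (int64_t)x : 0;        /* True:  FP_TO_SINT(x), guarded */
    int64_t  f     = (int64_t)d;                    /* False: FP_TO_SINT(x - 2^63), always in range */
    uint64_t fx    = (uint64_t)f ^ (UINT64_C(1) << 63);   /* XOR with the sign bit */
    return small ? (uint64_t)t : fx;                /* SELECT */
}
```

f2u_select(1.0f) returns 1, yet fsub_bias(1.0f) has already produced the rounded value -2^63 along the way. Wrapping the call in feclearexcept(FE_ALL_EXCEPT) and fetestexcept(FE_INEXACT) from <fenv.h> would be one way to observe the flag directly.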
> 
> I can see comments in LowerSELECT in lib/Target/X86/X86ISelLowering.cpp regarding the choice of branch or cmov, and wonder whether the DAG can be matched there or whether the fix belongs in target-independent code.
> 
> It seems like a SELECT node with a sufficiently large number of child nodes should use a branch instead of a conditional move. I wonder about the cost model for predication and cmov. Modern branch predictors are pretty good, so if LLVM X86 uses predication where a branch would cost less, it could lose performance. I’m now curious about the more general possibility of controlling whether SELECT is lowered to branches or to predication using cmov. Can this be controlled? Anecdotally, the RISC-V CPU architects recommend branches over predication, as in their case (Rocket) a branch mispredict costs only 3 cycles.
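
On the question of control from the source side: one knob that does exist in both GCC and Clang is __builtin_expect, which attaches a probability hint to a condition. The sketch below shows where it would go in the conversion; LIKELY and f2u_hinted are illustrative names, and inputs are again assumed to be in 0 <= x < 2^64. Whether the hint actually changes how this particular select is lowered is an assumption, not something measured here; it is a hint, not a guarantee of a branch.

```c
#include <stdint.h>

#define TWO63 0x1p63f   /* 2^63 as a float */

/* Probability hint; falls back to the plain condition on other compilers. */
#if defined(__GNUC__) || defined(__clang__)
#define LIKELY(c) __builtin_expect(!!(c), 1)
#else
#define LIKELY(c) (c)
#endif

/* Conversion with the common case (x < 2^63) marked as likely. */
static uint64_t f2u_hinted(float x)
{
    if (LIKELY(x < TWO63))
        return (uint64_t)(int64_t)x;
    return (uint64_t)(int64_t)(x - TWO63) ^ (UINT64_C(1) << 63);
}
```

Inspecting the generated assembly with and without the hint is the only way to tell whether it made a difference for a given compiler version and optimisation level.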
> 
> BTW, semi off-topic: the RISC-V interpreter I am working on seems to be a pathological test case for the LLVM/Clang optimiser (-O3) compared with GCC (-O3), with LLVM/Clang producing code that takes nearly twice as long to run as GCC’s. I don’t know exactly what I’ve done for this to happen; too many switch statements, I suspect. Branchy code versus predication, perhaps? Branchiness might also explain GCC’s lead on SciMark Monte Carlo, assuming Monte Carlo is branchy. I am guessing now, although after some googling I see that clang generates x86_64 asm that prefers predication where gcc prefers branches. Note that this CPU simulator test requires the RISC-V GCC toolchain to be installed.
> 
> Here is a step by step for anyone interested in a pathological optimiser test case for Clang:
> 
> - https://github.com/riscv/riscv-gnu-toolchain/
> - https://github.com/michaeljclark/riscv-meta/
> 
> $ git clone https://github.com/riscv/riscv-gnu-toolchain.git
> $ git clone https://github.com/michaeljclark/riscv-meta.git
> $ cd riscv-gnu-toolchain
> $ export RISCV=/opt/riscv-gnu-toolchain
> $ ./configure --prefix=$RISCV
> $ make
> $ cd ..
> $ cd riscv-meta
> $ git submodule update --init --recursive
> $ export RISCV=/opt/riscv-gnu-toolchain
> $ make -j4 CXX=g++ V=1
> $ make test-build
> $ time ./build/linux_x86_64/bin/rv-sim build/riscv64-unknown-elf/bin/test-sha512
> ebdd6f20865ff41e3613b633b93c9b89c15d58fd9d64497f5b22554a7fe33757357cfa622f6fb4f40beadc02d18539ecd79e2da126b662839d296c41acbc2
> 
> real	0m28.280s
> user	0m28.280s
> sys	0m0.000s
> 
> $ make clean
> $ make -j4 CXX=clang++-3.9 V=1
> $ make test-build
> $ time ./build/linux_x86_64/bin/rv-sim build/riscv64-unknown-elf/bin/test-sha512
> ebdd6f20865ff41e3613b633b93c9b89c15d58fd9d64497f5b22554a7fe33757357cfa622f6fb4f40beadc02d18539ecd79e2da126b662839d296c41acbc2
> 
> real	0m52.533s
> user	0m52.532s
> sys	0m0.000s
> 
> $ g++ --version
> g++ (Debian 6.3.0-6) 6.3.0 20170205
> Copyright (C) 2016 Free Software Foundation, Inc.
> This is free software; see the source for copying conditions.  There is NO
> warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
> 
> $ clang++-3.9 --version
> clang version 3.9.0-6 (tags/RELEASE_390/final)
> Target: x86_64-pc-linux-gnu
> Thread model: posix
> InstalledDir: /usr/bin
> 
> 
> There is also a RISC-V -> x86_64 JIT engine (the x86_64 JIT currently covers the RISC-V integer ISA; hard float coming soon…):
> 
> $ time ./build/linux_x86_64/bin/rv-jit build/riscv64-unknown-elf/bin/test-sha512
> ebdd6f20865ff41e3613b633b93c9b89c15d58fd9d64497f5b22554a7fe33757357cfa622f6fb4f40beadc02d18539ecd79e2da126b662839d296c41acbc2
> 
> real	0m0.838s
> user	0m0.840s
> sys	0m0.000s
> 
> 
> For typical native code, Clang and GCC produce binaries that perform the same:
> 
> $ clang -O3 src/test/test-sha512.c -o test-sha512
> $ time ./test-sha512 
> ebdd6f20865ff41e3613b633b93c9b89c15d58fd9d64497f5b22554a7fe33757357cfa622f6fb4f40beadc02d18539ecd79e2da126b662839d296c41acbc2
> 
> real	0m0.285s
> user	0m0.280s
> sys	0m0.004s
> 
> $ gcc -O3 src/test/test-sha512.c -o test-sha512 
> $ time ./test-sha512 
> ebdd6f20865ff41e3613b633b93c9b89c15d58fd9d64497f5b22554a7fe33757357cfa622f6fb4f40beadc02d18539ecd79e2da126b662839d296c41acbc2
> 
> real	0m0.285s
> user	0m0.284s
> sys	0m0.000s
> 
> Michael.
