[LLVMdev] X86TargetLowering::LowerToBT
Mehdi Amini
mehdi.amini at apple.com
Mon Jan 19 10:53:12 PST 2015
> On Jan 19, 2015, at 10:29 AM, Chris Sears <chris.sears at gmail.com> wrote:
>
> Looking at the Intel Optimization Reference Manual, page 14-14, for Atom
>
> BT m16, imm8, BT mem, imm8 latency 2,1 throughput 1
> BT m16, r16, BT mem, reg latency 10, 9, throughput 8
> BT reg, imm8, BT reg, reg latency 1, throughput 1
>
> On C-26 they lower that throughput to 0.5 clock cycle for BT reg, imm8.
>
> The posted functions were simplified for tracking down the code generation problem. In general, the comparison between using BTQ reg,imm vs SHRQ/ANDQ for bit testing is even worse because you have to MOVE the tested reg to a temporary before the SHRQ/ANDQ. And all of these instructions require a REX prefix (well, not the AND). The result is some code bloat (3 instructions vs 1) and a little register pressure.
I’m not an X86 expert, but I’d still like to understand why you are comparing 1 instruction to 3. The results do not seem to be exactly the same, since (if I understand correctly) BT only sets the carry flag, while the other combination provides the result in a register.
The full sequence is:
btq %rsi, %rdi
sbbq %rax, %rax
andq $1, %rax
popq %rbp
vs:
shrq $25, %rdi
andq $1, %rdi
movq %rdi, %rax
popq %rbp
(I’m not saying that btq is not preferable, just that I don’t see a difference in the number of instructions needed to get the result).
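For reference, here is a minimal C sketch of what both sequences above compute, assuming the goal is to return the tested bit as 0 or 1 in a register. The function names are hypothetical and are not the functions posted earlier in the thread. The second helper shows the flag-style use, where only the truth value matters (the case where BT alone, followed by a branch on the carry flag, would be enough).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not from the thread: returns bit n of x as 0 or 1.
   This is the value both sequences above leave in %rax. */
static uint64_t test_bit(uint64_t x, unsigned n)
{
    assert(n < 64);        /* shifting a 64-bit value by >= 64 is undefined in C */
    return (x >> n) & 1u;  /* shift-and-mask form (the SHRQ/ANDQ pattern) */
}

/* Same test, but only the truth value is needed, as before a branch.
   Here the result never has to be materialized as 0/1 in a register. */
static int bit_is_set(uint64_t x, unsigned n)
{
    assert(n < 64);
    return (x & (UINT64_C(1) << n)) != 0;
}
```

Which instructions a compiler actually selects for either form depends on the target and optimization level, so this is only meant to pin down the semantics being compared.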
I agree with Ahmed that you should probably look into PerformAndCombine() (in X86ISelLowering.cpp).
Best,
Mehdi