[LLVMdev] X86TargetLowering::LowerToBT

Mon Jan 19 10:53:12 PST 2015

> On Jan 19, 2015, at 10:29 AM, Chris Sears <chris.sears at gmail.com> wrote:
> 
> Looking at the Intel Optimization Reference Manual, page 14-14, for Atom
> 
>     BT m16, imm8, BT mem, imm8    latency 2, 1, throughput 1
>     BT m16, r16, BT mem, reg      latency 10, 9, throughput 8
>     BT reg, imm8, BT reg, reg     latency 1, throughput 1
> 
> On C-26 they lower that throughput to 0.5 clock cycle for BT reg, imm8.
> 
> The posted functions were simplified for tracking down the code generation problem. In general, the comparison between using BTQ reg,imm vs SHRQ/ANDQ for bit testing is even worse because you have to MOVE the tested reg to a temporary before the SHRQ/ANDQ. And all of these instructions require a REX prefix (well, not the AND). The result is some code bloat (3 instructions vs 1) and a little register pressure.

I’m not an X86 expert, but I’d still like to understand why you are comparing 1 instruction to 3. The results do not seem to be exactly the same: if I understand correctly, BT only sets the carry flag, while the other combination provides the result in a register.

The full sequence is:

	btq	%rsi, %rdi
	sbbq	%rax, %rax
	andq	$1, %rax
	popq	%rbp

vs:

	shrq	$25, %rdi
	andq	$1, %rdi
	movq	%rdi, %rax
	popq	%rbp

(I’m not saying that btq is not preferable, just that I don’t see a difference in the number of instructions needed to get the result).
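
To make the comparison concrete, here is a minimal C sketch of what each sequence computes. It is only a model of the instruction semantics as I understand them, not code taken from LLVM: the function names are hypothetical, the carry flag is represented by an ordinary variable, and the bit index is reduced mod 64 as BT does for a register operand.

```c
#include <assert.h>
#include <stdint.h>

/* Models shrq/andq: shift the tested bit down to bit 0, then mask it. */
static uint64_t bit_test_shift(uint64_t x, unsigned n) {
    return (x >> (n & 63)) & 1;
}

/* Models btq/sbbq/andq one instruction at a time (hypothetical name). */
static uint64_t bit_test_bt_sbb(uint64_t x, unsigned n) {
    uint64_t cf = (x >> (n & 63)) & 1; /* btq: CF = selected bit */
    uint64_t rax = 0 - cf;             /* sbbq %rax, %rax: rax - rax - CF, so 0 or all ones */
    return rax & 1;                    /* andq $1, %rax: reduce to 0 or 1 */
}

/* Hypothetical helper: asserts that both models agree for every bit index of x. */
static void check_same_result(uint64_t x) {
    for (unsigned n = 0; n < 64; ++n)
        assert(bit_test_shift(x, n) == bit_test_bt_sbb(x, n));
}
```

Both functions return 0 or 1 for the selected bit. The BT path still needs two further instructions to move the carry flag into a register, which is why I count three instructions on each side.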

I agree with Ahmed that you should probably look into PerformAndCombine() (X86ISelLowering.cpp).

Best,

Mehdi