[LLVMdev] X86TargetLowering::LowerToBT

Stephen Canon scanon at apple.com
Fri Jan 23 09:07:23 PST 2015


Right, so the xor breaks the false dependency on the previous flags state.  Compare to what we get from clang with a variable mask:

        bt	%esi, %edi
        sbb	%eax, %eax		// HAZARD: false dependency on the flags state prior to BT.
        and	$1, %eax

If we instead generated:

	xor	%eax, %eax
	bt	%esi, %edi
	adc	%eax, %eax

We’d mostly avoid the partial-flags hazard, though we’d still get one extra µop generated.  Targeting Haswell, I’d probably rather see:

	shrx	%esi, %edi, %eax
	and	$1, %eax

but a reasonable case can be made for the bt sequence under -Oz.
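
To make it easier to see that the three sequences above leave the same value in %eax, here is a minimal C sketch that models each one. The function names and the explicit carry-flag variable are illustrative assumptions, not anything clang emits. The model relies only on two documented behaviours: BT with a register operand takes the bit index modulo the operand size and copies the selected bit into CF, and SHRX masks its count to 5 bits for 32-bit operands.

```c
#include <assert.h>
#include <stdint.h>

/* Model of `bt %esi, %edi`: CF = bit (index mod 32) of val. */
static uint32_t model_bt_cf(uint32_t val, uint32_t index) {
    return (val >> (index & 31u)) & 1u;
}

/* bt %esi,%edi ; sbb %eax,%eax ; and $1,%eax */
static uint32_t model_bt_sbb_and(uint32_t val, uint32_t index) {
    uint32_t cf = model_bt_cf(val, index);
    uint32_t eax = 0u - cf;      /* sbb: eax - eax - CF, i.e. 0 or 0xFFFFFFFF */
    return eax & 1u;             /* and: keep only bit 0 */
}

/* xor %eax,%eax ; bt %esi,%edi ; adc %eax,%eax */
static uint32_t model_xor_bt_adc(uint32_t val, uint32_t index) {
    uint32_t eax = 0u;           /* xor: zero the register */
    uint32_t cf = model_bt_cf(val, index);
    return eax + eax + cf;       /* adc: 0 + 0 + CF */
}

/* shrx %esi,%edi,%eax ; and $1,%eax */
static uint32_t model_shrx_and(uint32_t val, uint32_t index) {
    uint32_t eax = val >> (index & 31u);  /* shrx: count masked to 5 bits */
    return eax & 1u;
}

/* Self-check: all three models must agree for a given input. */
static void check_models_agree(uint32_t val, uint32_t index) {
    uint32_t expected = model_bt_sbb_and(val, index);
    assert(model_xor_bt_adc(val, index) == expected);
    assert(model_shrx_and(val, index) == expected);
}
```

This only models the results, not the flags or µop behaviour that the performance argument is about. Note also that in the C source, shifting by an index of 32 or more is undefined, so a compiler is free to pick any of these sequences.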

As I understand it though, this whole discussion is actually about the constant mask case, for which clang already generates reasonable code.
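
For reference, a sketch of the constant-mask case, reusing bit position 25 from the quoted example below (the function and macro names are hypothetical). Because the mask is a compile-time constant, the test can use an immediate operand, so neither a variable shift nor BT is required. The second function is the and-then-shift form that matches the quoted icc output.

```c
#include <assert.h>
#include <stdint.h>

/* Constant mask for bit 25. */
#define BIT25_MASK (UINT32_C(1) << 25)

/* The immediate in the quoted icc output is this mask written in decimal. */
static_assert(BIT25_MASK == 33554432u, "bit 25 mask should be 33554432");

/* Mask-and-compare form, as written in the quoted C source. */
static uint32_t is_bit25_set_mask(uint32_t val) {
    return (val & BIT25_MASK) != 0u;
}

/* And-then-shift form: isolate bit 25, then move it down to bit 0. */
static uint32_t is_bit25_set_shift(uint32_t val) {
    return (val & BIT25_MASK) >> 25;
}
```

Both functions return 0 or 1 for every input; which instruction sequence a given compiler chooses for them is not something this sketch can show.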

– Steve

> On Jan 23, 2015, at 11:57 AM, Sanjay Patel <spatel at rotateright.com> wrote:
> 
> Full icc code sequence (for the 32-bit case):
>         xorl      %eax, %eax
>         movl      $1, %edx
>         btl       %esi, %edi
>         cmovc     %edx, %eax
>         ret       
> 
> Chris's code example is actually returning the result, so no 'test' or 'bt' in the constant mask case:
> 
> unsigned int IsBitSetA_32(unsigned int val) { return (val & (1U<<25)) != 0U; }
> 
>         andl      $33554432, %edi
>         shrl      $25, %edi
>         movl      %edi, %eax
>         ret       
> 
> 
> 
> 
> On Fri, Jan 23, 2015 at 9:45 AM, Stephen Canon <scanon at apple.com> wrote:
> I suspect that this is because the mask in your example is the result of a variable shift, which (a) has its own performance and flags hazards pre-SHLX and (b) requires additional µops to do with TEST.  I expect that ICC is putting a dummy TEST or XOR ahead of the BT to break the false flags dependency, as well.
> 
> If the mask were constant, I expect ICC would generate TEST instead (but I don’t have it handy to check).
> 
> – Steve
> 
>> On Jan 23, 2015, at 11:32 AM, Sanjay Patel <spatel at rotateright.com> wrote:
>> 
>> If 'bt' is a perf sin, icc doesn't seem to know it:
>> 
>> $ icc -v 
>> icc version 15.0.1 (gcc version 4.9.0 compatibility)
>> 
>> $ cat bt.c
>> unsigned long long IsBitSetB_64(unsigned long long val, int index) { return (val & (1ULL<<index)) != 0ULL; } 
>> unsigned int IsBitSetB_32(unsigned int val, int index) { return (val & (1U<<index)) != 0U; } 
>> 
>> $ icc -O3 -S bt.c -o - | grep bt
>>     .file "bt.c"
>>         btq       %rsi, %rdi
>>         btl       %esi, %edi
>> 
>> Does anyone at Intel have guidance for us?
>> 
>> 
>> On Thu, Jan 22, 2015 at 4:34 PM, Eric Christopher <echristo at gmail.com> wrote:
>> 
>> 
>> On Thu Jan 22 2015 at 3:32:53 PM Chris Sears <chris.sears at gmail.com> wrote:
>> The status quo is:
>> 
>> a) 40b REX+BT instruction for the 64b case
>> b) 48b TEST for the 32b case
>> c) unless it's small TEST
>> 
>> You are currently paying a 16b penalty for TEST vs BT in the 32b case.
>> It may be worth checking for the -Os flag here.
>> 
>> You'll want -Oz here; -Os isn't supposed to affect the runtime as much as this is going to.
>> 
>> -eric 
>> 
>> _______________________________________________
>> LLVM Developers mailing list
>> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>> 
> 
> 




