[LLVMdev] Scheduling quirks

Sat Jan 18 05:13:57 PST 2014

Hello all!

When I compile the following more or less stupid functions with
   clang++ -O3 -S test.cpp
===>
int test_register(int x) {
   x ^= (x >> 2);
   x ^= (x >> 3);
   x = x ^ (x >> 4);
   int y = x;  x >>= 5;  x ^= y;  // almost the same but explicit
   return x;
   }

int test_scheduler(int x) {
   return ((x>>2) & 15) ^ ((x>>3) & 31);
   }
<===
...I get the following result:
===>
	.file	"test.cpp"
	.text
	.globl	_Z13test_registeri
	.align	16, 0x90
	.type	_Z13test_registeri,@function
_Z13test_registeri:                     # @_Z13test_registeri
	.cfi_startproc
# BB#0:                                 # %entry
	movl	%edi, %eax
	sarl	$2, %eax
	xorl	%edi, %eax
	movl	%eax, %ecx
	sarl	$3, %ecx
	xorl	%eax, %ecx
	movl	%ecx, %edx
	sarl	$4, %edx
	xorl	%ecx, %edx
	movl	%edx, %eax
	sarl	$5, %eax
	xorl	%edx, %eax
	retq
.Ltmp0:
	.size	_Z13test_registeri, .Ltmp0-_Z13test_registeri
	.cfi_endproc

	.globl	_Z14test_scheduleri
	.align	16, 0x90
	.type	_Z14test_scheduleri,@function
_Z14test_scheduleri:                    # @_Z14test_scheduleri
	.cfi_startproc
# BB#0:                                 # %entry
	movl	%edi, %eax
	shrl	$2, %eax
	andl	$15, %eax
	shrl	$3, %edi
	andl	$31, %edi
	xorl	%eax, %edi
	movl	%edi, %eax
	retq
.Ltmp1:
	.size	_Z14test_scheduleri, .Ltmp1-_Z14test_scheduleri
	.cfi_endproc

	.ident	"clang version 3.5 (trunk 199507)"
	.section	".note.GNU-stack","",@progbits
<===

Now once more in detail.

The lines
   x ^= (x >> 2);
and
   x = x ^ (x >> 4);
and (!)
   int y = x;  x >>= 5;  x ^= y;  // almost the same but explicit
are compiled into code like
	movl	%edi, %eax
	sarl	$2, %eax
	xorl	%edi, %eax
As far as I know, on all x86 processors except the very latest 
4th-generation Intel Core, the following variant is better (2 cycles 
instead of 3; I verified this on e.g. an Intel i7 920) because the first 
two instructions can be executed simultaneously:
	movl	%edi, %eax
	sarl	$2, %edi    # modify source instead of copy
	xorl	%edi, %eax
Is there a special reason for generating the code this way?
Interestingly, most compilers, including ICC and GCC, show this strange 
behavior. I reported this in an Intel forum as well as to GCC a long 
time ago, but there has been no real reaction...
Also, why are 4 registers used when 2 would be sufficient?
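
To make the comparison concrete, here is a minimal C++ sketch that models
both instruction orders as plain updates of two variables named after the
registers. All function names are hypothetical, and the sketch assumes that
>> on a negative int32_t is an arithmetic shift like sarl (true for GCC and
Clang, guaranteed from C++20 on). It only checks that the results agree; it
says nothing about cycle counts.

```cpp
#include <cstdint>

// Emitted order: copy, then shift the copy, then xor the source in.
inline int32_t step_copy_shifted(int32_t edi, int k) {
    int32_t eax = edi;   // movl %edi, %eax
    eax >>= k;           // sarl $k, %eax   (has to wait for the movl)
    eax ^= edi;          // xorl %edi, %eax
    return eax;
}

// Proposed order: copy, then shift the source, then xor it into the copy.
inline int32_t step_source_shifted(int32_t edi, int k) {
    int32_t eax = edi;   // movl %edi, %eax
    edi >>= k;           // sarl $k, %edi   (does not read %eax)
    eax ^= edi;          // xorl %edi, %eax
    return eax;
}

// All four steps of test_register modeled with only two registers.
// The last step xors into %eax so the result ends up in the return register.
inline int32_t test_register_two_regs(int32_t edi) {
    int32_t eax;
    eax = edi; edi >>= 2; edi ^= eax;  // result in %edi
    eax = edi; edi >>= 3; edi ^= eax;  // result in %edi
    eax = edi; edi >>= 4; edi ^= eax;  // result in %edi
    eax = edi; edi >>= 5; eax ^= edi;  // result in %eax
    return eax;
}
```

Both step functions compute x ^ (x >> k); the only difference is which
register the shift writes to.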

In the second function the line
   return ((x>>2) & 15) ^ ((x>>3) & 31);
is compiled into
	movl	%edi, %eax
	shrl	$2, %eax
	andl	$15, %eax
	shrl	$3, %edi
	andl	$31, %edi
	xorl	%eax, %edi
	movl	%edi, %eax
I would have expected the scheduler to interleave the subexpressions 
and to get rid of the final move, like this:
	movl	%edi, %eax
	shrl	$3, %edi    # modify source instead of copy, see above
	shrl	$2, %eax
	andl	$31, %edi
	andl	$15, %eax
	xorl	%edi, %eax    # we need %eax here
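
As a sanity check that the interleaved sequence computes the same value,
here is a small C++ sketch that models both sequences on unsigned 32-bit
variables (shrl is a logical shift; the masks discard every bit that could
differ from the arithmetic shift in the source). The function names are
hypothetical and the model ignores timing entirely.

```cpp
#include <cstdint>

// Sequence as emitted: one subexpression after the other, plus a final move.
inline uint32_t scheduler_emitted(uint32_t edi) {
    uint32_t eax = edi;  // movl %edi, %eax
    eax >>= 2;           // shrl $2, %eax
    eax &= 15;           // andl $15, %eax
    edi >>= 3;           // shrl $3, %edi
    edi &= 31;           // andl $31, %edi
    edi ^= eax;          // xorl %eax, %edi
    eax = edi;           // movl %edi, %eax
    return eax;
}

// Proposed sequence: subexpressions interleaved, result produced in %eax.
inline uint32_t scheduler_interleaved(uint32_t edi) {
    uint32_t eax = edi;  // movl %edi, %eax
    edi >>= 3;           // shrl $3, %edi
    eax >>= 2;           // shrl $2, %eax
    edi &= 31;           // andl $31, %edi
    eax &= 15;           // andl $15, %eax
    eax ^= edi;          // xorl %edi, %eax
    return eax;
}
```

The interleaved version has six modeled instructions instead of seven
because the xor already writes to the return register.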

I think this is independent of the high-level language used.
Is this known to the LLVM community?
May I help to correct this?

Best regards
Jasper