[LLVMdev] Scheduling quirks
Jasper Neumann
jn at sirrida.de
Sat Jan 18 05:13:57 PST 2014
Hello all!
When I compile the following more or less stupid functions with
clang++ -O3 -S test.cpp
===>
int test_register(int x) {
  x ^= (x >> 2);
  x ^= (x >> 3);
  x = x ^ (x >> 4);
  int y = x; x >>= 5; x ^= y; // almost the same but explicit
  return x;
}
int test_scheduler(int x) {
  return ((x>>2) & 15) ^ ((x>>3) & 31);
}
<===
...I get the following result:
===>
.file "test.cpp"
.text
.globl _Z13test_registeri
.align 16, 0x90
.type _Z13test_registeri,@function
_Z13test_registeri: # @_Z13test_registeri
.cfi_startproc
# BB#0: # %entry
movl %edi, %eax
sarl $2, %eax
xorl %edi, %eax
movl %eax, %ecx
sarl $3, %ecx
xorl %eax, %ecx
movl %ecx, %edx
sarl $4, %edx
xorl %ecx, %edx
movl %edx, %eax
sarl $5, %eax
xorl %edx, %eax
retq
.Ltmp0:
.size _Z13test_registeri, .Ltmp0-_Z13test_registeri
.cfi_endproc
.globl _Z14test_scheduleri
.align 16, 0x90
.type _Z14test_scheduleri,@function
_Z14test_scheduleri: # @_Z14test_scheduleri
.cfi_startproc
# BB#0: # %entry
movl %edi, %eax
shrl $2, %eax
andl $15, %eax
shrl $3, %edi
andl $31, %edi
xorl %eax, %edi
movl %edi, %eax
retq
.Ltmp1:
.size _Z14test_scheduleri, .Ltmp1-_Z14test_scheduleri
.cfi_endproc
.ident "clang version 3.5 (trunk 199507)"
.section ".note.GNU-stack","",@progbits
<===
Now once more in detail.
The lines
x ^= (x >> 2);
and
x = x ^ (x >> 4);
and (!)
int y = x; x >>= 5; x ^= y; // almost the same but explicit
are compiled into code like
movl %edi, %eax
sarl $2, %eax
xorl %edi, %eax
As far as I know, the following variant is better on all x86 processors
except the very latest 4th generation Intel Core processors (2 instead of
3 cycles; I verified this on e.g. an Intel i7 920), because the first two
instructions do not depend on each other and can be executed simultaneously:
movl %edi, %eax
sarl $2, %edi # modify source instead of copy
xorl %edi, %eax
Is there a special reason why the copy is shifted instead of the source?
Interestingly, most compilers, including ICC and GCC, show this strange
behavior. I reported this in an Intel forum as well as to GCC a long time
ago, but there has been no real reaction...
Also, why are 4 registers used whereas 2 are sufficient?
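A minimal sketch of the two-register idea, with the registers modelled as plain C++ variables. The names test_register_two_regs and check_test_register are made up for illustration, and the instruction sequence in the comments is only an assumption about how the "modify source instead of copy" scheme might extend to the whole function; it is not compiler output and says nothing about timing.

```cpp
#include <cassert>
#include <climits>

// Original function from the post.
// Note: >> on a negative int is implementation-defined before C++20;
// mainstream x86 compilers use an arithmetic shift (sarl).
int test_register(int x) {
    x ^= (x >> 2);
    x ^= (x >> 3);
    x = x ^ (x >> 4);
    int y = x; x >>= 5; x ^= y;
    return x;
}

// Hypothetical schedule using only two "registers", eax and edi.
// In every step the copy and the shift are independent of each other;
// only the final xor needs both results.
int test_register_two_regs(int edi) {
    int eax;
    eax = edi; edi >>= 2; eax ^= edi;  // movl %edi,%eax; sarl $2,%edi; xorl %edi,%eax
    edi = eax; eax >>= 3; eax ^= edi;  // movl %eax,%edi; sarl $3,%eax; xorl %edi,%eax
    edi = eax; eax >>= 4; eax ^= edi;  // movl %eax,%edi; sarl $4,%eax; xorl %edi,%eax
    edi = eax; eax >>= 5; eax ^= edi;  // movl %eax,%edi; sarl $5,%eax; xorl %edi,%eax
    return eax;                        // result is already in %eax
}

// Compare both versions on a few sample inputs.
void check_test_register() {
    const int samples[] = {0, 1, -1, 12345, -98765, INT_MAX, INT_MIN};
    for (int x : samples)
        assert(test_register_two_regs(x) == test_register(x));
}
```

Calling check_test_register() only shows that the two-register sequence computes the same value; whether it really saves a cycle per step would have to be measured on the target processor.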
In the second function the line
return ((x>>2) & 15) ^ ((x>>3) & 31);
is compiled into
movl %edi, %eax
shrl $2, %eax
andl $15, %eax
shrl $3, %edi
andl $31, %edi
xorl %eax, %edi
movl %edi, %eax
I would have expected the scheduler to interleave the subexpressions
and to get rid of the final move, like this:
movl %edi, %eax
shrl $3, %edi # modify source instead of copy, see above
shrl $2, %eax
andl $31, %edi
andl $15, %eax
xorl %edi, %eax # we need %eax here
I think this is independent of the high-level language used.
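The interleaved sequence can be checked against the source expression with a small C++ model. This is only a sketch: the name test_scheduler_interleaved and the helper check_test_scheduler are hypothetical, and the unsigned variables eax and edi merely stand in for the registers so that >> behaves like shrl.

```cpp
#include <cassert>
#include <climits>
#include <cstdint>

// Original function from the post.
int test_scheduler(int x) {
    return ((x >> 2) & 15) ^ ((x >> 3) & 31);
}

// Model of the proposed interleaved sequence. Unsigned values are used so
// that >> is a logical shift like shrl; the masks discard the upper bits,
// so this matches the arithmetic shift of the source expression.
int test_scheduler_interleaved(int x) {
    std::uint32_t edi = static_cast<std::uint32_t>(x);
    std::uint32_t eax = edi;  // movl %edi, %eax
    edi >>= 3;                // shrl $3, %edi
    eax >>= 2;                // shrl $2, %eax
    edi &= 31;                // andl $31, %edi
    eax &= 15;                // andl $15, %eax
    eax ^= edi;               // xorl %edi, %eax
    return static_cast<int>(eax);  // result is in %eax, no final move
}

// Compare both versions on a few sample inputs.
void check_test_scheduler() {
    const int samples[] = {0, 1, -1, 12345, -98765, INT_MAX, INT_MIN};
    for (int x : samples)
        assert(test_scheduler_interleaved(x) == test_scheduler(x));
}
```

The model confirms only that the reordered instructions compute the same result with one move fewer; it does not show how a real scheduler would order them.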
Is this known to the LLVM community?
May I help to correct this?
Best regards
Jasper