[LLVMdev] Scheduling quirks

Wed Jan 22 10:36:16 PST 2014

On Jan 22, 2014, at 10:03 AM, Jasper Neumann <jn at sirrida.de> wrote:

> Hello Andrew!
> 
> Referring:
> int test_scheduler(int x) {
> return ((x>>2) & 15) ^ ((x>>3) & 31);
> }
> =>
> 	movl	%edi, %eax
> 	shrl	$2, %eax
> 	andl	$15, %eax
> 	shrl	$3, %edi
> 	andl	$31, %edi
> 	xorl	%eax, %edi
> 	movl	%edi, %eax
> instead of
> 	movl	%edi, %eax
> 	shrl	$3, %edi    # modify source instead of copy
> 	shrl	$2, %eax    # modifications interlaced
> 	andl	$31, %edi
> 	andl	$15, %eax
> 	xorl	%edi, %eax    # we need %eax here
> 
> 
> > In your example, we're copying from/to argument/return registers,
> > hence the extra copies.
> 
> This might be the reason for the final move.
> However a simple peephole optimizer could catch this case.
> 
> 
> > [...] Do you still see the extra copies on LLVM trunk?
> 
> I retested this with a fresh checkout and compile of trunk (svn 199769) and got the same result.
> 
> BTW: Providing e.g. -march=atom helps a bit by sporting the expected interleaving for this example but IMHO this ought to be the case even without explicitely specifying a processor type.
> 
> 
> The missed optimization regarding the modification of a copy occurs quite often and costs a cycle on almost all x86 processors.

I have seen many cases where we could eliminate copies. However, I haven’t thought of a very simple fix that would handle the general problem and haven’t been motivated to add complexity to handle specific cases. Keep in mind that physreg copies as in your case are handled totally different from vreg copies. Reducing instructions is always good. However a register copy is a small instruction and I *think* it is usually zero cycles on the last few generations from Intel.

Anyway, feel free to file a bug and offer suggested fixes.

If I wanted to solve this particular case, the first thing I would do is figure out why the register allocator isn’t biasing the xorl def to %eax. I think it’s because local allocation order is top-down. It also doesn’t realize that biasing the incoming arguments is useless because of the interference so may back itself into a corner.

It’s possible that we could add copy-eliminating peepholes after regalloc. I don’t know that it is a good idea and we would have to show pretty compelling performance results to add the complexity.

-Andy