[PATCH] [Peephole] Advanced rewriting of copies to avoid cross register banks copies.

Tue Jun 10 09:56:25 PDT 2014

Hi,

The proposed patch extends the peephole optimization introduced in r190713 so that it can rewrite even more cross-register-bank copies.
As it is, the extension may not be that useful, but I thought it would be easier to review than the complete solution (see Motivating Examples and What Is Next?).

Thanks for your feedback.

** Context  **

In r190713 we introduced a peephole optimization that produces register-coalescer friendly copies when possible.
This optimization looks through a chain of copies to find a more suitable source for a cross-register-bank copy.
E.g.,
b = copy A <-- cross-bank copy
…
C = copy b <-- cross-bank copy

Is rewritten into:
b = copy A  <-- cross-bank copy
…
C = copy A <-- same-bank copy 
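
To make the look-through concrete, here is a minimal sketch on a toy model. It is not the code from r190713 and does not use LLVM's MachineInstr API: ToyFunction, CopyDef, Bank and findSameBankSource are all hypothetical names, and registers are plain integers.

```cpp
#include <cassert>
#include <map>

// Toy model only: every type and name here is hypothetical, not LLVM's.
enum class Bank { GPR, FPR };

struct CopyDef {
  int Src; // register read by the copy
};

struct ToyFunction {
  std::map<int, Bank> BankOf;   // register -> register bank
  std::map<int, CopyDef> DefBy; // register -> the copy defining it, if any
};

// Walk up the chain of copies feeding Src and return the first register that
// lives in WantedBank. Returns Src unchanged when no such register is found.
int findSameBankSource(const ToyFunction &F, int Src, Bank WantedBank) {
  int Cur = Src;
  // Bound the walk so a malformed cyclic chain cannot loop forever.
  for (unsigned Steps = 0; Steps <= F.DefBy.size(); ++Steps) {
    if (F.BankOf.at(Cur) == WantedBank)
      return Cur;
    auto It = F.DefBy.find(Cur);
    if (It == F.DefBy.end())
      break; // Cur is not defined by a copy: stop looking
    Cur = It->second.Src;
  }
  return Src;
}
```

With A and C in one bank and b in the other, calling findSameBankSource on the source of "C = copy b" yields A, which corresponds to the rewritten copy above.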

However, there are several instructions that are lowered via cross-bank copies that this optimization fails to optimize.
E.g.
b = insert_subreg e, A, sub0 <-- cross-bank copy
…
C = copy b.sub0 <-- cross-bank copy

Ideally, we would like to produce the following code:
b = insert_subreg e, A, sub0 <-- cross-bank copy
…
C = copy A <-- same-bank copy

** Proposed Patch **

The proposed patch teaches the existing cross-bank copy optimization how to deal with the instructions that generate cross-bank copies, i.e., insert_subreg, extract_subreg, reg_sequence, and subreg_to_reg.
We introduce a new helper class for that: ValueTracker.
This class implements the logic to look through the copy-related instructions and get the related source.
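
As a rough illustration of the idea (not the ValueTracker from the patch), the sketch below follows a (register, sub-register) pair through copy and insert_subreg definitions in a toy model. ToyValueTracker, ToyDef, Value, the integer sub-register indices and the assumption that sub-registers never overlap are all hypothetical simplifications.

```cpp
#include <cassert>
#include <map>

// Toy model only: none of these types are LLVM's; every name is hypothetical.
enum class Bank { GPR, FPR };
enum class Opcode { Copy, InsertSubreg };
// Sub-register indices, assumed disjoint. NoSub means the whole register.
constexpr int NoSub = 0, Sub0 = 1, Sub1 = 2;

// Dst = copy Src.SrcSub                     (Op == Copy)
// Dst = insert_subreg Base, Src, InsertIdx  (Op == InsertSubreg)
struct ToyDef {
  Opcode Op;
  int Src;
  int SrcSub;
  int Base;
  int InsertIdx;
};

struct Value {
  int Reg;
  int Sub;
};

struct ToyValueTracker {
  const std::map<int, ToyDef> &Defs;
  const std::map<int, Bank> &BankOf;

  // Replace V with the value its definition reads. Returns false when the
  // definition is unknown or cannot be looked through.
  bool getNextSource(Value &V) const {
    auto It = Defs.find(V.Reg);
    if (It == Defs.end())
      return false;
    const ToyDef &D = It->second;
    if (D.Op == Opcode::Copy) {
      if (V.Sub != NoSub && D.SrcSub != NoSub)
        return false; // would need sub-register index composition
      V = Value{D.Src, V.Sub != NoSub ? V.Sub : D.SrcSub};
      return true;
    }
    // insert_subreg: the whole register mixes two sources, so give up on it.
    if (V.Sub == NoSub)
      return false;
    if (V.Sub == D.InsertIdx)
      V = Value{D.Src, NoSub}; // the lane we read is the inserted value
    else
      V = Value{D.Base, V.Sub}; // the lane we read comes from the base
    return true;
  }

  // Follow definitions from V until a whole register in Wanted is found.
  bool findSource(Value V, Bank Wanted, int &FoundReg) const {
    for (unsigned Steps = 0; Steps <= Defs.size(); ++Steps) {
      if (!getNextSource(V))
        return false;
      if (V.Sub == NoSub && BankOf.at(V.Reg) == Wanted) {
        FoundReg = V.Reg;
        return true;
      }
    }
    return false;
  }
};
```

On the insert_subreg example from the Context section, starting from b.sub0 (the source of "C = copy b.sub0") and asking for C's bank leads back to A.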

For now, the advanced copy rewriting is disabled by default, as it is not sufficient to solve the motivating examples and I had a hard time coming up with a test case because of that (see Motivating Examples). However, you can give it a try on your favorite platform with -disable-adv-copy-opt=false and if it helps, I would be happy to add a test case!

I have also checked that the introduced refactoring does not change the current codegen through the entire llvm-testsuite + SPECs, when the extension is disabled, for both x86_64 and arm64 with both O3 and Os.

** Motivating Examples **

Let us consider a couple of examples.

* armv7s *

define <2 x i32> @testuvec(<2 x i32> %A, <2 x i32> %B) nounwind {
entry:
  %div = udiv <2 x i32> %A, %B
  ret <2 x i32> %div
}

We would like the following code to be generated on swift (which has a udiv instruction):
// %A is in r0, r1
// %B is in r2, r3
	udiv	r0, r2, r0
	udiv	r1, r3, r1
	bx lr

However, we generate a far more complicated sequence of instructions because we do not recognize that we are moving r0, r1, etc., through d registers:
	vmov	d1, r0, r1
	vmov	d0, r2, r3
	vmov	r1, s2
	vmov	r0, s0
	vmov	r2, s3
	udiv	r0, r1, r0
	vmov	r1, s1
	udiv	r1, r2, r1
	vmov.32	d16[0], r0
	vmov.32	d16[1], r1
	vmov	r0, r1, d16
	bx	lr

* AArch64 *

define i64 @test2(i128 %arg) {
  %vec = bitcast i128 %arg to <2 x i64>
  %scalar = extractelement <2 x i64> %vec, i32 0
  ret i64 %scalar
}

One would expect this code:
// %arg is in x0, x1
// we simply return x0
	ret

However, we generate a less straightforward sequence:
	fmov	d0, x0
	ins.d	v0[1], x1
	fmov	x0, d0
	ret

The proposed patch is not sufficient to catch those cases yet, as they use target-specific instructions to implement the insert_subreg and extract_subreg logic. However, if the lowering were using the generic instructions, this optimization would have helped. See "What Is Next?" for how I plan to tackle that.

** Testcase ?! **

Since the current patch does not yet support the motivating examples, I do not have something reasonably small that exercises the new path. Thus, I have disabled it by default until we have the full support.
Again, if you think that this optimization can help some of the cases you are seeing, give it a try, and propose your test case!

** What Is Next? **

* Teach the optimization about target-specific nodes, so that we can handle the motivating examples.
The idea would be to add new tablegen properties so that we would be able to specify that an instruction is similar to an insert_subreg instruction, etc., the same way we did with bitcast (though a little bit more complicated).
* Enable the optimization by default or provide a target hook to control it.

Thanks,
-Quentin

http://reviews.llvm.org/D4086

Files:
  lib/CodeGen/PeepholeOptimizer.cpp