[PATCH] [Peephole] Advanced rewriting of copies to avoid cross-register-bank copies.
Quentin Colombet
qcolombet at apple.com
Tue Jun 10 09:56:25 PDT 2014
Hi,
The proposed patch extends the peephole optimization introduced in r190713 so that it can rewrite even more cross-register-bank copies.
As it stands, the extension may not be that useful on its own, but I thought it would be easier to review than the complete solution (see Motivating Examples and What Is Next?).
Thanks for your feedback.
** Context **
In r190713 we introduced a peephole optimization that produces register-coalescer-friendly copies when possible.
This optimization looks through a chain of copies to find a more suitable source for a cross-register-bank copy.
E.g., with A and C in one register bank and b in the other,
b = copy A <-- cross-bank copy
…
C = copy b <-- cross-bank copy
is rewritten into:
b = copy A <-- cross-bank copy
…
C = copy A <-- same-bank copy
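To make the look-through concrete, here is a minimal C++ sketch of the idea on a toy representation. It is not the PeepholeOptimizer code: the Bank, VReg, and Function types and the findSameBankSource helper are hypothetical, and the sketch assumes SSA form so that a chain of copies is acyclic.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical toy model, not LLVM's API: each virtual register lives in a
// register bank and may be defined by a plain copy of another register.
enum class Bank { GPR, FPR };

struct VReg {
  Bank RegBank;
  std::string CopySrc; // empty if the register is not defined by a copy
};

using Function = std::map<std::string, VReg>;

// Walk the chain of copies feeding Src and return the first register that
// lives in WantedBank. If none is found, keep Src (the cross-bank copy stays).
// Assumes SSA form, so the chain terminates.
std::string findSameBankSource(const Function &Fn, const std::string &Src,
                               Bank WantedBank) {
  std::string Cur = Src;
  while (true) {
    auto It = Fn.find(Cur);
    assert(It != Fn.end() && "unknown register");
    if (It->second.RegBank == WantedBank)
      return Cur;
    if (It->second.CopySrc.empty())
      return Src; // chain exhausted: no better source
    Cur = It->second.CopySrc;
  }
}
```

With b = copy A modeled as {"b", {Bank::FPR, "A"}}, asking for a GPR source of b returns A, which turns C = copy b into the same-bank C = copy A.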
However, the optimization fails to handle several instructions that are lowered into cross-bank copies.
E.g.,
b = insert_subreg e, A, sub0 <-- cross-bank copy
…
C = copy b.sub0 <-- cross-bank copy
Ideally, we would like to produce the following code:
b = insert_subreg e, A, sub0 <-- cross-bank copy
…
C = copy A <-- same-bank copy
** Proposed Patch **
The proposed patch teaches the existing cross-bank copy optimization how to deal with the instructions that generate cross-bank copies, i.e., insert_subreg, extract_subreg, reg_sequence, and subreg_to_reg.
We introduce a new helper class for that: ValueTracker.
This class implements the logic to look through copy-related instructions and retrieve the related source.
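The following is a hypothetical, simplified model of what such a look-through can do for the four copy-like instructions; it is not the interface of the ValueTracker class in the patch. Values are (register, sub-register index) pairs with index 0 meaning the whole register, sub-register indices are assumed not to overlap, and composing a sub-register of a sub-register is simply given up on. The Def, Opcode, and Bank types and the nextSource/findSameBankSource methods are illustrative names.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy IR, not LLVM's MachineInstr API.
enum class Bank { GPR, FPR };
enum class Opcode { None, Copy, InsertSubreg, ExtractSubreg, RegSequence, SubregToReg };

// A value is a register plus a sub-register index; 0 means the whole register.
using Value = std::pair<std::string, unsigned>;

// Operand layout per opcode:
//   Copy:          Regs = {Src},       Idx = {SrcSub} (0 = whole Src)
//   InsertSubreg:  Regs = {Base, Ins}, Idx = {SubIdx}
//   ExtractSubreg: Regs = {Src},       Idx = {SubIdx}
//   RegSequence:   Regs = {R1..Rn},    Idx = {I1..In}
//   SubregToReg:   Regs = {Src},       Idx = {SubIdx}
struct Def {
  Bank RegBank;
  Opcode Op;
  std::vector<std::string> Regs;
  std::vector<unsigned> Idx;
};

class ValueTracker {
public:
  explicit ValueTracker(std::map<std::string, Def> InDefs)
      : Defs(std::move(InDefs)) {}

  // One look-through step: the value that V is a copy of, if known.
  std::optional<Value> nextSource(const Value &V) const {
    const Def &D = Defs.at(V.first);
    unsigned Sub = V.second;
    switch (D.Op) {
    case Opcode::Copy:
      if (D.Idx[0] == 0) return Value{D.Regs[0], Sub};
      if (Sub == 0) return Value{D.Regs[0], D.Idx[0]};
      return std::nullopt; // sub-register of a sub-register: not modeled
    case Opcode::InsertSubreg:
      if (Sub == D.Idx[0]) return Value{D.Regs[1], 0};
      if (Sub != 0) return Value{D.Regs[0], Sub}; // lane untouched, from Base
      return std::nullopt; // the whole register mixes Base and Ins
    case Opcode::ExtractSubreg:
      if (Sub == 0) return Value{D.Regs[0], D.Idx[0]};
      return std::nullopt;
    case Opcode::RegSequence:
      assert(D.Regs.size() == D.Idx.size() && "one index per register");
      for (std::size_t K = 0; K < D.Regs.size(); ++K)
        if (D.Idx[K] == Sub) return Value{D.Regs[K], 0};
      return std::nullopt;
    case Opcode::SubregToReg:
      if (Sub == D.Idx[0]) return Value{D.Regs[0], 0};
      return std::nullopt; // the remaining bits are not a copy of Src
    case Opcode::None:
      break;
    }
    return std::nullopt; // not a copy-like instruction
  }

  // Follow the chain until reaching a value in WantedBank.
  // Assumes SSA form, so the chain is acyclic.
  std::optional<Value> findSameBankSource(Value V, Bank WantedBank) const {
    while (true) {
      if (Defs.at(V.first).RegBank == WantedBank) return V;
      std::optional<Value> Next = nextSource(V);
      if (!Next) return std::nullopt;
      V = *Next;
    }
  }

private:
  std::map<std::string, Def> Defs;
};
```

With b = insert_subreg e, A, sub0 modeled as a Def, asking for a GPR source of (b, sub0) yields (A, 0), which is exactly the C = copy A rewrite from the Context section; asking for any other lane of b falls through to e and, if e is not in the wanted bank, gives up.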
For now, the advanced copy rewriting is disabled by default because it is not sufficient to solve the motivating examples, and, for the same reason, I had a hard time coming up with a test case (see the Motivating Examples section). However, you can give it a try on your favorite platform with -disable-adv-copy-opt=false, and if it helps, I would be happy to add a test case!
I have also checked that, when the extension is disabled, the refactoring does not change the generated code on the entire llvm-testsuite + SPECs, for both x86_64 and arm64, at both O3 and Os.
** Motivating Examples **
Let us consider a couple of examples.
* armv7s *
define <2 x i32> @testuvec(<2 x i32> %A, <2 x i32> %B) nounwind {
entry:
%div = udiv <2 x i32> %A, %B
ret <2 x i32> %div
}
We would like the following code to be generated on Swift (which has a udiv instruction):
// %A is in r0, r1
// %B is in r2, r3
udiv r0, r0, r2
udiv r1, r1, r3
bx lr
However, we generate a far more complicated sequence of instructions because we do not recognize that we are merely moving r0, r1, etc., through d registers:
vmov d1, r0, r1
vmov d0, r2, r3
vmov r1, s2
vmov r0, s0
vmov r2, s3
udiv r0, r1, r0
vmov r1, s1
udiv r1, r2, r1
vmov.32 d16[0], r0
vmov.32 d16[1], r1
vmov r0, r1, d16
bx lr
* AArch64 *
define i64 @test2(i128 %arg) {
%vec = bitcast i128 %arg to <2 x i64>
%scalar = extractelement <2 x i64> %vec, i32 0
ret i64 %scalar
}
One would expect the following code:
// %arg is in x0, x1
// we simply return x0
ret
However, we generate a less straightforward sequence:
fmov d0, x0
ins.d v0[1], x1
fmov x0, d0
ret
The proposed patch is not sufficient to catch those cases yet, as they use target-specific instructions to implement the insert_subreg and extract_subreg logic. However, if the lowering used the generic instructions, this optimization would have helped. See "What Is Next?" for how I plan to tackle that.
** Testcase ?! **
Since the current patch does not yet support the motivating examples, I do not have something reasonably small that exercises the new path. Thus, I have disabled it by default until we have the full support.
Again, if you think that this optimization can help some of the cases you are seeing, give it a try, and propose your test case!
** What Is Next? **
* Teach the optimization about target-specific nodes, so that we can handle the motivating examples.
The idea would be to add new tablegen properties to specify that an instruction is similar to an insert_subreg instruction, etc., the same way we did with bitcast (though a little more complicated).
* Enable the optimization by default or provide a target hook to control it.
Thanks,
-Quentin
http://reviews.llvm.org/D4086
Files:
lib/CodeGen/PeepholeOptimizer.cpp
-------------- next part --------------
A non-text attachment was scrubbed...
Name: D4086.10285.patch
Type: text/x-patch
Size: 17248 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20140610/16493745/attachment.bin>