[PATCH] D23253: [X86] Generalized transformation of `definstr gr8; movzx gr32, gr8` to `xor gr32, gr32; definstr gr8`
bryant via llvm-commits
llvm-commits at lists.llvm.org
Sun Aug 14 13:37:06 PDT 2016
bryant added a comment.
In https://reviews.llvm.org/D23253#508298, @mkuper wrote:
> Hi bryant,
>
> Thanks a lot for working on this! The improvements looks really nice, and it's
> great that this fixes the pessimizations from https://reviews.llvm.org/rL274692.
>
> I haven't really looked at the patch yet, only skimmed it briefly - so I still
> don't have even the slightest opinion on the logic. So, for now, some very
> general comments/questions:
Please, please review. I've put a great deal of thinking into this.
> - Have you considered adding logic to ExecutionDepsFix or X86FixupBWInsts, as opposed to a new pass? I haven't thought about this too much myself, but those are post-RA passes with somewhat similar goals, and it may make sense to share some of the code.
How do you propose doing this in either of those two passes? And in any case,
the set of heurestics relies on virtual register liveness data that is rewritten
away long before pre-sched/pre-emit passes execute.
> - How generic is this? E.g. does this handle PR28442?
It matches on any GR8-defining instruction, so I'd say...as generic as can be?
And yes, of course it handles PR28442:
bash
19:54:19 ~/3rd/llvm> cat > pr28442.c
int foo(int a, int b, int c) {
return (a > 0 && b > 0 && c > 0);
}
20:15:40 ~/3rd/llvm> clang -o - -S -emit-llvm -O3 pr28442.c | llc -filetype=obj -O3 -o setccfixup.o -setcc-fixup
20:16:07 ~/3rd/llvm> clang -o - -S -emit-llvm -O3 pr28442.c | llc -filetype=obj -O3 -o zextfixup.o
20:16:34 ~/3rd/llvm> diff -u <(python iacapipe.py setccfixup.o) <(python iacapipe.py zextfixup.o)
--- /dev/fd/63 2016-08-14 20:16:48.477885771 +0000
+++ /dev/fd/62 2016-08-14 20:16:48.477885771 +0000
@@ -1,13 +1,13 @@
-# Throughput: 2.65; Uops: 10; Latency: 5; Size: 23
+# Throughput: 2.65; Uops: 9; Latency: 5; Size: 22
foo:
test %edi,%edi
setg %al
test %esi,%esi
setg %cl
and %al,%cl
+ xor %eax,%eax
test %edx,%edx
setg %al
and %cl,%al
- movzbl %al,%eax
retq
> - Do you know why this is failing on cases that we catch with https://reviews.llvm.org/rL274692? If the improvement is large enough we don't have to be strictly better than existing code (https://reviews.llvm.org/rL274692 itself wasn't :) ), but it'd be good to understand what's going on so we can at least guess whether it can happen in "interesting" non-toy cases.
>From my tests on `CodeGen/X86/*.ll`, it only ever fails on x86-32 for two
reasons:
1. The x32 allocation order for CSRs prioritizes ESI and EDI over EBX. Because
this pass never touches unused CSRs, it is possible for a function to alloc
E{AX,CX,DX,SI,DI} but not EBX and thus reducing the pool of available
`GR32_with_sub8bit` by one. This is illustrated by `2008-09-11-CoalescerBug.ll`:
bash
20:22:18 ~/3rd/llvm> diff -u --unified=9999999 <(python iacapipe.py obj/kuper-x86/2008-09-11-CoalescerBug.o) <(python iacapipe.py obj/control-x86/2008-09-11-CoalescerBug.o)
--- /dev/fd/63 2016-08-14 20:23:00.352808087 +0000
+++ /dev/fd/62 2016-08-14 20:23:00.352808087 +0000
@@ -1,35 +1,34 @@
-# Throughput: 9.60; Uops: 43; Latency: 20; Size: 98
+# Throughput: 10.10; Uops: 45; Latency: 19; Size: 98
func_3:
- push %ebx
+ push %edi
push %esi
push %eax
movzwl 0x0,%esi
and $0x1,%esi
movl $0x1,(%esp)
call 15 <func_3+0x15>
xor %ecx,%ecx
cmp $0x2,%eax
setl %cl
- xor %ebx,%ebx
cmp %ecx,%esi
- setge %bl
- xor %eax,%eax
+ setge %al
+ movzbl %al,%esi
cmpw $0x0,0x0
sete %al
- mov %eax,%esi
+ movzbl %al,%edi
movl $0x1,(%esp)
call 3f <func_3+0x3f>
xor %ecx,%ecx
- cmp %eax,%esi
+ cmp %eax,%edi
setge %cl
- sub %ecx,%ebx
- cmp $0x2,%ebx
+ sub %ecx,%esi
+ cmp $0x2,%esi
sbb %eax,%eax
and $0x1,%eax
mov %eax,(%esp)
call 58 <func_3+0x58>
add $0x4,%esp
pop %esi
- pop %ebx
+ pop %edi
ret
2. It's also possible for RA to spill the GR8-defining instruction, thus
preventing this pass from recognizing the pattern. This is seen in
`2009-08-23-SubRegReuseUndo.ll`:
bash
--- annot/kuper-x86/2009-08-23-SubRegReuseUndo.o 2016-08-14 20:27:55.032325645 +0000
+++ annot/control-x86/2009-08-23-SubRegReuseUndo.o 2016-08-14 20:27:07.079072090 +0000
@@ -1,88 +1,88 @@
-# Throughput: 31.00; Uops: 113; Latency: 32; Size: 285
+# Throughput: 32.00; Uops: 116; Latency: 32; Size: 290
uint80:
push %ebp
push %ebx
push %edi
push %esi
sub $0x1c,%esp
movsbl 0x30(%esp),%ecx
- xor %eax,%eax
test %cx,%cx
- setne %al
- mov %eax,%esi
+ setne 0x1b(%esp)
movzwl %cx,%eax
- mov %ecx,%edi
+ mov %ecx,%esi
mov %eax,(%esp)
mov $0x0,%eax
movsbl %al,%eax
mov %eax,0x4(%esp)
call 29 <uint80+0x29>
mov %eax,%ebx
or $0x1,%bl
movl $0x1,(%esp)
call 3c <uint80+0x3c>
movl $0x0,0x4(%esp)
movl $0x0,(%esp)
call 50 <uint80+0x50>
- mov %edi,%ecx
+ mov %esi,%ecx
+ mov %ecx,%edi
xor %cl,%al
xor %bl,%al
mov %eax,%ebp
- mov %esi,0x4(%esp)
+ movzbl 0x1b(%esp),%eax
+ mov %eax,0x4(%esp)
mov $0x0,%eax
movzwl %ax,%eax
mov %eax,(%esp)
call 6c <uint80+0x6c>
mov %eax,%esi
movl $0x1,0x4(%esp)
movl $0x0,(%esp)
call 82 <uint80+0x82>
xor %eax,%eax
test %al,%al
jne .L0
mov $0x1,%eax
xor %ecx,%ecx
test %cl,%cl
jne .L1
.L0:
xor %eax,%eax
.L1:
xor %ebx,%ebx
cmp %eax,%esi
setne %bl
movl $0xfffffffe,(%esp)
call ad <uint80+0xad>
mov %ebp,%eax
movsbl %al,%eax
mov %eax,0xc(%esp)
mov %ebx,0x8(%esp)
movl $0x1,0x4(%esp)
movl $0x0,(%esp)
call ce <uint80+0xce>
xor %eax,%eax
test %al,%al
jne .L2
mov 0x0,%eax
.L2:
mov 0x0,%eax
movl $0x1,0x4(%esp)
movl $0x0,(%esp)
call f2 <uint80+0xf2>
xor %eax,%eax
test %al,%al
je .L3
add $0x1c,%esp
pop %esi
pop %edi
pop %ebx
pop %ebp
ret
.L3:
mov %edi,%eax
movsbl %al,%eax
mov 0x0,%ecx
mov %eax,(%esp)
movl $0x1,0x4(%esp)
call 11b <uint80+0x11b>
sub $0x8,%esp
> - This is a fairly large patch. It's probably fairly self-contained, and I don't immediately see a way to break into meaningful functional pieces. However, that makes it a priori harder to review. So it'd be really nice to make reviewers' life easier by providing copious documentation. :)
Right, I'll sprinkle in some comments.
> - There's a decent amount of generic code that doesn't look like it should live in the pass. If it's generally useful, then it should live in one of the utility headers. If not, and it's only ever instantiated for a single type, then we can get rid of the template extra complexity. (Also, beware of the lack of compiler support. We need to build LLVM with some fairly old or odd compilers - MSVC 2013, first of foremost. If you haven't seen a pattern used in LLVM, it may be because we still want to be built with compilers that don't support it)
.
> - A couple of pervasive differences from the LLVM coding conventions I've noticed from skimming:
.
> - We generally use camelcase for identifiers, with variables starting with an uppercase letter, and functions with lowercase. Common acronyms (like TRI) are all-caps. There are some exceptions ("standard" constructs like "int i", lower cases with underscores where we want to fit STL style, changes to old code that uses a different convention), but those are fairly rare.
> - We don't usually explicitly compare to nullptr.
Fixed.
Repository:
rL LLVM
https://reviews.llvm.org/D23253
More information about the llvm-commits
mailing list