[PATCH] D23253: [X86] Generalized transformation of `definstr gr8; movzx gr32, gr8` to `xor gr32, gr32; definstr gr8`

Sun Aug 14 13:37:06 PDT 2016

bryant added a comment.

In https://reviews.llvm.org/D23253#508298, @mkuper wrote:

> Hi bryant,
>
> Thanks a lot for working on this! The improvements looks really nice, and it's
>  great that this fixes the pessimizations from https://reviews.llvm.org/rL274692.
>
> I haven't really looked at the patch yet, only skimmed it briefly - so I still
>  don't have even the slightest opinion on the logic.  So, for now, some very
>  general comments/questions:

Please, please review. I've put a great deal of thinking into this.

> - Have you considered adding logic to ExecutionDepsFix or X86FixupBWInsts, as opposed to a new pass? I haven't thought about this too much myself, but those are post-RA passes with somewhat similar goals, and it may make sense to share some of the code.

How do you propose doing this in either of those two passes? And in any case,
the set of heurestics relies on virtual register liveness data that is rewritten
away long before pre-sched/pre-emit passes execute.

> - How generic is this? E.g. does this handle PR28442?

It matches on any GR8-defining instruction, so I'd say...as generic as can be?
And yes, of course it handles PR28442:

  bash
  19:54:19 ~/3rd/llvm> cat > pr28442.c
  int foo(int a, int b, int c) {
    return (a > 0 && b > 0 && c > 0);
  }
  20:15:40 ~/3rd/llvm> clang -o - -S -emit-llvm -O3 pr28442.c | llc -filetype=obj -O3 -o setccfixup.o -setcc-fixup
  20:16:07 ~/3rd/llvm> clang -o - -S -emit-llvm -O3 pr28442.c | llc -filetype=obj -O3 -o zextfixup.o
  20:16:34 ~/3rd/llvm> diff -u <(python iacapipe.py setccfixup.o) <(python iacapipe.py zextfixup.o)
  --- /dev/fd/63  2016-08-14 20:16:48.477885771 +0000
  +++ /dev/fd/62  2016-08-14 20:16:48.477885771 +0000
  @@ -1,13 +1,13 @@
  -# Throughput: 2.65; Uops: 10; Latency: 5; Size: 23
  +# Throughput: 2.65; Uops: 9; Latency: 5; Size: 22
   foo:
       test   %edi,%edi
       setg   %al
       test   %esi,%esi
       setg   %cl
       and    %al,%cl
  +    xor    %eax,%eax
       test   %edx,%edx
       setg   %al
       and    %cl,%al
  -    movzbl %al,%eax
       retq

> - Do you know why this is failing on cases that we catch with https://reviews.llvm.org/rL274692? If the improvement is large enough we don't have to be strictly better than existing code (https://reviews.llvm.org/rL274692 itself wasn't :) ), but it'd be good to understand what's going on so we can at least guess whether it can happen in "interesting" non-toy cases.

>From my tests on `CodeGen/X86/*.ll`, it only ever fails on x86-32 for two
reasons:

1. The x32 allocation order for CSRs prioritizes ESI and EDI over EBX. Because

this pass never touches unused CSRs, it is possible for a function to alloc
E{AX,CX,DX,SI,DI} but not EBX and thus reducing the pool of available
`GR32_with_sub8bit` by one. This is illustrated by `2008-09-11-CoalescerBug.ll`:

  bash
  20:22:18 ~/3rd/llvm> diff -u --unified=9999999 <(python iacapipe.py obj/kuper-x86/2008-09-11-CoalescerBug.o) <(python iacapipe.py obj/control-x86/2008-09-11-CoalescerBug.o)
  --- /dev/fd/63  2016-08-14 20:23:00.352808087 +0000
  +++ /dev/fd/62  2016-08-14 20:23:00.352808087 +0000
  @@ -1,35 +1,34 @@
  -# Throughput: 9.60; Uops: 43; Latency: 20; Size: 98
  +# Throughput: 10.10; Uops: 45; Latency: 19; Size: 98
   func_3:
  -    push   %ebx
  +    push   %edi
       push   %esi
       push   %eax
       movzwl 0x0,%esi
       and    $0x1,%esi
       movl   $0x1,(%esp)
       call   15 <func_3+0x15>
       xor    %ecx,%ecx
       cmp    $0x2,%eax
       setl   %cl
  -    xor    %ebx,%ebx
       cmp    %ecx,%esi
  -    setge  %bl
  -    xor    %eax,%eax
  +    setge  %al
  +    movzbl %al,%esi
       cmpw   $0x0,0x0
       sete   %al
  -    mov    %eax,%esi
  +    movzbl %al,%edi
       movl   $0x1,(%esp)
       call   3f <func_3+0x3f>
       xor    %ecx,%ecx
  -    cmp    %eax,%esi
  +    cmp    %eax,%edi
       setge  %cl
  -    sub    %ecx,%ebx
  -    cmp    $0x2,%ebx
  +    sub    %ecx,%esi
  +    cmp    $0x2,%esi
       sbb    %eax,%eax
       and    $0x1,%eax
       mov    %eax,(%esp)
       call   58 <func_3+0x58>
       add    $0x4,%esp
       pop    %esi
  -    pop    %ebx
  +    pop    %edi
       ret

2. It's also possible for RA to spill the GR8-defining instruction, thus

preventing this pass from recognizing the pattern. This is seen in
`2009-08-23-SubRegReuseUndo.ll`:

  bash
  --- annot/kuper-x86/2009-08-23-SubRegReuseUndo.o	2016-08-14 20:27:55.032325645 +0000
  +++ annot/control-x86/2009-08-23-SubRegReuseUndo.o	2016-08-14 20:27:07.079072090 +0000
  @@ -1,88 +1,88 @@
  -# Throughput: 31.00; Uops: 113; Latency: 32; Size: 285
  +# Throughput: 32.00; Uops: 116; Latency: 32; Size: 290
   uint80:
       push   %ebp
       push   %ebx
       push   %edi
       push   %esi
       sub    $0x1c,%esp
       movsbl 0x30(%esp),%ecx
  -    xor    %eax,%eax
       test   %cx,%cx
  -    setne  %al
  -    mov    %eax,%esi
  +    setne  0x1b(%esp)
       movzwl %cx,%eax
  -    mov    %ecx,%edi
  +    mov    %ecx,%esi
       mov    %eax,(%esp)
       mov    $0x0,%eax
       movsbl %al,%eax
       mov    %eax,0x4(%esp)
       call   29 <uint80+0x29>
       mov    %eax,%ebx
       or     $0x1,%bl
       movl   $0x1,(%esp)
       call   3c <uint80+0x3c>
       movl   $0x0,0x4(%esp)
       movl   $0x0,(%esp)
       call   50 <uint80+0x50>
  -    mov    %edi,%ecx
  +    mov    %esi,%ecx
  +    mov    %ecx,%edi
       xor    %cl,%al
       xor    %bl,%al
       mov    %eax,%ebp
  -    mov    %esi,0x4(%esp)
  +    movzbl 0x1b(%esp),%eax
  +    mov    %eax,0x4(%esp)
       mov    $0x0,%eax
       movzwl %ax,%eax
       mov    %eax,(%esp)
       call   6c <uint80+0x6c>
       mov    %eax,%esi
       movl   $0x1,0x4(%esp)
       movl   $0x0,(%esp)
       call   82 <uint80+0x82>
       xor    %eax,%eax
       test   %al,%al
       jne .L0
       mov    $0x1,%eax
       xor    %ecx,%ecx
       test   %cl,%cl
       jne .L1
   .L0:
       xor    %eax,%eax
   .L1:
       xor    %ebx,%ebx
       cmp    %eax,%esi
       setne  %bl
       movl   $0xfffffffe,(%esp)
       call   ad <uint80+0xad>
       mov    %ebp,%eax
       movsbl %al,%eax
       mov    %eax,0xc(%esp)
       mov    %ebx,0x8(%esp)
       movl   $0x1,0x4(%esp)
       movl   $0x0,(%esp)
       call   ce <uint80+0xce>
       xor    %eax,%eax
       test   %al,%al
       jne .L2
       mov    0x0,%eax
   .L2:
       mov    0x0,%eax
       movl   $0x1,0x4(%esp)
       movl   $0x0,(%esp)
       call   f2 <uint80+0xf2>
       xor    %eax,%eax
       test   %al,%al
       je .L3
       add    $0x1c,%esp
       pop    %esi
       pop    %edi
       pop    %ebx
       pop    %ebp
       ret
   .L3:
       mov    %edi,%eax
       movsbl %al,%eax
       mov    0x0,%ecx
       mov    %eax,(%esp)
       movl   $0x1,0x4(%esp)
       call   11b <uint80+0x11b>
       sub    $0x8,%esp

> - This is a fairly large patch. It's probably fairly self-contained, and I don't immediately see a way to break into meaningful functional pieces. However, that makes it a priori harder to review. So it'd be really nice to make reviewers' life easier by providing copious documentation. :)

Right, I'll sprinkle in some comments.

> - There's a decent amount of generic code that doesn't look like it should live in the pass. If it's generally useful, then it should live in one of the utility headers. If not, and it's only ever instantiated for a single type, then we can get rid of the template extra complexity. (Also, beware of the lack of compiler support. We need to build LLVM with some fairly old or odd compilers - MSVC 2013, first of foremost. If you haven't seen a pattern used in LLVM, it may be because we still want to be built with compilers that don't support it)

.

> - A couple of pervasive differences from the LLVM coding conventions I've noticed from skimming:

.

> - We generally use camelcase for identifiers, with variables starting with an uppercase letter, and functions with lowercase. Common acronyms (like TRI) are all-caps. There are some exceptions ("standard" constructs like "int i", lower cases with underscores where we want to fit STL style, changes to old code that uses a different convention), but those are fairly rare.

> - We don't usually explicitly compare to nullptr.

Fixed.

Repository:
  rL LLVM

https://reviews.llvm.org/D23253