[PATCH] D23253: [X86] Generalized transformation of `definstr gr8; movzx gr32, gr8` to `xor gr32, gr32; definstr gr8`

bryant via llvm-commits llvm-commits at lists.llvm.org
Mon Aug 22 18:57:02 PDT 2016


bryant added a comment.

In https://reviews.llvm.org/D23253#522637, @DavidKreitzer wrote:

> Hi bryant,
>
> I haven't found time to review this patch in detail yet, but here are some initial comments/questions.
>
> Given some of your sample generated-code changes, I expected to see lots of changes in tests/CodeGen/X86. Are you planning to add those changes to this patch later?


Yes, definitely. I will include the updates to `test/*.ll` in my next diff.

> Your output code examples have a few instances of this:
>
>   +    xorl    %r10d, %r10d
>   +    movb    2(%rdi), %r10b
>
> Rather than insert an xor here, you'd prefer to convert the movb to movzbl.


Just to clarify: the optimal result is `movzbl 2(%rdi), %r10d`, not
`movb 2(%rdi), %al; movzbl %al, %r10d` (the status quo), nor
`xorl %r10d, %r10d; movb 2(%rdi), %r10b` (what this patch does). Correct?

> Converting movb to movzbl (and movw to movzwl) is essentially what FixupBWInstPass does. The author of that pass was deliberately aggressive about converting movw to movzwl but a bit more conservative about converting movb to movzbl. Here are the relevant comments:
>
>   // Only replace 8 bit loads with the zero extending versions if
>   // in an inner most loop and not optimizing for size. This takes
>   // an extra byte to encode, and provides limited performance upside.

I find this comment strange. The extra byte is presumably just the 0F escape
prefix that movzbl carries over a plain movb (see the encodings below). The
stats on x86-64 are:

  21:38:41 ~/3rd/llvm> python iacapipe.py --arch 64 a.out
  # Throughput: 0.50; Uops: 2; Latency: 6; Size: 7
  kuper:
      mov    0x2(%rdi),%al
      movzbl %al,%r10d
  
  # Throughput: 0.50; Uops: 1; Latency: 5; Size: 7
  this_patch:
      xor    %r10d,%r10d
      mov    0x2(%rdi),%r10b
  
  # Throughput: 0.50; Uops: 1; Latency: 5; Size: 5
  optimal:
      movzbl 0x2(%rdi),%r10d

And on x86-32:

  21:39:32 ~/3rd/llvm> python iacapipe.py --arch 32 a.out
  # Throughput: 0.50; Uops: 2; Latency: 6; Size: 6
  kuper:
      mov    0x2(%edi),%al
      movzbl %al,%eax
  
  # Throughput: 0.50; Uops: 1; Latency: 5; Size: 5
  this_patch:
      xor    %eax,%eax
      mov    0x2(%edi),%al
  
  # Throughput: 0.50; Uops: 1; Latency: 5; Size: 4
  optimal:
      movzbl 0x2(%edi),%eax

So the single movzbl is smaller than both two-instruction sequences and has
fewer micro-ops than the status quo. Am I missing something obvious?
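
For concreteness, here is the byte-level accounting behind those Size figures.
The encodings below are hand-assembled for the x86-64 sequences (worth
double-checking against an assembler); the x86-32 numbers work out the same
way minus the REX prefixes.

  8a 47 02           movb   0x2(%rdi),%al      # 3 bytes
  44 0f b6 d0        movzbl %al,%r10d          # 4 bytes -> 7 total
  
  45 31 d2           xorl   %r10d,%r10d        # 3 bytes
  44 8a 57 02        movb   0x2(%rdi),%r10b    # 4 bytes -> 7 total
  
  44 0f b6 57 02     movzbl 0x2(%rdi),%r10d    # 5 bytes

The "extra byte" from the FixupBWInstPass comment is the 0F escape (8A vs
0F B6), but it only costs relative to a bare movb; measured against either
two-instruction sequence, the single movzbl wins on size.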

>   // Always try to replace 16 bit load with 32 bit zero extending.
>   // Code size is the same, and there is sometimes a perf advantage
>   // from eliminating a false dependence on the upper portion of
>   // the register.
>
> This leads to 2 questions for me.
>
> (1) What is the code size impact of this new pass?


I'll compile a histogram of the code-size deltas from the diff data in
https://reviews.llvm.org/P7213 and https://reviews.llvm.org/P7214.

> (2) How does the behavior of this new pass compare to simply changing X86FixupBWInsts to always optimize the 8-bit case, i.e. not check for innermost loops?


If the above results from IACA are to be trusted, it would be better to always
convert "movb; movzbl" to plain old movzbl. Currently, this pass would pre-empt
X86FixupBWInsts by rewriting the pair into "xor; movb" before that later pass
has a chance to see the movb/movzbl pattern. It would be easy to add
special-case patterns that this pass should leave untouched; see the sketch
below.
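
To make the ordering concrete, here is a sketch of the interaction (the
register choices are illustrative, reusing the examples above, and the
relative pass order is my reading of the pipeline, not something the patch
pins down):

  # What X86FixupBWInsts would like to see (and could fold to a movzbl):
      movb   0x2(%rdi),%al
      movzbl %al,%r10d
  
  # What it actually sees if this pass runs first; the movzx is already gone:
      xorl   %r10d,%r10d
      movb   0x2(%rdi),%r10b

One such special case would be to skip any gr8 def whose sole use is the
movzx, leaving that pair for X86FixupBWInsts to fold into a single movzbl.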

> Thanks,
> Dave


Please let me know if additional clarification is needed.


Repository:
  rL LLVM

https://reviews.llvm.org/D23253




