[PATCH] D23253: [X86] Generalized transformation of `definstr gr8; movzx gr32, gr8` to `xor gr32, gr32; definstr gr8`
bryant via llvm-commits
llvm-commits at lists.llvm.org
Mon Aug 22 18:57:02 PDT 2016
bryant added a comment.
In https://reviews.llvm.org/D23253#522637, @DavidKreitzer wrote:
> Hi bryant,
>
> I haven't found time to review this patch in detail yet, but here are some initial comments/questions.
>
> Given some of your sample generated-code changes, I expected to see lots of changes in tests/CodeGen/X86. Are you planning to add those changes to this patch later?
Yes, definitely. I will include the updates to `test/*.ll` in the next revision
of this diff.
> Your output code examples have a few instances of this:
>
> + xorl %r10d, %r10d
> + movb 2(%rdi), %r10b
>
> Rather than insert an xor here, you'd prefer to convert the movb to movzbl.
Just to clarify: the optimal result is `movzbl 2(%rdi), %r10d`, not
`movb 2(%rdi), %al; movzbl %al, %r10d` (the status quo), and not
`xorl %r10d, %r10d; movb 2(%rdi), %r10b` (what this patch currently emits).
Correct?
> Converting movb to movzbl (and movw to movzwl) is essentially what FixupBWInstPass does. The author of that pass was deliberately aggressive about converting movw to movzwl but a bit more conservative about converting movb to movzbl. Here are the relevant comments:
>
> // Only replace 8 bit loads with the zero extending versions if
> // in an inner most loop and not optimizing for size. This takes
> // an extra byte to encode, and provides limited performance upside.
I find this comment strange. The stats on x86-64 are:
21:38:41 ~/3rd/llvm> python iacapipe.py --arch 64 a.out
# Throughput: 0.50; Uops: 2; Latency: 6; Size: 7
kuper:
mov 0x2(%rdi),%al
movzbl %al,%r10d
# Throughput: 0.50; Uops: 1; Latency: 5; Size: 7
this_patch:
xor %r10d,%r10d
mov 0x2(%rdi),%r10b
# Throughput: 0.50; Uops: 1; Latency: 5; Size: 5
optimal:
movzbl 0x2(%rdi),%r10d
And on x86-32:
21:39:32 ~/3rd/llvm> python iacapipe.py --arch 32 a.out
# Throughput: 0.50; Uops: 2; Latency: 6; Size: 6
kuper:
mov 0x2(%edi),%al
movzbl %al,%eax
# Throughput: 0.50; Uops: 1; Latency: 5; Size: 5
this_patch:
xor %eax,%eax
mov 0x2(%edi),%al
# Throughput: 0.50; Uops: 1; Latency: 5; Size: 4
optimal:
movzbl 0x2(%edi),%eax
So the movzbl form is both smaller and uses fewer micro-ops than the
movb + movzbl sequence. Am I missing something obvious?
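To make the comparison concrete, here is a minimal sketch of what the movb to
movzbl rewrite amounts to at the MachineInstr level. This is not the actual
X86FixupBWInsts code: the helper name is made up, and the liveness and
memory-operand bookkeeping that the real pass does are omitted; the snippet
simply assumes the 32-bit super-register is safe to clobber.

// Minimal sketch (assumed helper, not the X86FixupBWInsts implementation):
// rewrite `movb mem, %r8` into the zero-extending `movzbl mem, %r32`.
#include "X86.h"
#include "X86InstrInfo.h"
#include "llvm/CodeGen/MachineBasicBlock.h"
#include "llvm/CodeGen/MachineInstrBuilder.h"

using namespace llvm;

static void rewriteByteLoadToMovzbl(MachineBasicBlock &MBB, MachineInstr &MI,
                                    const X86InstrInfo *TII) {
  assert(MI.getOpcode() == X86::MOV8rm && "expected a byte load");

  // Map the 8-bit destination (e.g. %r10b) to its 32-bit super-register
  // (%r10d) using the backend's sub/super-register helper.
  Register Dest32 = getX86SubSuperRegister(MI.getOperand(0).getReg(), 32);

  // Emit the movzbl, reusing the original memory operands
  // (base, scale, index, displacement, segment).
  MachineInstrBuilder MIB =
      BuildMI(MBB, MI, MI.getDebugLoc(), TII->get(X86::MOVZX32rm8), Dest32);
  for (unsigned I = 1, E = MI.getNumOperands(); I != E; ++I)
    MIB.add(MI.getOperand(I));

  MI.eraseFromParent();
}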
>
>
> // Always try to replace 16 bit load with 32 bit zero extending.
> // Code size is the same, and there is sometimes a perf advantage
> // from eliminating a false dependence on the upper portion of
> // the register.
>
>
> This leads to 2 questions for me.
>
> (1) What is the code size impact of this new pass?
I'll compile a nice histogram from the diff data in
https://reviews.llvm.org/P7213 and https://reviews.llvm.org/P7214 .
> (2) How does the behavior of this new pass compare to simply changing X86FixupBWInsts to always optimize the 8-bit case, i.e. not check for innermost loops?
If the above IACA results are to be trusted, it would be better to always
convert "movb; movzbl" to plain old movzbl. Currently, this pass would pre-empt
X86FixupBWInsts by transforming the pattern into "xor; movb" before the later
pass has a chance to see the movb-movzbl pair. It would be easy to add
special-case patterns that this pass should leave untouched; a sketch of such a
check follows.
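For example, here is a hedged sketch of one such special case, under the
assumption that the pass still sees virtual registers in SSA form (if it runs
after register allocation, a liveness-based check would be needed instead).
The helper `shouldLeaveForMovzbl` is hypothetical and not part of the posted
patch.

// Minimal sketch: skip the xor + movb rewrite when the movzx source is a
// plain byte load, so the pair can instead become a single `movzbl mem, %r32`.
#include "X86.h"
#include "llvm/CodeGen/MachineInstr.h"
#include "llvm/CodeGen/MachineRegisterInfo.h"

using namespace llvm;

static bool shouldLeaveForMovzbl(const MachineInstr &MovZX,
                                 const MachineRegisterInfo &MRI) {
  assert(MovZX.getOpcode() == X86::MOVZX32rr8 && "expected movzbl r8 -> r32");

  Register Src8 = MovZX.getOperand(1).getReg();
  if (!Src8.isVirtual())
    return false; // only reason about virtual registers here

  // If the 8-bit source comes straight from a byte load, leave the pattern
  // alone and let the load itself be widened to movzbl.
  const MachineInstr *Def = MRI.getUniqueVRegDef(Src8);
  return Def && Def->getOpcode() == X86::MOV8rm;
}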
> Thanks,
> Dave
Please let me know if additional clarification is needed.
Repository:
rL LLVM
https://reviews.llvm.org/D23253