[llvm] r195496 - X86: Perform integer comparisons at i32 or larger.

Chandler Carruth chandlerc at google.com
Sat Dec 28 15:39:00 PST 2013


Well, after some more prodding, I dug into this *much* more deeply. I can
now explain rather precisely the change to bzip2 which causes this patch to
improve performance, and interestingly I no longer think this patch is the
correct approach. Read on, because this was fun. However, be forewarned,
this is more of a blog post than an email. =]


So first, let's all get on the same page about exactly what performance
problem I'm seeing this patch address (Jim, I *think* when we talked this
was the one you were looking at as well, but lemme know if not, I wasn't
100% certain). Here is a snippet of code produced by Clang with this patch
reverted for the really hot inner function in the main loop of bzip2 (I've
based all my benchmarks and measurements on v1.0.6, the latest open source
release):

http://paste2.org/0YGNdVJV

  # BB#2: # %if.end
  leal 1(%rdi), %eax
  leal 1(%rsi), %ebp
  movb (%rdx,%rax), %al
  movb (%rdx,%rbp), %bl
  cmpb %bl, %al
  jne .LBB27_1
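
For reference, here is a rough C paraphrase of the step in bzip2's mainGtU()
that this corresponds to (written from memory of the 1.0.6 sources, so treat
the helper name and exact control flow as approximate rather than verbatim):

  /* Rough paraphrase of one byte-compare step in bzip2 1.0.6's mainGtU();
     compare_step and its exact shape are my approximation, not the real code. */
  typedef unsigned char UChar;
  typedef unsigned int  UInt32;

  static int compare_step(UChar *block, UInt32 i1, UInt32 i2)
  {
      UChar c1, c2;
      i1++; i2++;              /* the two leal 1(...), ... above            */
      c1 = block[i1];          /* movb (%rdx,%rax), %al                     */
      c2 = block[i2];          /* movb (%rdx,%rbp), %bl                     */
      if (c1 != c2)            /* cmpb %bl, %al ; jne .LBB27_1              */
          return c1 > c2;      /* the setb/seta tail shown further down     */
      return -1;               /* placeholder: the real code keeps scanning */
  }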


Now, originally the theory was that the cmpb itself was slow, because making
it a cmpl sped things up (for me, by around 3.5%). This is what this patch does:

http://paste2.org/Y7D8cIcV

  # BB#3: # %if.end
  leal 1(%rdi), %eax
  leal 1(%rsi), %ebp
  movzbl (%rdx,%rax), %eax
  movzbl (%rdx,%rbp), %ebx
  cmpl %ebx, %eax
  jne .LBB27_1


Ok, cool, this goes faster, but *why*? I was thinking it was around a
partial register stall, but I didn't really understand where the stall came
from. So I tried switching the above code back to 'cmpb':

http://paste2.org/h0MbL2jM

  # BB#3: # %if.end
  leal 1(%rdi), %eax
  leal 1(%rsi), %ebp
  movzbl (%rdx,%rax), %eax
  movzbl (%rdx,%rbp), %ebx
  cmpb %bl, %al
  jne .LBB27_1


What I didn't expect was that this was just as fast (at least for me on the
sandybridge machine I was testing on primarily). This got me thinking about
how partial register stalls come to be. You write to one portion of the
register, and then it gets merged with whatever was already in the high
bytes. Well, there are kind of two places where that could happen in the
original example (movb, movb, cmpb): either when we write the partial
register with movb, or when we read it with cmpb. For more discussion of
this from Intel, see their Optimization Reference Manual section 3.5.2.4.

But because we read it with the 'b' suffix, we don't actually read the high
bytes! So we shouldn't be seeing a stall on the read side. It seems then that
(at least on sandybridge) there is merging of partial registers on *write*
even without a read, so the movb alone would be enough to cause the slowdown
regardless of how we read it. This at least explains the measured performance
shifts I see on my sandybridge: when I take the original assembly for all of
bzip2 and modify just these movb instructions to be movzbl, I see roughly the
same performance uplift as we get with ToT and Jim's patch.

Ok, so it seems that on sandybridge (and I think haswell, but I don't have
one handy to test as thoroughly), whenever we don't need the high bytes of a
register for an operation and we are about to movb or movw into it, it is
important to use movzbl or movzwl (resp.) instead, to remove the write-time
false dependency on the high bytes. I think that would accomplish what this
patch set out to do, but in a much more targeted way. It would also catch many
more of these issues, because any movb has the potential to cause this, not
just the ones feeding a 'cmp'.
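
To make the write-side false dependency concrete, here is a minimal sketch of
the kind of microbenchmark one could use to check it. This is my own
illustration (assuming GCC/Clang-style GNU inline asm), not something from the
measurements above:

  #include <stddef.h>
  #include <stdint.h>

  static uint8_t buf[4096];

  /* movb writes only the low byte of the destination register, so each
     iteration has to merge with the register's stale high bytes -- a
     loop-carried false dependency created purely by the write. */
  void loop_movb(size_t iters)
  {
      uint64_t acc = 0;
      for (size_t i = 0; i < iters; i++)
          __asm__ volatile("movb %1, %b0" : "+r"(acc) : "m"(buf[i & 4095]));
  }

  /* movzbl writes the full 32-bit register (zeroing the upper half of the
     64-bit register), so there is nothing to merge and no false dependency,
     even though we still only care about one byte. */
  void loop_movzbl(size_t iters)
  {
      uint64_t acc = 0;
      for (size_t i = 0; i < iters; i++)
          __asm__ volatile("movzbl %1, %k0" : "+r"(acc) : "m"(buf[i & 4095]));
  }

Timing the two loops with your favorite harness should, on the theory above,
show the movb version measurably slower on sandybridge.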

Also, the measurements I have done indicate that once we do the above, this
patch should be reverted. I can't find any performance problems with either
the 8-bit cmpb or the 16-bit cmpw (other than the 1-byte prefix cost of
encoding the latter), and using the precisely sized cmp instructions lets us
fold one of the two memory operands above into the cmp itself (a cmpl would
have to read four bytes, so the byte load couldn't be folded into it). One
thing I'm still confused by is why the original code didn't do that fold.

Here is what I would intuitively expect:

http://paste2.org/KgLfPD8g

  # BB#2: # %if.end
  leal 1(%rdi), %ebx
  leal 1(%rsi), %ebp
  movzbl (%rdx,%rbp), %eax
  cmpb %al, (%rdx,%rbx)
  jne .LBB27_1


Note that I started from the original version (without this patch) and
manually applied this (and all of the other stuff I'm about to discuss).
I'm also re-register allocating here; it'll become obvious later why. =]
There is no effective change: we're now loading only one side of the
comparison into a register, zext-ing it to avoid any false dependency, and
pulling the other byte directly from memory.

However, there is a lot more wrong. Let's start off with the simple and
obvious stuff. Why are we taking up two more registers and doing lea's just
to add a displacement to the addresses? The addressing modes using the
results have an empty displacement! So now we have this:

http://paste2.org/mXZmU0df

  # BB#2: # %if.end
  movzbl 1(%rdx,%rsi), %eax
  cmpb %al, 1(%rdx,%rdi)
  jne .LBB27_1


Cool, this looks really nice now, and benchmarking confirms it. But before
we can actually do either of these transforms, we'll also have to fix how
this is being used. Time for a more complete example:

http://paste2.org/MjW043aZ

  # BB#2: # %if.end
  leal 1(%rdi), %eax
  leal 1(%rsi), %ebp
  movb (%rdx,%rax), %al
  movb (%rdx,%rbp), %bl
  cmpb %bl, %al
  jne .LBB27_1

  # ... snip ...

  .LBB27_1: # %if.then
  cmpb %al, %bl
  setb %al
  .LBB27_37: # %return
  movzbl %al, %eax
  popq %rbx
  popq %rbp
  ret


Ok, WTF. We 'cmpb %bl, %al', jump, and then 'cmpb %al, %bl'???? This makes
no sense, and *every* predecessor looks like this. The reason the .LBB27_37
label is sneaking around there is that there are even more blocks which end
in essentially the same thing but with different registers instead of %al
and %bl. Whatever the registers, it's always the same pattern: "cmp X, Y;
jne ...;  cmp Y, X;  setb %al". And if we didn't re-do the "cmp", it
wouldn't have mattered that they were different registers!!! So let's fix
that:

http://paste2.org/z2gwXvAY

  # BB#2: # %if.end
  movzbl 1(%rdx,%rsi), %eax
  cmpb %al, 1(%rdx,%rdi)
  jne .LBB27_1
  # ... <snip> ...

  .LBB27_1: # %if.then
  seta %al
  movzbl %al, %eax
  ret


Yea, that's a bit more plausible. =] Note that the 'popq's are gone because
once we fold away all of the lea's we no longer need the two scratch
registers. The .LBB27_37 label is gone because we no longer have 6 copies
re-doing the comparison with different registers.

One final tweak, totally unrelated to the rest of this: we're missing a DAG
combine for the latch of the inner loop here, which looks like this:

http://paste2.org/aO0hBw1O

  # BB#36: # %if.end396
  # in Loop: Header=BB27_14 Depth=1
  addl $8, %edi
  addl $8, %esi
  xorl %eax, %eax
  cmpl %r8d, %edi
  movl %r8d, %ebp
  cmovbl %eax, %ebp
  subl %ebp, %edi
  cmpl %r8d, %esi
  movl %r8d, %ebp
  cmovbl %eax, %ebp
  subl %ebp, %esi
  decl (%r9)
  addl $-8, %r10d
  jns .LBB27_14
  jmp .LBB27_37

  # ... <snip> ...

  .LBB27_37: # %return
  movzbl %al, %eax
  popq %rbx
  popq %rbp
  ret


I'm always really suspicious when I see cmp followed closely by sub.
Indeed, this is a case of us not realizing that the cmp and sub can be
combined. Doing so also removes the last use of a scratch register (%ebp),
confirming that we can get rid of all the popq business, and once that's
gone we no longer need the jmp either; we can just ret directly:

http://paste2.org/v5K9COgf

  # BB#36: # %if.end396
  # in Loop: Header=BB27_14 Depth=1
  addl $8, %edi
  addl $8, %esi
  movl %edi, %eax
  subl %r8d, %eax
  cmovae %eax, %edi
  movl %esi, %eax
  subl %r8d, %eax
  cmovae %eax, %esi
  decl (%r9)
  addl $-8, %r10d
  jns .LBB27_14
  xorl %eax, %eax
  ret


Yea, that looks more like sane code. =]
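
For context, here is my reconstruction of the source-level pattern behind that
latch and why the separate cmp is redundant (again a paraphrase of bzip2's
index wrap-around from memory; the names are approximate):

  /* My reconstruction of the loop-latch pattern: advance an index by 8 and
     reduce it modulo nblock. */
  static unsigned wrap(unsigned i, unsigned nblock)
  {
      i += 8;
      /* The cmp in the original code and the sub both compute i - nblock;
         the sub already sets exactly the flags the cmov needs, which is why
         the movl/subl/cmovae sequence above can drop the separate cmpl. */
      if (i >= nblock)
          i -= nblock;
      return i;
  }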

All of this combines for another 3.5% on bzip2 on sandybridge, despite my
having changed only the mainGtU function, by hand.

I'll probably break out some of these issues into bugs and such once I have
more minimal test cases. I think the most important thing is to get the
correct fix for partial register stalls on sandybridge committed and revert
this patch so we don't keep paying a tax for a halfway fix. The rest is
gravy, if very interesting all the same.

Just for reference, here is the complete asm before and after:

Before: http://paste2.org/CC50nde3
After: http://paste2.org/J3DHbPJC
- 100 fewer instructions
- 30% smaller encoded size after assembling
- 7% faster on sandybridge, 3.5% faster than with this patch (r195496)


Also for context, the benchmarking was done with a little compression
algorithm benchmarking tool that I'm working to open source and will
hopefully add to the test suite soon after. It's based on bzip2 1.0.6
(latest sources) and I primarily measured performance by compressing
linux-3.12.tar with it.

-Chandler