[llvm] r195496 - X86: Perform integer comparisons at i32 or larger.

Sean Silva silvas at purdue.edu
Sat Dec 28 19:04:30 PST 2013


Thanks for digging into this!

Some comments inline:


On Sat, Dec 28, 2013 at 4:39 PM, Chandler Carruth <chandlerc at google.com> wrote:

> Well, after some more prodding, I dug into this *much* more deeply. I can
> now explain rather precisely the change to bzip2 which causes this patch to
> improve performance, and interestingly I no longer think this patch is the
> correct approach. Read on, because this was fun. However, be forewarned,
> this is more of a blog post than an email. =]
>
>
> So first, let's all get on the same page about exactly what performance
> problem I'm seeing this patch address (Jim, I *think* when we talked this
> was the one you were looking at as well, but lemme know if not, I wasn't
> 100% certain). Here is a snippet of code produced by Clang with this patch
> reverted for the really hot inner function in the main loop of bzip2 (I've
> based all my benchmarks and measurements on v1.0.6, the latest open source
> release):
>
> http://paste2.org/0YGNdVJV
>
> # BB#2: # %if.end
>   leal 1(%rdi), %eax
>   leal 1(%rsi), %ebp
>   movb (%rdx,%rax), %al
>   movb (%rdx,%rbp), %bl
>   cmpb %bl, %al
>   jne .LBB27_1
>
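> For reference, this assembly comes out of bzip2's mainGtU function. Quoting
> from memory, the source is roughly the following shape; treat the names and
> the exact structure here as a sketch rather than the actual 1.0.6 code:
>
>   typedef unsigned char UChar;
>   typedef unsigned int  UInt32;
>
>   static int mainGtU_sketch(UInt32 i1, UInt32 i2, UChar *block) {
>     UChar c1, c2;
>     /* A long run of unrolled byte compares; each one becomes the
>        movb/movb/cmpb (or movzbl/movzbl/cmpl) sequence above. */
>     c1 = block[i1]; c2 = block[i2];
>     if (c1 != c2) return (c1 > c2);
>     i1++; i2++;
>     c1 = block[i1]; c2 = block[i2];
>     if (c1 != c2) return (c1 > c2);
>     i1++; i2++;
>     /* ...the real function repeats this many more times and then falls
>        into a loop that also compares the quadrant array and wraps the
>        indices around the end of the block... */
>     return 0;
>   }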
>
> Now, originally the theory was that the cmpb was slow, because making it a
> cmpl sped things up (for me, by around 3.5%). This is what this patch does:
>
> http://paste2.org/Y7D8cIcV
>
> # BB#3: # %if.end
>   leal 1(%rdi), %eax
>   leal 1(%rsi), %ebp
>   movzbl (%rdx,%rax), %eax
>   movzbl (%rdx,%rbp), %ebx
>   cmpl %ebx, %eax
>   jne .LBB27_1
>
>
> Ok, cool, this goes faster, but *why*? I was thinking it was due to a
> partial register stall, but I didn't really understand where the stall came
> from. So I tried switching the above code back to 'cmpb':
>
> http://paste2.org/h0MbL2jM
>
> # BB#3: # %if.end
>   leal 1(%rdi), %eax
>   leal 1(%rsi), %ebp
>   movzbl (%rdx,%rax), %eax
>   movzbl (%rdx,%rbp), %ebx
>   cmpb %bl, %al
>   jne .LBB27_1
>
>
> What I didn't expect was that this was just as fast (at least for me on
> the sandybridge machine I was testing on primarily). This got me thinking
> about how partial register stalls come to be. You write to one portion of
> the register, and then it gets merged with whatever was already in the high
> bytes. Well, there are kind of two places where that could happen in the
> original example (movb, movb, cmpb): either when we write the partial
> register with movb, or when we read it with cmpb. For more discussion of
> this from Intel, see their Optimization Reference Manual section 3.5.2.4.
>
> But because we read it with the 'b' suffix, we don't actually read the
> high bytes! So we shouldn't be seeing a stall on the read side. It seems,
> then, that (at least on sandybridge) there is merging of partial registers
> on *write* even without a read.
>

Interesting. This seems contrary to Agner's Sandy Bridge information in
section 9.11 of <http://www.agner.org/optimize/microarchitecture.pdf>:
"There is no penalty for writing to a partial register unless there is a
later read from a larger part of the same register". Maybe you could share
your findings with him?

-- Sean Silva

> Then the movb would be enough to cause the slowdown regardless of how we
> read it. This at least explains the measured performance shifts I see on my
> sandybridge. When I take the original assembly for all of bzip2, and I
> modify just these movb instructions to be movzbl, I see roughly the same
> performance uplift as we get with ToT and Jim's patch.
>
> Ok, so it seems that on sandybridge (and I think haswell, but I don't have
> one handy to test as thoroughly), when we don't need the high bytes of a
> register for an operation and we are going to movb or movw into it, it is
> important that we use movzbl or movzwl (respectively) to remove the false
> dependency on the high bytes at write time. I think that would accomplish
> what this patch set out to do, but in a much more targeted way. It will
> also catch many more of these issues, because any movb has the potential
> to cause this, not just the ones feeding 'cmp'.
>
> Also, the measurements I have done indicate that once we do the above,
> this patch should be reverted. I can't find any performance problems with
> either 8-bit cmpb or 16-bit cmpw (other than the 1-byte prefix cost of
> encoding the latter), and using the more precisely sized cmp instructions
> lets us fold one of the two memory operands above into the cmp instruction.
> One thing I'm still confused by is why the original code didn't do that
> fold.
>
> Here is what I would intuitively expect:
>
> http://paste2.org/KgLfPD8g
>
> # BB#2: # %if.end
>   leal 1(%rdi), %ebx
>   leal 1(%rsi), %ebp
>   movzbl (%rdx,%rbp), %eax
>   cmpb %al, (%rdx,%rbx)
>   jne .LBB27_1
>
>
> Note that I started from the original version (without this patch) and
> manually applied this (and all of the other stuff I'm about to discuss).
> I'm also re-register allocating here; it'll become more obvious later why.
> =] There is no effective change: we're now loading only one side of the
> comparison into a register, zext-ing it to ensure no false dependency, and
> pulling the other byte directly from memory.
>
> However, there is a lot more wrong. Let's start off with the simple and
> obvious stuff. Why are we taking up two more registers and doing lea's just
> to add a displacement to the addresses? The addressing modes using the
> results have an empty displacement! So now we have this:
>
> http://paste2.org/mXZmU0df
>
> # BB#2: # %if.end
>   movzbl 1(%rdx,%rsi), %eax
>   cmpb %al, 1(%rdx,%rdi)
>   jne .LBB27_1
>
>
> Cool, this looks really nice now, and benchmarking confirms it. But before
> we can actually do either of these transforms, we'll also have to fix how
> this is being used. Time for a more complete example:
>
> http://paste2.org/MjW043aZ
>
> # BB#2: # %if.end
>   leal 1(%rdi), %eax
>   leal 1(%rsi), %ebp
>   movb (%rdx,%rax), %al
>   movb (%rdx,%rbp), %bl
>   cmpb %bl, %al
>   jne .LBB27_1
>
> # ... snip ...
>
> .LBB27_1: # %if.then
>   cmpb %al, %bl
>   setb %al
> .LBB27_37: # %return
>   movzbl %al, %eax
>   popq %rbx
>   popq %rbp
>   ret
>
>
> Ok, WTF. We 'cmpb %bl, %al', jump, and then 'cmpb %al, %bl'???? This makes
> no sense, and *every* predecessor looks like this. The reason the .LBB27_37
> label sneaks in there is that there are even more blocks which end in
> essentially the same thing but with different registers instead of %al and
> %bl. Whatever the registers, it's always the same pattern: "cmp X, Y;
> jne ...; cmp Y, X; setb %al". And if we didn't re-do the "cmp", it wouldn't
> have mattered that they were different registers!!! So let's fix that:
>
> http://paste2.org/z2gwXvAY
>
> # BB#2: # %if.end
>   movzbl 1(%rdx,%rsi), %eax
>   cmpb %al, 1(%rdx,%rdi)
>   jne .LBB27_1
> # ... <snip> ...
>
> .LBB27_1: # %if.then
>   seta %al
>   movzbl %al, %eax
>   ret
>
>
> Yea, that's a bit more plausible. =] Note that the 'popq's are gone because,
> once we fold away all of the lea's, we no longer need the two scratch
> registers. The .LBB27_37 label is gone because we no longer have 6 copies
> re-doing a comparison with different registers.
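>
> To spell out why dropping the second cmp is safe: the source here is
> (roughly, as in the sketch earlier) the same byte comparison feeding both
> the branch and the returned value, something like this illustrative helper
> (not the actual bzip2 code):
>
>   /* Sketch: the branch and the returned value test the same pair of
>      bytes, so a single hardware compare can feed both the jne and the
>      setcc. */
>   static int byte_gt(unsigned char c1, unsigned char c2) {
>     if (c1 != c2)
>       return (c1 > c2);  /* same comparison as the 'if' above; at the
>                             machine level one cmp suffices */
>     return 0;
>   }
>
> The EFLAGS from the first cmpb are untouched by the jne, so they are still
> valid in the %if.then block, and swapping the operand order is what turns
> the 'setb' into the 'seta' above.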
>
> One final tweak that is totally unrelated to the rest of this: we're
> missing a DAG combine for the latch of the inner loop here, which looks
> like this:
>
> http://paste2.org/aO0hBw1O
>
> # BB#36: # %if.end396
>   # in Loop: Header=BB27_14 Depth=1
>   addl $8, %edi
>   addl $8, %esi
>   xorl %eax, %eax
>   cmpl %r8d, %edi
>   movl %r8d, %ebp
>   cmovbl %eax, %ebp
>   subl %ebp, %edi
>   cmpl %r8d, %esi
>   movl %r8d, %ebp
>   cmovbl %eax, %ebp
>   subl %ebp, %esi
>   decl (%r9)
>   addl $-8, %r10d
>   jns .LBB27_14
>   jmp .LBB27_37
>
> # ... <snip> ...
>
> .LBB27_37: # %return
>   movzbl %al, %eax
>   popq %rbx
>   popq %rbp
>   ret
>
>
> I'm always really suspicious when I see cmp followed closely by sub.
> Indeed, this is a case of us not realizing that the cmp and the sub can be
> combined. Combining them also removes the last use of a scratch register
> (%ebp), confirming that we can get rid of all the popq business. Once we
> do, we no longer need the jmp either; we can just ret directly:
>
> http://paste2.org/v5K9COgf
>
> # BB#36: # %if.end396
>   # in Loop: Header=BB27_14 Depth=1
>   addl $8, %edi
>   addl $8, %esi
>   movl %edi, %eax
>   subl %r8d, %eax
>   cmovae %eax, %edi
>   movl %esi, %eax
>   subl %r8d, %eax
>   cmovae %eax, %esi
>   decl (%r9)
>   addl $-8, %r10d
>   jns .LBB27_14
>   xorl %eax, %eax
>   ret
>
>
> Yea, that looks more like sane code. =]
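>
> To spell out why the combine is valid: the source pattern here is the
> index wrap-around at the bottom of the loop, roughly like the helper below
> (again a sketch from memory, not the exact bzip2 code):
>
>   #include <stdint.h>
>
>   /* Advance both indices and wrap them around the end of the block. */
>   static void advance_indices(uint32_t *i1, uint32_t *i2, uint32_t nblock) {
>     *i1 += 8; *i2 += 8;
>     /* "*i1 >= nblock" and "*i1 - nblock" come from the same subtraction:
>        compute t = *i1 - nblock and keep t exactly when the subtraction
>        did not borrow. That is the subl + cmovae pair above, with no
>        separate cmpl needed. */
>     if (*i1 >= nblock) *i1 -= nblock;
>     if (*i2 >= nblock) *i2 -= nblock;
>   }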
>
> All of this combines for another 3.5% on bzip2 on sandybridge, despite
> only changing the mainGtU function by hand.
>
> I'll probably break out some of these issues into bugs and such once I
> have more minimal test cases. I think the most important thing is to get
> the correct fix for partial register stalls on sandybridge committed and
> revert this patch so we don't keep paying a tax for a halfway fix. The rest
> is gravy, if very interesting all the same.
>
> Just for reference, here is the complete asm before and after:
>
> Before: http://paste2.org/CC50nde3
> After: http://paste2.org/J3DHbPJC
> - 100 fewer instructions
> - 30% smaller encoded size after assembling
> - 7% faster on sandybridge, 3.5% faster than with this patch (r195496)
>
>
> Also for context, the benchmarking was done with a little compression
> algorithm benchmarking tool that I'm working to open source and will
> hopefully add to the test suite soon after. It's based on bzip2 1.0.6
> (latest sources) and I primarily measured performance by compressing
> linux-3.12.tar with it.
>
> -Chandler
>