[LLVMbugs] [Bug 18629] New: Insane code generated after unrolling a loop from libquantum

bugzilla-daemon at llvm.org
Mon Jan 27 05:53:23 PST 2014


http://llvm.org/bugs/show_bug.cgi?id=18629

            Bug ID: 18629
           Summary: Insane code generated after unrolling a loop from
                    libquantum
           Product: libraries
           Version: trunk
          Hardware: PC
                OS: All
            Status: NEW
          Severity: normal
          Priority: P
         Component: Backend: X86
          Assignee: unassignedbugs at nondot.org
          Reporter: chandlerc at gmail.com
                CC: dnovillo at google.com, grosbach at apple.com,
                    llvmbugs at cs.uiuc.edu
    Classification: Unclassified

First off, yes yes, I know libquantum is crazy. But let's look at this
simplified example, which we should at least not do royally bad things to:

% cat x.c
struct S {
  int a, b;
  unsigned long c;
};

void f(int x, long size, struct S *s) {
  int i;
  for (i = 0; i < size; ++i)
    s[i].c ^= 1ul << x;
}

Nothing crazy. Non-contiguous memory writes, so vectorizing is a bad idea. But
unrolling this kind of loop is a pretty big win. Or at least, it would be, if
we could do a halfway decent job of it.

First off, the use of register pressure in the unroll phase of the loop
vectorizer is deeply flawed, but I'll file that bug separately. Here I'll just
force the unroll factor and the number of scalar registers to be crazy high,
and grow the small-loop cost threshold, so that we actually see some wins from
unrolling tiny scalar loops without loop-carried dependencies.

This is essentially an attempt to exploit ILP, which is one of the stated
goals of the loop vectorizer's unrolling.


Looking at the IR for this, the canonicalization seems somewhat crazy to me:
it rewrites the GEPs for each element from something like this:

  %base = getelementptr inbounds %struct.S* %s, i64 %index
  %0 = getelementptr inbounds %struct.S* %base, i64 0, i32 2
  %1 = getelementptr inbounds %struct.S* %base, i64 1, i32 2
  %2 = getelementptr inbounds %struct.S* %base, i64 2, i32 2
  %3 = getelementptr inbounds %struct.S* %base, i64 3, i32 2
  %4 = getelementptr inbounds %struct.S* %base, i64 4, i32 2
  %5 = getelementptr inbounds %struct.S* %base, i64 5, i32 2
  %6 = getelementptr inbounds %struct.S* %base, i64 6, i32 2
  %7 = getelementptr inbounds %struct.S* %base, i64 7, i32 2

into something like this:

  %0 = getelementptr inbounds %struct.S* %s, i64 %index, i32 2
  %base.sum1 = or i64 %index, 1
  %1 = getelementptr inbounds %struct.S* %s, i64 %base.sum1, i32 2
  %base.sum2 = or i64 %index, 2
  %2 = getelementptr inbounds %struct.S* %s, i64 %base.sum2, i32 2
  %base.sum3 = or i64 %index, 3
  %3 = getelementptr inbounds %struct.S* %s, i64 %base.sum3, i32 2
  %base.sum4 = or i64 %index, 4
  %4 = getelementptr inbounds %struct.S* %s, i64 %base.sum4, i32 2
  %base.sum5 = or i64 %index, 5
  %5 = getelementptr inbounds %struct.S* %s, i64 %base.sum5, i32 2
  %base.sum6 = or i64 %index, 6
  %6 = getelementptr inbounds %struct.S* %s, i64 %base.sum6, i32 2
  %base.sum7 = or i64 %index, 7
  %7 = getelementptr inbounds %struct.S* %s, i64 %base.sum7, i32 2

I have no idea why. (Presumably the 'or's are valid because %index steps by 8
in the unrolled loop, so its low three bits are known zero and 'add' can be
turned into 'or'; but that doesn't explain why this is a good canonicalization.)


It also expresses the loop in terms of the integer index rather than computing
the end pointer and using a pointer equality test to terminate the loop. I
don't really care about this per se, but I thought it might allow codegen to
emit a single arithmetic operation per iteration by removing the integer
induction variable. Ironically, we codegen a separate induction variable
anyway and don't use it for the addresses. This makes no sense to me, but
whatever.
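For illustration, here's a hand-written sketch (not compiler output) of what
the scalar loop could look like with an end-pointer test; %rcx is a running
pointer to s[i].c, %r9 holds the mask, and the end pointer is assumed to have
been precomputed into %rsi (the cmp/jne pair macro-fuses on modern x86,
leaving a single unfused arithmetic op per iteration):

.Lscalar:                               # hypothetical label and register choice
        xorq    %r9, (%rcx)             # s[i].c ^= 1ul << x, in memory
        addq    $16, %rcx               # advance by sizeof(struct S) == 16
        cmpq    %rsi, %rcx              # hit the precomputed end pointer?
        jne     .Lscalar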


Anyways, unless you want the IR to be different from the above going into
codegen, this is an x86 codegen bug. See below for the hilarity. Sorry for the
snark, it's late. =D

% ./bin/clang -S -O2 -o - x.c -mllvm -force-target-max-scalar-unroll=8 \
    -mllvm -small-loop-cost=40 -mllvm -force-target-num-scalar-regs=32
        .file   "x.c"
        .text
        .globl  f
        .align  16, 0x90
        .type   f,@function
f:                                      # @f
        .cfi_startproc
# BB#0:                                 # %entry
        pushq   %r15
.Ltmp5:
        .cfi_def_cfa_offset 16
        pushq   %r14
.Ltmp6:
        .cfi_def_cfa_offset 24
        pushq   %r12
.Ltmp7:
        .cfi_def_cfa_offset 32
        pushq   %rbx

<snarky commentary>
Ok, this is already bad. Why do we need to save four callee-saved registers?
The unvectorized version doesn't need any. I'll give you maybe one for the
complex peeled-loop form, but I doubt you need four.
</snarky commentary>

.Ltmp8:
        .cfi_def_cfa_offset 40
.Ltmp9:
        .cfi_offset %rbx, -40
.Ltmp10:
        .cfi_offset %r12, -32
.Ltmp11:
        .cfi_offset %r14, -24
.Ltmp12:
        .cfi_offset %r15, -16
        testq   %rsi, %rsi
        jle     .LBB0_8
# BB#1:                                 # %for.body.lr.ph
        movl    $1, %r9d
        movb    %dil, %cl
        shlq    %cl, %r9
        xorl    %ecx, %ecx
        movq    %rsi, %r8
        andq    $-8, %r8
        je      .LBB0_5
# BB#2:                                 # %vector.body.preheader
        leaq    120(%rdx), %rdi
        movq    %rsi, %r14
        andq    $-8, %r14
        .align  16, 0x90
.LBB0_3:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
        movq    -96(%rdi), %r10
        xorq    %r9, %r10
        movq    -80(%rdi), %r11
        xorq    %r9, %r11
        movq    -64(%rdi), %r15
        xorq    %r9, %r15
        movq    -48(%rdi), %r12
        xorq    %r9, %r12
        movq    -32(%rdi), %rcx
        xorq    %r9, %rcx
        movq    -16(%rdi), %rax
        xorq    %r9, %rax
        movq    (%rdi), %rbx
        xorq    %r9, %rbx
        xorq    %r9, -112(%rdi)
        movq    %r10, -96(%rdi)
        movq    %r11, -80(%rdi)
        movq    %r15, -64(%rdi)
        movq    %r12, -48(%rdi)
        movq    %rcx, -32(%rdi)
        movq    %rax, -16(%rdi)
        movq    %rbx, (%rdi)

<snarky commentary>

Ok, wtf. 14 movq instructions for an 8x unroll? That's nutty. There is a
perfectly good xor instruction that accepts a memory operand. We even use it
once here (on -112(%rdi)) and in the tail loop! Wtf. We're not even extending
or struggling with addressing modes. Please make this use xorq 8 times with
memory operands (sketch below). =]

</snarky commentary>
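For the record, here's a hand-written sketch (not compiler output) of the loop
body I'd expect, reusing the exact offsets above:

        xorq    %r9, -112(%rdi)         # s[i+0].c ^= mask, directly in memory
        xorq    %r9, -96(%rdi)          # s[i+1].c
        xorq    %r9, -80(%rdi)          # s[i+2].c
        xorq    %r9, -64(%rdi)          # s[i+3].c
        xorq    %r9, -48(%rdi)          # s[i+4].c
        xorq    %r9, -32(%rdi)          # s[i+5].c
        xorq    %r9, -16(%rdi)          # s[i+6].c
        xorq    %r9, (%rdi)             # s[i+7].c

Eight independent read-xor-write operations and no scratch registers, which
would likely also make the four callee-saved register saves in the prologue
unnecessary.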

        subq    $-128, %rdi
        addq    $-8, %r14

<snarky commentary>

Why are we unable to express the loop in terms of the pointer %rdi? That would
save us a register in the loop and reuse the sub's flags for the backedge. We
don't even need to compute the remainder separately; it already falls out of
the masked trip count (%r8) that guards entering the unrolled loop. See the
sketch below.

</snarky commentary>
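One possible shape, again as a hand-written sketch: precompute the end of the
unrolled region into a free register (here I hypothetically reuse %r8, which
currently holds the masked trip count) and test the pointer directly:

        subq    $-128, %rdi             # advance %rdi by 8 * sizeof(struct S)
        cmpq    %r8, %rdi               # %r8 = hypothetical precomputed end pointer
        jne     .LBB0_3

That drops %r14 and its addq from the loop entirely; the cmp/jne pair
macro-fuses on most modern x86 cores, so the backedge stays cheap.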


        jne     .LBB0_3
# BB#4:
        movq    %r8, %rcx
.LBB0_5:                                # %middle.block
        cmpq    %rsi, %rcx
        je      .LBB0_8
# BB#6:                                 # %for.body.preheader
        subq    %rcx, %rsi
        shlq    $4, %rcx
        leaq    8(%rdx,%rcx), %rcx
        .align  16, 0x90
.LBB0_7:                                # %for.body
                                        # =>This Inner Loop Header: Depth=1
        xorq    %r9, (%rcx)
        addq    $16, %rcx
        decq    %rsi
        jne     .LBB0_7
.LBB0_8:                                # %for.end
        popq    %rbx
        popq    %r12
        popq    %r14
        popq    %r15
        retq
.Ltmp13:
        .size   f, .Ltmp13-f
        .cfi_endproc


        .ident  "clang version 3.5 "
        .section        ".note.GNU-stack","",@progbits
