[LLVMbugs] [Bug 18629] New: Insane code generated after unrolling a loop from libquantum
bugzilla-daemon at llvm.org
Mon Jan 27 05:53:23 PST 2014
http://llvm.org/bugs/show_bug.cgi?id=18629
Bug ID: 18629
Summary: Insane code generated after unrolling a loop from
libquantum
Product: libraries
Version: trunk
Hardware: PC
OS: All
Status: NEW
Severity: normal
Priority: P
Component: Backend: X86
Assignee: unassignedbugs at nondot.org
Reporter: chandlerc at gmail.com
CC: dnovillo at google.com, grosbach at apple.com,
llvmbugs at cs.uiuc.edu
Classification: Unclassified
First off, yes yes, I know libquantum is crazy. But let's look at this
simplified example, which we should at least not do royally bad things to:
% cat x.c
struct S {
  int a, b;
  unsigned long c;
};

void f(int x, long size, struct S *s) {
  int i;
  for (i = 0; i < size; ++i)
    s[i].c ^= 1ul << x;
}
Nothing crazy. Non-contiguous memory writes, so vectorizing is a bad idea. But
unrolling this kind of loop is a pretty big win. Or at least, it would be, if
we could do a halfway decent job of it.
First off, the use of register pressure in the unroll phase of the loop
vectorizer is deeply flawed, but I'll file that bug separately. Here I've
forced the number of scalar registers to be crazy high and grown the
small-loop cost threshold in order to actually see some wins from unrolling
tiny scalar loops without carried dependencies.
This is essentially an attempt to exploit ILP, which is one of the stated
goals of the loop vectorizer's unrolling.
Looking at the IR for this, the canonicalization is somewhat crazy to me: it
coerces the GEPs for each element from something like this:
%base = getelementptr inbounds %struct.S* %s, i64 %index
%0 = getelementptr inbounds %struct.S* %base, i64 0, i32 2
%1 = getelementptr inbounds %struct.S* %base, i64 1, i32 2
%2 = getelementptr inbounds %struct.S* %base, i64 2, i32 2
%3 = getelementptr inbounds %struct.S* %base, i64 3, i32 2
%4 = getelementptr inbounds %struct.S* %base, i64 4, i32 2
%5 = getelementptr inbounds %struct.S* %base, i64 5, i32 2
%6 = getelementptr inbounds %struct.S* %base, i64 6, i32 2
%7 = getelementptr inbounds %struct.S* %base, i64 7, i32 2
Into something like this:
%0 = getelementptr inbounds %struct.S* %s, i64 %index, i32 2
%base.sum1 = or i64 %index, 1
%1 = getelementptr inbounds %struct.S* %s, i64 %base.sum1, i32 2
%base.sum2 = or i64 %index, 2
%2 = getelementptr inbounds %struct.S* %s, i64 %base.sum2, i32 2
%base.sum3 = or i64 %index, 3
%3 = getelementptr inbounds %struct.S* %s, i64 %base.sum3, i32 2
%base.sum4 = or i64 %index, 4
%4 = getelementptr inbounds %struct.S* %s, i64 %base.sum4, i32 2
%base.sum5 = or i64 %index, 5
%5 = getelementptr inbounds %struct.S* %s, i64 %base.sum5, i32 2
%base.sum6 = or i64 %index, 6
%6 = getelementptr inbounds %struct.S* %s, i64 %base.sum6, i32 2
%base.sum7 = or i64 %index, 7
%7 = getelementptr inbounds %struct.S* %s, i64 %base.sum7, i32 2
I have no idea why.
It also expresses the loop in terms of the integer index rather than computing
the end pointer and just using a pointer equality test to terminate the loop. I
don't really care about this, but I thought it might allow codegen to emit a
single arithmetic operation per loop by removing the integer induction
variable. Ironically, we just codegen a separate induction variable anyways and
don't use it for the addresses. This makes no sense to me, but whatever.
Anyways, unless you want the IR to be different from the above going into
codegen, this is an x86 codegen bug. See below for the hilarity. Sorry for the
snark, it's late. =D
% ./bin/clang -S -O2 -o - x.c -mllvm -force-target-max-scalar-unroll=8 \
    -mllvm -small-loop-cost=40 -mllvm -force-target-num-scalar-regs=32
        .file   "x.c"
        .text
        .globl  f
        .align  16, 0x90
        .type   f,@function
f:                                      # @f
        .cfi_startproc
# BB#0:                                 # %entry
        pushq   %r15
.Ltmp5:
        .cfi_def_cfa_offset 16
        pushq   %r14
.Ltmp6:
        .cfi_def_cfa_offset 24
        pushq   %r12
.Ltmp7:
        .cfi_def_cfa_offset 32
        pushq   %rbx
<snarky commentary>
Ok, this is already bad. Why did we need 4 registers? The unvectorized version
doesn't need any. I'll give you maybe 1 for the complex peeled loop form, but I
doubt you need 4.
</snarky commentary>
.Ltmp8:
        .cfi_def_cfa_offset 40
.Ltmp9:
        .cfi_offset %rbx, -40
.Ltmp10:
        .cfi_offset %r12, -32
.Ltmp11:
        .cfi_offset %r14, -24
.Ltmp12:
        .cfi_offset %r15, -16
        testq   %rsi, %rsi
        jle     .LBB0_8
# BB#1:                                 # %for.body.lr.ph
        movl    $1, %r9d
        movb    %dil, %cl
        shlq    %cl, %r9
        xorl    %ecx, %ecx
        movq    %rsi, %r8
        andq    $-8, %r8
        je      .LBB0_5
# BB#2:                                 # %vector.body.preheader
        leaq    120(%rdx), %rdi
        movq    %rsi, %r14
        andq    $-8, %r14
        .align  16, 0x90
.LBB0_3:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
        movq    -96(%rdi), %r10
        xorq    %r9, %r10
        movq    -80(%rdi), %r11
        xorq    %r9, %r11
        movq    -64(%rdi), %r15
        xorq    %r9, %r15
        movq    -48(%rdi), %r12
        xorq    %r9, %r12
        movq    -32(%rdi), %rcx
        xorq    %r9, %rcx
        movq    -16(%rdi), %rax
        xorq    %r9, %rax
        movq    (%rdi), %rbx
        xorq    %r9, %rbx
        xorq    %r9, -112(%rdi)
        movq    %r10, -96(%rdi)
        movq    %r11, -80(%rdi)
        movq    %r15, -64(%rdi)
        movq    %r12, -48(%rdi)
        movq    %rcx, -32(%rdi)
        movq    %rax, -16(%rdi)
        movq    %rbx, (%rdi)
<snarky commentary>
Ok, wtf. 16 movq instructions for an 8x unroll? That's nutty. There is a
perfectly good xor instruction that accepts a memory operand. We even use it
once here and in the tail loop! Wtf. We're not even extending or struggling
with addressing modes. Please make this use xorq 8 times with memory operands.
=]
</snarky commentary>
        subq    $-128, %rdi
        addq    $-8, %r14
<snarky commentary>
Why are we unable to express the loop in terms of the pointer %rdi? That would
save us a register in the loop and reuse the sub's flags for the backedge. We
don't even need a separate trip counter here -- the remainder is already
computed by the mask that guards entering the unrolled loop.
</snarky commentary>
        jne     .LBB0_3
# BB#4:
        movq    %r8, %rcx
.LBB0_5:                                # %middle.block
        cmpq    %rsi, %rcx
        je      .LBB0_8
# BB#6:                                 # %for.body.preheader
        subq    %rcx, %rsi
        shlq    $4, %rcx
        leaq    8(%rdx,%rcx), %rcx
        .align  16, 0x90
.LBB0_7:                                # %for.body
                                        # =>This Inner Loop Header: Depth=1
        xorq    %r9, (%rcx)
        addq    $16, %rcx
        decq    %rsi
        jne     .LBB0_7
.LBB0_8:                                # %for.end
        popq    %rbx
        popq    %r12
        popq    %r14
        popq    %r15
        retq
.Ltmp13:
        .size   f, .Ltmp13-f
        .cfi_endproc
        .ident  "clang version 3.5 "
        .section        ".note.GNU-stack","",@progbits