[LLVMdev] Thumb-2 code generation error in Apple LLVM at all optimization levels

Fri Nov 11 16:14:45 PST 2011

This would be best reported to Apple's Radar bug database at
http://bugreport.apple.com/ but its whole website has been down for a
while.

I have a 100% reproducible Thumb-2 code generation error that occurs
at all of the levels of optimization available in the Xcode 4.2 for
Snow Leopard build settings GUI: -O0, -O1, -O2, -O3 and -Os.

However the bad machine code only occurs in Release builds, never in
Debug builds!  I tried the Debug builds at all levels of optimization
as well.

   $ xcodebuild -version
   Xcode 4.2
   Build version 4C199

   $ /Developer/Platforms/iPhoneOS.platform/Developer/usr/bin/clang --version
   Apple clang version 3.0 (tags/Apple/clang-211.9) (based on LLVM 3.0svn)
   Target: i386-apple-darwin10.8.0
   Thread model: posix

I'm not real clear where to find the part of the toolchain that emits
the Thumb-2 assembly, so I can't tell you that tool's precise version.

   $ uname -a
   Darwin frylock.local 10.8.0 Darwin Kernel Version 10.8.0:
   Tue Jun  7 16:33:36 PDT 2011;
   root:xnu-1504.15.3~1/RELEASE_I386 i386

Xcode's iPhone and iPad Simulators run iOS Apps that on my 32-bit
MacBook Pro are built as i386 code.  The iOS frameworks (shared
libraries, sort of) that simulated Apps link to are actually shims
that interface to Mac OS X's frameworks.

The i386 code for my simulated App is generated correctly at -Os for
both Release and Debug builds.  That suggests that the problem is in
the Thumb-2 code generation back-end, and not in the LLVM IR.

I've seen lots of reports that the Thumb code that the Apple LLVM
compiler generates for ARMv6 is quite buggy, so that one must disable
Thumb code generation for ARMv6 targets.  However my first-generation
iPad has a Cortex-A8 CPU, which is ARMv7, as does my iPhone 4.

It's quite possible that disabling Thumb code generation for at least
this one source file will correct the bad machine code, but Google has
not blessed me with the insight as to how to do that.  It's not done
the same way for LLVM as for GCC.  Have any of you this insight to
spare?

It's going to take me a little while to cook up a minimal test case as
I was up all night <strike>trolling the Internet</strike> working on
my iOS App, so I'm pretty beat.  But when I have more details for you,
I will post a more detailed report as well as a minimal test case that
builds as a complete iOS App at what is now just a placeholder page:

   Apple Xcode 4.2 LLVM Compiler Bug Reports
   http://www.dulcineatech.com/bug-reports/xcode/4.2/llvm/

My App Warp Life is so named because it goes very, very fast, with
many more optimizations coming soon.  The UI has a speed control
slider whose value is scaled, then passed to the usleep() iOS system
call.  usleep() suspends the process for the given number of
microseconds.

I realized just recently that calling usleep with delays that
themselves are insignificant might actually slow my App down quite a
bit, because there is all manner of overhead to making and returning
from even the most trivial system calls.  After measuring my game's
frame rate at the best optimizations I could find, for various kinds
of test data, I set a threshold of 1/250th of a second.  I never call
usleep() if the configured delay setting is less than that.

The full source of the entire method, and the Release and Debug build
assembly codes are at the end of this mail.  For clarity I show only
the pertinent lines of code right here:

   useconds_t usecs = (useconds_t)( self.delay * (float)500000 );

   if ( usecs >= 4000 ){   // ~ 1/250 sec
      usleep( usecs );       // usecs is ZERO!!!!
   }

self.delay is an Objective-C 2.0 property that holds the current value
of the speed slider.  When set to maximum speed, usecs will always be
zero.  Even so, the branch is ALWAYS taken, despite the source code
ensuring that the branch is only taken when usecs is greater than or
equal to four thousand.
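
Spelled out in plain C, the intent of those lines is roughly the
following.  This is only a sketch: usecs_t, delay_to_usecs() and
should_sleep() are hypothetical names that do not appear in my App,
usecs_t is a stand-in for useconds_t that I am assuming to be a 32-bit
unsigned integer, and the delay argument stands in for the self.delay
property.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for useconds_t; assumed here to be 32-bit unsigned. */
typedef uint32_t usecs_t;

/* 1/250 of a second, expressed in microseconds. */
#define SLEEP_THRESHOLD_USECS (1000000u / 250u)

/* Hypothetical helper: scale the slider value the way the method does. */
static usecs_t delay_to_usecs(float delay)
{
    return (usecs_t)(delay * (float)500000);
}

/* Hypothetical helper: true only when the delay is worth a usleep() call. */
static bool should_sleep(usecs_t usecs)
{
    return usecs >= SLEEP_THRESHOLD_USECS;
}
```

With the slider at maximum speed the delay is zero, so
should_sleep( delay_to_usecs( 0.0f ) ) ought to be false and usleep()
ought never to be reached.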

Here is the Thumb-2 assembly for the Release build.

I think the (float)500000 delay scaling factor is meant to be held in
floating point register d8.  I thought at first it might not be
initialized at all, but upon closer examination I think it may
actually be initialized from a program counter-relative 32-bit .long
constant immediately following my method's code.
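
One way to check that guess is to reinterpret the .long value that
follows the method in the full listing at the end of this mail
(LCPI24_0, 1223959552) as a single-precision float.  The full listing
loads it with "vldr.32 s16, LCPI24_0", and s16 is the low half of d8.
The helper below is a small sketch of that check; bits_to_float() is a
hypothetical name, and it assumes that float is a 32-bit IEEE 754
value on the machine where it is run.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret a 32-bit pattern as a single-precision float.
   Assumes float is 32-bit IEEE 754. */
static float bits_to_float(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);   /* copy the bytes, no value conversion */
    return f;
}
```

bits_to_float( 1223959552u ) compares equal to 500000.0f, so the
constant after the method does appear to be the scaling factor.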

	.loc	1 388 3
	ldr	r0, [r5]
	ldr	r1, [r4, r0]
	adds	r1, #1
	str	r1, [r4, r0]
	.loc	1 390 64
	mov	r0, r4
	ldr	r1, [r6]
	blx	_objc_msgSend
	vmov	s0, r0
	vmul.f32	d0, d0, d8
	vcvt.u32.f32	d0, d0
	vmov	r0, s0
Ltmp272:
	.loc	1 392 9
	cmp.w	r0, #4000
Ltmp273:
	.loc	1 393 13
	it	hs
	blxhs	_usleep

cmp.w *looks* like a 16-bit comparison with an immediate constant, but
the .w suffix actually selects the wide, 32-bit Thumb-2 encoding, and
the constant is packed into a twelve-bit field.  The ARM and Thumb
instruction sets have quite severe restrictions on the allowed ranges
of immediate values, because so much else has to be encoded in each
instruction word that few bits are left over for the constant.  4000
does fit: it is 0xFA shifted left by four bits.
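
To make the restriction concrete, here is a sketch of a check for
whether a 32-bit value can be written as a Thumb-2 "modified
immediate", as I understand the rules in the ARM architecture manual:
either an 8-bit value, an 8-bit value repeated in one of three fixed
patterns, or an 8-bit value with its top bit set shifted left by 1 to
24 bits.  thumb2_modified_imm_ok() is a hypothetical name, and this is
my reading of the encoding rather than code taken from any assembler.

```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch: can v be expressed as a Thumb-2 modified immediate? */
static bool thumb2_modified_imm_ok(uint32_t v)
{
    uint32_t b;
    int shift;

    if (v <= 0xFFu)
        return true;                          /* 0x000000XY */

    b = v & 0xFFu;
    if (v == (b | (b << 16)))
        return true;                          /* 0x00XY00XY */
    if (v == (b | (b << 8) | (b << 16) | (b << 24)))
        return true;                          /* 0xXYXYXYXY */

    b = (v >> 8) & 0xFFu;
    if (v == ((b << 8) | (b << 24)))
        return true;                          /* 0xXY00XY00 */

    /* An 8-bit value with bit 7 set, shifted left by 1..24 bits. */
    for (shift = 1; shift <= 24; ++shift) {
        uint32_t unrotated = v >> shift;
        if (unrotated >= 0x80u && unrotated <= 0xFFu &&
            (unrotated << shift) == v)
            return true;
    }
    return false;
}
```

By this check 4000 is accepted, while 4001 and 500000 are not, which
would explain why the threshold can sit directly in the cmp.w
instruction.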

I don't know what the "it hs" instruction does.  I suspect that's
where the problem lies, but it is hard to search for: "it" is a very
common word, and "hs" is quite common as well, as it is a frequent
misspelling of "has".
Perhaps someone who knows Thumb-2 assembly better than I do could
comment.
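
For what it is worth, the ARM architecture manual describes "it" as
the Thumb-2 If-Then instruction, which makes the instruction(s)
following it conditional, and "hs" as the "unsigned higher or same"
condition, which is true when the carry flag is set after a compare.
Under that reading, the cmp.w / it hs / blxhs sequence should behave
like the C sketch below.  cmp_sets_carry() and hs_call_taken() are
hypothetical names, and this is only a model of the flags, not
something generated from the compiler.

```c
#include <stdbool.h>
#include <stdint.h>

/* Carry flag after "cmp rn, #imm": set when rn - imm does not borrow,
   that is, when rn >= imm as unsigned values. */
static bool cmp_sets_carry(uint32_t rn, uint32_t imm)
{
    return rn >= imm;
}

/* Model of:  cmp.w r0, #4000 ; it hs ; blxhs _usleep
   Returns true when the call to usleep() would be made. */
static bool hs_call_taken(uint32_t r0)
{
    return cmp_sets_carry(r0, 4000u);   /* HS means carry set */
}
```

If that model is right, hs_call_taken( 0 ) is false, so a zero in r0
should skip the call.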

The assembly for my Debug build is quite unlike that for the Release
build, for every single one of the available optimization levels.
There are quite a few instructions separating the load of the #4000
immediate into r0 and the call to usleep().

I have not yet ensured that there aren't build configuration
differences between my Debug and Release builds, but I don't recall
setting any.  My guess is that the totally different machine code in
Debug is there to make source code debugging work better.

Here is my method's full Objective-C source:

- (void) cycleContinuously
{
	startDate = [[NSDate alloc] init];
	generation = 0;

	while ( mRunning ){
		[self cycle];

		++generation;

		useconds_t usecs = (useconds_t)( self.delay * (float)500000 );

		if ( usecs >= 4000 ){   // ~ 1/250 sec
			usleep( usecs );
		}
	}

	NSDate *endDate = [[NSDate alloc] init];

	NSTimeInterval elapsed = [endDate timeIntervalSinceDate: startDate];

	[startDate release];
	[endDate release];

	printf( "Speed: %f gen/sec\n", ( (float)generation ) / elapsed );

	return;
}

The assembly for the problem area of my code is completely identical
for each available optimization setting for Release builds.  I haven't
made such detailed comparisons for the Debug builds yet.

Here is the Release assembly at -Os:

	.align	2
	.code	16
	.thumb_func	"-[LifeGrid cycleContinuously]"
"-[LifeGrid cycleContinuously]":
Ltmp265:
Lfunc_begin24:
	.loc	1 380 0
	.loc	1 380 1 prologue_end
	push	{r4, r5, r6, r7, lr}
	add	r7, sp, #12
	push.w	{r8, r10, r11}
	vpush	{d8}
	sub	sp, #4
	.loc	1 382 2
Ltmp266:
	movw	r1, :lower16:(L_OBJC_SELECTOR_REFERENCES_7-(LPC24_0+4))
Ltmp267:
	mov	r4, r0
Ltmp268:
	movt	r1, :upper16:(L_OBJC_SELECTOR_REFERENCES_7-(LPC24_0+4))
	movw	r0, :lower16:(L_OBJC_CLASSLIST_REFERENCES_$_62-(LPC24_1+4))
	movt	r0, :upper16:(L_OBJC_CLASSLIST_REFERENCES_$_62-(LPC24_1+4))
LPC24_0:
	add	r1, pc
LPC24_1:
	add	r0, pc
	ldr	r1, [r1]
	ldr	r0, [r0]
	blx	_objc_msgSend
	movw	r1, :lower16:(L_OBJC_SELECTOR_REFERENCES_-(LPC24_2+4))
	movt	r1, :upper16:(L_OBJC_SELECTOR_REFERENCES_-(LPC24_2+4))
LPC24_2:
	add	r1, pc
	ldr	r1, [r1]
	blx	_objc_msgSend
	movw	r11, :lower16:(_OBJC_IVAR_$_LifeGrid.startDate-(LPC24_3+4))
	movt	r11, :upper16:(_OBJC_IVAR_$_LifeGrid.startDate-(LPC24_3+4))
LPC24_3:
	add	r11, pc
	ldr.w	r1, [r11]
	.loc	1 383 2
	movw	r5, :lower16:(_OBJC_IVAR_$_LifeGrid.generation-(LPC24_4+4))
	movt	r5, :upper16:(_OBJC_IVAR_$_LifeGrid.generation-(LPC24_4+4))
LPC24_4:
	add	r5, pc
	.loc	1 382 2
	str	r0, [r4, r1]
	movs	r1, #0
	.loc	1 383 2
	ldr	r0, [r5]
	.loc	1 385 2
	movw	r8, :lower16:(_OBJC_IVAR_$_LifeGrid.mRunning-(LPC24_5+4))
	movt	r8, :upper16:(_OBJC_IVAR_$_LifeGrid.mRunning-(LPC24_5+4))
LPC24_5:
	add	r8, pc
	.loc	1 383 2
	str	r1, [r4, r0]
	.loc	1 385 2
	ldr.w	r0, [r8]
	ldrb	r0, [r4, r0]
	cbz	r0, LBB24_3
Ltmp269:
	.loc	1 386 3
	movw	r10, :lower16:(L_OBJC_SELECTOR_REFERENCES_59-(LPC24_6+4))
	vldr.32	s16, LCPI24_0
	movt	r10, :upper16:(L_OBJC_SELECTOR_REFERENCES_59-(LPC24_6+4))
	.loc	1 390 64
	movw	r6, :lower16:(L_OBJC_SELECTOR_REFERENCES_64-(LPC24_7+4))
	movt	r6, :upper16:(L_OBJC_SELECTOR_REFERENCES_64-(LPC24_7+4))
	.loc	1 386 3
LPC24_6:
	add	r10, pc
	.loc	1 390 64
LPC24_7:
	add	r6, pc
LBB24_2:
Ltmp270:
	.loc	1 386 3
	ldr.w	r1, [r10]
Ltmp271:
	mov	r0, r4
	blx	_objc_msgSend
	.loc	1 388 3
	ldr	r0, [r5]
	ldr	r1, [r4, r0]
	adds	r1, #1
	str	r1, [r4, r0]
	.loc	1 390 64
	mov	r0, r4
	ldr	r1, [r6]
	blx	_objc_msgSend
	vmov	s0, r0
	vmul.f32	d0, d0, d8
	vcvt.u32.f32	d0, d0
	vmov	r0, s0
Ltmp272:
	.loc	1 392 9
	cmp.w	r0, #4000
Ltmp273:
	.loc	1 393 13
	it	hs
	blxhs	_usleep
Ltmp274:
	.loc	1 385 2
	ldr.w	r0, [r8]
	ldrb	r0, [r4, r0]
	cmp	r0, #0
	bne	LBB24_2
LBB24_3:
Ltmp275:
	.loc	1 382 2
	movw	r0, :lower16:(L_OBJC_SELECTOR_REFERENCES_7-(LPC24_8+4))
	movt	r0, :upper16:(L_OBJC_SELECTOR_REFERENCES_7-(LPC24_8+4))
LPC24_8:
	add	r0, pc
	.loc	1 397 41
	ldr	r1, [r0]
Ltmp276:
	.loc	1 382 2
	movw	r0, :lower16:(L_OBJC_CLASSLIST_REFERENCES_$_62-(LPC24_9+4))
	movt	r0, :upper16:(L_OBJC_CLASSLIST_REFERENCES_$_62-(LPC24_9+4))
LPC24_9:
	add	r0, pc
	.loc	1 397 41
	ldr	r0, [r0]
	blx	_objc_msgSend
	.loc	1 382 2
	movw	r1, :lower16:(L_OBJC_SELECTOR_REFERENCES_-(LPC24_10+4))
	movt	r1, :upper16:(L_OBJC_SELECTOR_REFERENCES_-(LPC24_10+4))
LPC24_10:
	add	r1, pc
	.loc	1 397 41
	ldr	r1, [r1]
	blx	_objc_msgSend
	.loc	1 399 69
	movw	r1, :lower16:(L_OBJC_SELECTOR_REFERENCES_66-(LPC24_11+4))
	movt	r1, :upper16:(L_OBJC_SELECTOR_REFERENCES_66-(LPC24_11+4))
	.loc	1 397 41
	mov	r6, r0
	.loc	1 399 69
	ldr.w	r0, [r11]
LPC24_11:
	add	r1, pc
	ldr	r1, [r1]
	ldr	r2, [r4, r0]
	mov	r0, r6
	blx	_objc_msgSend
	str	r0, [sp]
	.loc	1 401 2
	movw	r8, :lower16:(L_OBJC_SELECTOR_REFERENCES_68-(LPC24_12+4))
	movt	r8, :upper16:(L_OBJC_SELECTOR_REFERENCES_68-(LPC24_12+4))
	ldr.w	r0, [r11]
LPC24_12:
	add	r8, pc
	.loc	1 399 69
	mov	r10, r1
	.loc	1 401 2
	ldr.w	r1, [r8]
	ldr	r0, [r4, r0]
	blx	_objc_msgSend
	.loc	1 402 2
	ldr.w	r1, [r8]
	mov	r0, r6
	blx	_objc_msgSend
	.loc	1 404 2
	ldr	r0, [r5]
	add	r0, r4
	vldr.32	s0, [r0]
	vcvt.f32.s32	d0, d0
	.loc	1 399 69
	ldr	r0, [sp]
	vmov	d17, r0, r10
Ltmp277:
	.loc	1 404 2
	movw	r0, :lower16:(L_.str69-(LPC24_13+4))
	movt	r0, :upper16:(L_.str69-(LPC24_13+4))
	vcvt.f64.f32	d16, s0
LPC24_13:
	add	r0, pc
	vdiv.f64	d16, d16, d17
	vmov	r1, r2, d16
	blx	_printf
Ltmp278:
	.loc	1 407 1
	add	sp, #4
	vpop	{d8}
	pop.w	{r8, r10, r11}
	pop	{r4, r5, r6, r7, pc}
Ltmp279:
	.align	2
LCPI24_0:
	.long	1223959552
Ltmp280:
Lfunc_end24:
Ltmp281:
Leh_func_end24:

Here is the Debug assembly at -Os:

	.align	2
	.code	16
	.thumb_func	"-[LifeGrid cycleContinuously]"
"-[LifeGrid cycleContinuously]":
Ltmp112:
Lfunc_begin24:
	.loc	1 380 0
	push	{r4, r7, lr}
	add	r7, sp, #4
	sub	sp, #44
	mov	r4, sp
	bic	r4, r4, #7
	mov	sp, r4
	movs	r2, #0
	movt	r2, #0
	str	r0, [sp, #40]
	str	r1, [sp, #36]
	.loc	1 382 2 prologue_end
Ltmp113:
	ldr.n	r0, LCPI24_4
LPC24_4:
	add	r0, pc
	ldr	r0, [r0]
	ldr.n	r1, LCPI24_3
LPC24_3:
	add	r1, pc
	ldr	r1, [r1]
	str	r2, [sp, #12]
	blx	_objc_msgSend
	ldr.n	r1, LCPI24_2
LPC24_2:
	add	r1, pc
	ldr	r1, [r1]
	blx	_objc_msgSend
	ldr	r1, [sp, #40]
	ldr.n	r2, LCPI24_1
LPC24_1:
	add	r2, pc
	ldr	r2, [r2]
	add	r1, r2
	str	r0, [r1]
	.loc	1 383 2
	ldr	r0, [sp, #40]
	ldr.n	r1, LCPI24_0
LPC24_0:
	add	r1, pc
	ldr	r1, [r1]
	add	r0, r1
	ldr	r1, [sp, #12]
	str	r1, [r0]
LBB24_1:
	.loc	1 385 2
	ldr	r0, [sp, #40]
	movw	r1, :lower16:(_OBJC_IVAR_$_LifeGrid.mRunning-(LPC24_14+4))
	movt	r1, :upper16:(_OBJC_IVAR_$_LifeGrid.mRunning-(LPC24_14+4))
LPC24_14:
	add	r1, pc
	ldr	r1, [r1]
	ldrb	r0, [r0, r1]
	movs	r1, #0
	cmp	r0, #0
	it	ne
	movne	r1, #1
	tst.w	r1, #1
	beq	LBB24_5
	movw	r0, #4000
	movt	r0, #0
	.loc	1 386 3
Ltmp114:
	ldr	r1, [sp, #40]
	movw	r2, :lower16:(L_OBJC_SELECTOR_REFERENCES_59-(LPC24_15+4))
	movt	r2, :upper16:(L_OBJC_SELECTOR_REFERENCES_59-(LPC24_15+4))
LPC24_15:
	add	r2, pc
	ldr	r2, [r2]
	str	r0, [sp, #8]
	mov	r0, r1
	mov	r1, r2
	blx	_objc_msgSend
	.loc	1 388 3
	ldr	r0, [sp, #40]
	movw	r1, :lower16:(_OBJC_IVAR_$_LifeGrid.generation-(LPC24_16+4))
	movt	r1, :upper16:(_OBJC_IVAR_$_LifeGrid.generation-(LPC24_16+4))
LPC24_16:
	add	r1, pc
	ldr	r1, [r1]
	mov	r2, r1
	ldr	r2, [r0, r2]
	adds	r2, #1
	str	r2, [r0, r1]
	.loc	1 390 64
	ldr	r0, [sp, #40]
Ltmp115:
	movw	r1, :lower16:(L_OBJC_SELECTOR_REFERENCES_64-(LPC24_17+4))
	movt	r1, :upper16:(L_OBJC_SELECTOR_REFERENCES_64-(LPC24_17+4))
LPC24_17:
	add	r1, pc
	ldr	r1, [r1]
	blx	_objc_msgSend
	vmov	s0, r0
	vmov.f64	d1, d16
	vldr.32	s1, LCPI24_14
	vmov.f64	d2, d1
	vmov.f32	s4, s1
	vmov.f64	d3, d1
	vmov.f32	s6, s0
	vmul.f32	d16, d3, d2
	vmov.f64	d2, d16
	vmov.f32	s0, s4
	vmov.f32	s2, s0
	vcvt.u32.f32	d16, d1
	vmov.f64	d1, d16
	vmov.f32	s0, s2
	vmov	r0, s0
	str	r0, [sp, #32]
	.loc	1 392 9
	ldr	r0, [sp, #32]
	ldr	r1, [sp, #8]
	cmp	r0, r1
	blo	LBB24_4
	.loc	1 393 13
Ltmp116:
	ldr	r0, [sp, #32]
	bl	_usleep
	str	r0, [sp, #4]
Ltmp117:
LBB24_4:
	.loc	1 395 2
	b	LBB24_1
Ltmp118:
LBB24_5:
	.loc	1 397 41
	ldr.n	r0, LCPI24_13
LPC24_13:
	add	r0, pc
	ldr	r0, [r0]
	ldr.n	r1, LCPI24_12
LPC24_12:
	add	r1, pc
	ldr	r1, [r1]
	blx	_objc_msgSend
	ldr.n	r1, LCPI24_11
LPC24_11:
	add	r1, pc
	ldr	r1, [r1]
	blx	_objc_msgSend
	str	r0, [sp, #28]
	.loc	1 399 69
	ldr	r0, [sp, #28]
	ldr	r1, [sp, #40]
	ldr.n	r2, LCPI24_10
LPC24_10:
	add	r2, pc
	ldr	r2, [r2]
	add	r1, r2
	ldr	r2, [r1]
	ldr.n	r1, LCPI24_9
LPC24_9:
	add	r1, pc
	ldr	r1, [r1]
	blx	_objc_msgSend
	vmov	d16, r0, r1
	vstr.64	d16, [sp, #16]
	.loc	1 401 2
	ldr	r0, [sp, #40]
	ldr.n	r1, LCPI24_8
LPC24_8:
	add	r1, pc
	ldr	r1, [r1]
	add	r0, r1
	ldr	r0, [r0]
	ldr.n	r1, LCPI24_7
LPC24_7:
	add	r1, pc
	ldr	r1, [r1]
	blx	_objc_msgSend
	.loc	1 402 2
	ldr	r0, [sp, #28]
	ldr.n	r1, LCPI24_6
LPC24_6:
	add	r1, pc
	ldr	r1, [r1]
	blx	_objc_msgSend
	.loc	1 404 2
	ldr	r0, [sp, #40]
	ldr.n	r1, LCPI24_5
LPC24_5:
	add	r1, pc
	ldr	r1, [r1]
	add	r0, r1
	ldr	r0, [r0]
	vmov	s0, r0
	vcvt.f32.s32	s0, s0
	vcvt.f64.f32	d16, s0
	vldr.64	d17, [sp, #16]
	vdiv.f64	d16, d16, d17
	vmov	r1, r2, d16
	movw	r0, :lower16:(L_.str69-(LPC24_18+4))
	movt	r0, :upper16:(L_.str69-(LPC24_18+4))
LPC24_18:
	add	r0, pc
	blx	_printf
	.loc	1 407 1
	str	r0, [sp]
	subs	r4, r7, #4
	mov	sp, r4
	pop	{r4, r7, pc}
	.align	2
LCPI24_0:
	.long	_OBJC_IVAR_$_LifeGrid.generation-(LPC24_0+4)
	.align	2
LCPI24_1:
	.long	_OBJC_IVAR_$_LifeGrid.startDate-(LPC24_1+4)
	.align	2
LCPI24_2:
	.long	L_OBJC_SELECTOR_REFERENCES_-(LPC24_2+4)
	.align	2
LCPI24_3:
	.long	L_OBJC_SELECTOR_REFERENCES_7-(LPC24_3+4)
	.align	2
LCPI24_4:
	.long	L_OBJC_CLASSLIST_REFERENCES_$_62-(LPC24_4+4)
	.align	2
LCPI24_5:
	.long	_OBJC_IVAR_$_LifeGrid.generation-(LPC24_5+4)
	.align	2
LCPI24_6:
	.long	L_OBJC_SELECTOR_REFERENCES_68-(LPC24_6+4)
	.align	2
LCPI24_7:
	.long	L_OBJC_SELECTOR_REFERENCES_68-(LPC24_7+4)
	.align	2
LCPI24_8:
	.long	_OBJC_IVAR_$_LifeGrid.startDate-(LPC24_8+4)
	.align	2
LCPI24_9:
	.long	L_OBJC_SELECTOR_REFERENCES_66-(LPC24_9+4)
	.align	2
LCPI24_10:
	.long	_OBJC_IVAR_$_LifeGrid.startDate-(LPC24_10+4)
	.align	2
LCPI24_11:
	.long	L_OBJC_SELECTOR_REFERENCES_-(LPC24_11+4)
	.align	2
LCPI24_12:
	.long	L_OBJC_SELECTOR_REFERENCES_7-(LPC24_12+4)
	.align	2
LCPI24_13:
	.long	L_OBJC_CLASSLIST_REFERENCES_$_62-(LPC24_13+4)
	.align	2
LCPI24_14:
	.long	1223959552
Ltmp119:
Lfunc_end24:
Ltmp120:
Leh_func_end24:

Man I gotta catch some ZZZs, I'm totally thrashed.  I'll do my best
just to take a little nap, but the chances are pretty good I won't get
outta bed until Monday!

-- 
Don Quixote de la Mancha
Dulcinea Technologies Corporation
Software of Elegance and Beauty
http://www.dulcineatech.com
quixote at dulcineatech.com