[llvm-dev] [hexagon][PowerPC] code regression (sub-optimal code) on LLVM 9 when generating hardware loops, and the "llvm.uadd" intrinsic.

Sun Jun 30 12:04:17 PDT 2019

Hi All,

The following code :

void hexagon2( int *a, int *res )
{
  int i = 100; 
  while ( i-- ) {
    *res++ = *a++;
  }
}

gets compiled as a sub-optimal Software loop on LLVM 9.0 instead of a Hardware loop, whereas it was compiled as a Hardware Loop in LLVM 7.0. 

This is the final assembly code generated by LLVM 9.0 :

	.text
	.file	"main.c"
	.globl	hexagon2                // -- Begin function hexagon2
	.p2align	2
	.type	hexagon2, at function
hexagon2:                               // @hexagon2
// %bb.0:                               // %entry.old
	{
		p0 = cmp.gtu(r0,r1); if (p0.new) jump:nt .LBB0_5
		r2 = r0
		allocframe(#0)
	}                               // encoding: [A,0x41'A',A,0x15'A',0x00,0x3c,0x02,0x70]
                                        //   fixup A - offset: 0, value: .LBB0_5, kind: fixup_Hexagon_B9_PCREL
// %bb.1:                               // %entry.old
	{
		r0 = sub(r1,r0)
	}                               // encoding: [0x00,0xc1,0x20,0xf3]
	{
		if (p0.new) jump:nt .LBB0_5
		p0 = cmp.gt(r0,#399)
	}                               // encoding: [A,0x48'A',A,0x5c'A',0xe0,0xf1,0x40,0x75]
                                        //   fixup A - offset: 0, value: .LBB0_5, kind: fixup_Hexagon_B15_PCREL
// %bb.2:
	{
		r0 = #-100
	}                               // encoding: [0x80,0xf3,0xdf,0x78]
.LBB0_3:                                // %while.body
                                        // =>This Inner Loop Header: Depth=1
	{
		r3 = add(r0,#1)
		r4 = memw(r2++#4)
		memw(r1++#4) = r4.new
	}                               // encoding: [0x23,0x40,0x00,0xb0,0x24,0x40,0x82,0x9b,0x08,0xd2,0xa1,0xab]
	{
		p0 = cmp.gtu(r0,r3); if (!p0.new) jump:t .LBB0_3
		r0 = r3
	}                               // encoding: [A,0x63'A',0x40'A',0x15'A',0x00,0xc0,0x63,0x70]
                                        //   fixup A - offset: 0, value: .LBB0_3, kind: fixup_Hexagon_B9_PCREL
// %bb.4:                               // %while.end
	{
		r31:30 = dealloc_return(r30):raw
	}                               // encoding: [0x1e,0xc0,0x1e,0x96]
.LBB0_5:                                // %while.body.rtli
	{
		call memmove
		r1:0 = combine(r2,r1)
		r2 = #400
	}                               // encoding: [A,0x40'A',A,0x5a'A',0x00,0x41,0x02,0xf5,0x02,0xf2,0x00,0x78]
                                        //   fixup A - offset: 0, value: memmove, kind: fixup_Hexagon_B22_PCREL
	{
		r31:30 = dealloc_return(r30):raw
	}                               // encoding: [0x1e,0xc0,0x1e,0x96]
.Lfunc_end0:
	.size	hexagon2, .Lfunc_end0-hexagon2
                                        // -- End function

This is the assembly code generated by LLVM 7.0 :

	.text
	.file	"main.c"
	.globl	hexagon2                // -- Begin function hexagon2
	.p2align	2
	.type	hexagon2, at function
hexagon2:                               // @hexagon2
// %bb.0:                               // %entry.old
	{
		p0 = cmp.gtu(r0,r1); if (p0.new) jump:nt .LBB0_5
		r2 = r0
		allocframe(#0)
	}                               // encoding: [A,0x41'A',A,0x15'A',0x00,0x3c,0x02,0x70]
                                        //   fixup A - offset: 0, value: .LBB0_5, kind: fixup_Hexagon_B9_PCREL
// %bb.1:                               // %entry.old
	{
		r0 = sub(r1,r0)
	}                               // encoding: [0x00,0xc1,0x20,0xf3]
	{
		if (p0.new) jump:nt .LBB0_5
		p0 = cmp.gt(r0,#399)
	}                               // encoding: [A,0x48'A',A,0x5c'A',0xe0,0xf1,0x40,0x75]
                                        //   fixup A - offset: 0, value: .LBB0_5, kind: fixup_Hexagon_B15_PCREL
// %bb.2:                               // %while.body.preheader
	{
		loop0(.LBB0_3,#100)
	}                               // encoding: [0x20'A',0xc0'A',0x03'A',0x69'A']
                                        //   fixup A - offset: 0, value: .LBB0_3, kind: fixup_Hexagon_B7_PCREL
.Ltmp0:                                 // Block address taken
.LBB0_3:                                // %while.body
                                        // =>This Inner Loop Header: Depth=1
	{
		r0 = memw(r2++#4)
		memw(r1++#4) = r0.new
	} :endloop0                     // encoding: [0x20,0x80,0x82,0x9b,0x08,0xd2,0xa1,0xab]
// %bb.4:                               // %while.end
	{
		r31:30 = dealloc_return(r30):raw
	}                               // encoding: [0x1e,0xc0,0x1e,0x96]
.LBB0_5:                                // %while.body.rtli
	{
		call memmove
		r1:0 = combine(r2,r1)
		r2 = #400
	}                               // encoding: [A,0x40'A',A,0x5a'A',0x00,0x41,0x02,0xf5,0x02,0xf2,0x00,0x78]
                                        //   fixup A - offset: 0, value: memmove, kind: fixup_Hexagon_B22_PCREL
	{
		r31:30 = dealloc_return(r30):raw
	}                               // encoding: [0x1e,0xc0,0x1e,0x96]
.Lfunc_end0:
	.size	hexagon2, .Lfunc_end0-hexagon2
                                        // -- End function

The code generated by LLVM 7.0 is better than LLVM 9.0 because 9.0 did not made use of Hardware loops.  This is in my opinion a bad regression from some earlier version. This is not an isolated case, more cases of the same LLVM 9 ‘defect’ are easy to find.

I have investigated the issue and I identified the root cause of it, which is related with the initial use of the “llvm.uadd" intrinsic in LLVM 9.0 to increment the loop Induction Variable, instead of an “add” instruction like LLVM 7.0. 

This is the while.body excerpt after "CodeGen Prepare” in LLVM 9.0

while.body:                                       ; preds = %entry.old, %while.body
  %lsr.iv = phi i32 [ %math, %while.body ], [ -100, %entry.old ]
  %res.addr.04 = phi i32* [ %cgep1, %while.body ], [ %res, %entry.old ]
  %a.addr.03 = phi i32* [ %cgep, %while.body ], [ %a, %entry.old ]
  %6 = load i32, i32* %a.addr.03, align 4, !tbaa !2
  store i32 %6, i32* %res.addr.04, align 4, !tbaa !2
  %7 = call { i32, i1 } @llvm.uadd.with.overflow.i32(i32 %lsr.iv, i32 1)
  %math = extractvalue { i32, i1 } %7, 0
  %ov = extractvalue { i32, i1 } %7, 1
  %cgep = getelementptr inbounds i32, i32* %a.addr.03, i32 1
  %cgep1 = getelementptr inbounds i32, i32* %res.addr.04, i32 1
  br i1 %ov, label %while.end, label %while.body

And this is the same excerpt on  LLVM 7.0

while.body:                                       ; preds = %while.body.preheader, %while.body
  %lsr.iv = phi i32 [ -100, %while.body.preheader ], [ %lsr.iv.next, %while.body ]
  %res.addr.04 = phi i32* [ %cgep1, %while.body ], [ %res, %while.body.preheader ]
  %a.addr.03 = phi i32* [ %cgep, %while.body ], [ %a, %while.body.preheader ]
  %6 = load i32, i32* %a.addr.03, align 4, !tbaa !2
  store i32 %6, i32* %res.addr.04, align 4, !tbaa !2
  %lsr.iv.next = add nsw i32 %lsr.iv, 1
  %tobool = icmp eq i32 %lsr.iv.next, 0
  %cgep = getelementptr inbounds i32, i32* %a.addr.03, i32 1
  %cgep1 = getelementptr inbounds i32, i32* %res.addr.04, i32 1
  br i1 %tobool, label %while.end, label %while.body

LLVM 9 uses “llvm.uadd”. This finally prevents the “Hexagon Hardware Loops” pass to recognise a hardware loop pattern, resulting in sub-optimal code, specially compared with what LLVM 7.0 produces.

The code (excerpt) just before the Hexagon Hardware Loops pass on LLVM 9 is this:

bb.5:
; predecessors: %bb.1
  successors: %bb.3(0x80000000); %bb.3(100.00%)

  %8:intregs = A2_tfrsi -100
  J2_jump %bb.3, implicit-def $pc

bb.3.while.body:
; predecessors: %bb.3, %bb.5
  successors: %bb.4(0x04000000), %bb.3(0x7c000000); %bb.4(3.12%), %bb.3(96.88%)

  %0:intregs = PHI %8:intregs, %bb.5, %3:intregs, %bb.3
  %1:intregs = PHI %7:intregs, %bb.5, %5:intregs, %bb.3
  %2:intregs = PHI %6:intregs, %bb.5, %4:intregs, %bb.3
  %13:intregs, %4:intregs = L2_loadri_pi %2:intregs(tied-def 1), 4 :: (load 4 from %ir.a.addr.03, !tbaa !2)
  %5:intregs = S2_storeri_pi %1:intregs(tied-def 0), 4, %13:intregs :: (store 4 into %ir.res.addr.04, !tbaa !2)
  %3:intregs = A2_addi %0:intregs, 1
  %14:predregs = C2_cmpgtu %0:intregs, %3:intregs
  J2_jumpf %14:predregs, %bb.3, implicit-def dead $pc
  J2_jump %bb.4, implicit-def dead $pc

The same code on LLVM 7 is this:

bb.2.while.body.preheader:
; predecessors: %bb.1
  successors: %bb.4(0x80000000); %bb.4(200.00%)

  %11:intregs = A2_tfrsi -100
  J2_jump %bb.4, implicit-def dead $pc

bb.4.while.body:
; predecessors: %bb.2, %bb.4
  successors: %bb.5(0x04000000), %bb.4(0x7c000000); %bb.5(3.12%), %bb.4(96.88%)

  %0:intregs = PHI %11:intregs, %bb.2, %3:intregs, %bb.4
  %1:intregs = PHI %7:intregs, %bb.2, %5:intregs, %bb.4
  %2:intregs = PHI %6:intregs, %bb.2, %4:intregs, %bb.4
  %12:intregs, %4:intregs = L2_loadri_pi %2:intregs, 4 :: (load 4 from %ir.a.addr.03, !tbaa !2)
  %5:intregs = S2_storeri_pi %1:intregs, 4, %12:intregs :: (store 4 into %ir.res.addr.04, !tbaa !2)
  %3:intregs = A2_addi %0:intregs, 1
  %13:predregs = C2_cmpeqi %3:intregs, 0
  J2_jumpf %13:predregs, %bb.4, implicit-def dead $pc
  J2_jump %bb.5, implicit-def dead $pc

The differences above allow LLVM 7 to turn %13, %3, %11 into a hardware Loop as shown in the assembly code earlier in this message. However, LLVM 9 can’t identify a Hardware loop pattern due to the odd C2_cmpgtu instruction that gets generated. This instruction is a consequence of the introduction of the “llvm.addu” intrinsic that I showed earlier.

I am presenting here the case of Hexagon, but I suspect the same sub-optimal code may happen for the PowerPC (not checked yet).

All the code excerpts were obtained with -Os flags

I’m highly interested in getting in contact with someone familiar with the Hexagon/PowerPC targets regarding this subject. Any pointers to the right persons would be appreciated. My interest comes from the fact that I am proposing an improvement on the LSR pass that affects all targets and I need hardware loops to be properly generated in LLVM 9 like they used to be in LLVM 7. 

Thanks

John.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20190630/508bac91/attachment.html>