[llvm-dev] SCEV and LoopStrengthReduction Formulae

Wed Apr 4 11:06:41 PDT 2018

> From: Gopalasubramanian, Ganesh <Ganesh.Gopalasubramanian at amd.com>
> Sent: Wednesday, April 4, 2018 1:11 AM
> To: Davis, Matthew <Matthew.Davis at sony.com>
> 
> > 	cmpq	%rbx, %r14
> > 	jne	.LBB0_1
> >
> > LLVM can perform compare-jump fusion, it already does in certain
> > cases, but not in the case above.  We can remove the cmp above if we
> > were to perform the following transformation:
> 
> Do you mean branch-fusion (https://en.wikichip.org/wiki/macro-
> operation_fusion)?

Thanks for your reply.  

After looking at your link and the x86 instruction manual, Intel Architecture
Optimization manual (3.4.2.2), I realized that the fusion is performed
automatically by the hardware, on the cmp+jne instructions. What I  am trying to
achieve is the removal of the cmp.  Which would result with one less instruction 
in the add/cmp/jmp sequence.  Since 'add' sets the zero flag and the jmp checks
that flag.  So the change would eliminate the fusible cmp+jmp, resulting
in just add+jmp, in this case. 

On Sandy Bridge architectures, which introduce the capability of fusing ADD (and
a few other instructions), this proposed optimization would remove the cmp from
the add/cmp/jmp sequence, leaving just the add+jmp, which is fusible. This cmp 
removal would eliminate a single fetch-decode, in a potentially hot path (loop).
The benefit we get from the cmp removal is one less instruction to decode, and
less pressure on the decoder if the decoded instruction cache were filled-up or 
flushed.

> Is there any more limitation why these two or not fused?
Excuse my confusion here.  The hardware might have been fusing the cmp+jmp
I was hoping to just remove the cmp and avoid the instruction entirely.

-Matt

> 
> > -----Original Message-----
> > From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On Behalf Of
> > via llvm- dev
> > Sent: Wednesday, April 4, 2018 3:30 AM
> > To: llvm-dev at lists.llvm.org
> > Subject: [llvm-dev] SCEV and LoopStrengthReduction Formulae
> >
> > I am attempting to implement a minor loop strength reduction
> > optimization for targets that support compare and jump fusion,
> > specifically TTI::canMacroFuseCmp().  My approach might be wrong;
> > however, I am soliciting the idea for feedback, so that I can
> > implement this correctly.  My plan is to add a Supplemental LSR
> > formula to LoopStrengthReduce.cpp that optimizes the following case, but
> perhaps this should stand alone as  its own pass:
> >
> > // Example which can be optimized via cmp/jmp fusion.
> > // clang -O3 -S test.c
> > extern void g(int);
> > void f(int *p, long long n) {
> >     do {
> >         g(*p++);
> >     } while (--n);
> > }
> >
> > LLVM currently generates the following sequence for x86_64 targets:
> > LBB0_1:
> > 	movl	(%r15,%rbx,4), %edi
> > 	callq	g
> > 	addq	$1, %rbx
> > 	cmpq	%rbx, %r14
> > 	jne	.LBB0_1
> >
> > LLVM can perform compare-jump fusion, it already does in certain
> > cases, but not in the case above.  We can remove the cmp above if we
> > were to perform the following transformation:
> > 1.0) Initialize the induction variable, %rbx, to be 'n' instead of zero.
> > 1.1) Negate the induction variable, so that the loop increments from -n to zero.
> > 2.0) Set the base pointer, %r15, to be the last address that will be accessed
> >        from 'p'
> > 2.1) In the preheader of the loop set %r15 to be: %r15 = <base> + n *
> > <ele size>
> > 3.0) Remove the cmpq
> >
> > We can remove the cmp, since we are now counting from -n to zero.  The
> > termination case for the loop will now be zero.  The add instruction
> > will set the zero-flag when %rbx reaches zero (target specific of course).
> >
> > The pointer passed to g() will also be correct, since we access the
> > same items we normally would have, but the access is with respect to
> > the base being the last item and subtracting from the last item, via a
> > negative induction variable.  I imagine the result would look something like the
> following:
> >     movq   %rsi,   %rbx                   ; rbx becomes the value n
> >     leaq     (%rdi, %rbx, 4), %r15  ; <base> = p + (n * 4)
> >     negq    %rbx                              ; set rbx to be -n, so we count from -n to 0
> > LBB0_1:
> >     movl    (%r15,%rbx,4), %edi
> >     callq    g
> >     addq   $1, %rbx
> >     jne      .LBB0_1
> >
> > I realize this is a micro-op saving a single cycle.  But this reduces
> > the instruction count, one less instr to decode in a potentially hot
> > path. If this all makes sense, and seems like a reasonable addition to
> > llvm, would it make sense to implement this as a supplemental LSR formula, or
> as a separate pass?
> >
> > -Matt
> > _______________________________________________
> > LLVM Developers mailing list
> > llvm-dev at lists.llvm.org
> > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev