[llvm-dev] SCEV and LoopStrengthReduction Formulae

Tue Apr 3 15:00:13 PDT 2018

I am attempting to implement a minor loop strength reduction optimization for
targets that support compare and jump fusion, specifically
TTI::canMacroFuseCmp().  My approach might be wrong; however, I am soliciting
the idea for feedback, so that I can implement this correctly.  My plan is to
add a Supplemental LSR formula to LoopStrengthReduce.cpp that optimizes  the
following case, but perhaps this should stand alone as  its own pass:

// Example which can be optimized via cmp/jmp fusion.
// clang -O3 -S test.c
extern void g(int);
void f(int *p, long long n) {
    do {
        g(*p++);
    } while (--n);
}

LLVM currently generates the following sequence for x86_64 targets:
LBB0_1:
	movl	(%r15,%rbx,4), %edi
	callq	g
	addq	$1, %rbx
	cmpq	%rbx, %r14
	jne	.LBB0_1

LLVM can perform compare-jump fusion, it already does in certain cases, but not
in the case above.  We can remove the cmp above if we were to perform
the following transformation:
1.0) Initialize the induction variable, %rbx, to be 'n' instead of zero.
1.1) Negate the induction variable, so that the loop increments from -n to zero.
2.0) Set the base pointer, %r15, to be the last address that will be accessed
       from 'p'
2.1) In the preheader of the loop set %r15 to be: %r15 = <base> + n * <ele size>
3.0) Remove the cmpq

We can remove the cmp, since we are now counting from -n to zero.  The
termination case for the loop will now be zero.  The add instruction will set
the zero-flag when %rbx reaches zero (target specific of course).

The pointer passed to g() will also be correct, since we access the same items
we normally would have, but the access is with respect to the base
being the last item and subtracting from the last item, via a negative induction
variable.  I imagine the result would look something like the following:
    movq   %rsi,   %rbx                   ; rbx becomes the value n
    leaq     (%rdi, %rbx, 4), %r15  ; <base> = p + (n * 4)
    negq    %rbx                              ; set rbx to be -n, so we count from -n to 0
LBB0_1:
    movl    (%r15,%rbx,4), %edi
    callq    g
    addq   $1, %rbx
    jne      .LBB0_1

I realize this is a micro-op saving a single cycle.  But this reduces the instruction count, one less
instr to decode in a potentially hot path. If this all makes sense, and seems like a reasonable addition
to llvm, would it make sense to implement this as a supplemental LSR formula, or as a separate pass?

-Matt