[llvm-dev] SCEV and LoopStrengthReduction Formulae
via llvm-dev
llvm-dev at lists.llvm.org
Sat Apr 7 08:22:16 PDT 2018
>
> I realize this is a micro-op saving a single cycle. But this reduces the instruction count, one less
> instr to decode in a potentially hot path. If this all makes sense, and seems like a reasonable addition
> to llvm, would it make sense to implement this as a supplemental LSR formula, or as a separate pass?
This seems reasonable to me, so long as rbx has no other uses that would complicate the problem. I’m not sure how often this occurs in hot code (a loop with an induction variable that isn’t otherwise used in the loop), but if it does, I don’t see why not.
As a side note, in a past life, when I used to do x86 SIMD optimization for a living, I did similar tricks pretty much everywhere in DSP functions. It’d be pretty nice if the compiler could do it too.
There is one alternate approach that I recall, which looks like this:
Original code (example, pseudocode):
    int add_delta_256(uint8 *in1, uint8 *in2) {
        int accum = 0;
        for (int i = 0; i < 16; ++i) {
            uint8x16 a = load16(in1 + i * 16); // NOTE: needs an extra address computation; x86 addressing modes can't scale by 16
            uint8x16 b = load16(in2 + i * 16); // NOTE: same here
            accum += psadbw(a, b);
        }
        return accum;
    }
end of loop:
    inc i
    cmp i, 16
    jl loop
LSR’d code:
    int add_delta_256(uint8 *in1, uint8 *in2) {
        int accum = 0;
        for (int i = 0; i < 16; ++i, in1 += 16, in2 += 16) {
            uint8x16 a = load16(in1);
            uint8x16 b = load16(in2);
            accum += psadbw(a, b);
        }
        return accum;
    }
end of loop:
    add in1, 16
    add in2, 16
    inc i
    cmp i, 16
    jl loop
Your code:
    int add_delta_256(uint8 *in1, uint8 *in2) {
        int accum = 0;
        for (int i = -16; i < 0; ++i, in1 += 16, in2 += 16) {
            uint8x16 a = load16(in1);
            uint8x16 b = load16(in2);
            accum += psadbw(a, b);
        }
        return accum;
    }
end of loop:
    add in1, 16
    add in2, 16
    inc i
    jl loop
Ideal code:
    int add_delta_256(uint8 *in1, uint8 *in2) {
        int accum = 0;
        in1 += 256;  // bias both pointers to the end of the buffers
        in2 += 256;
        for (int i = -256; i < 0; i += 16) {  // i is a negative byte offset, stepping one 16-byte vector at a time
            uint8x16 a = load16(in1 + i);
            uint8x16 b = load16(in2 + i);
            accum += psadbw(a, b);
        }
        return accum;
    }
end of loop:
    add i, 16
    jl loop
I don’t know, however, if it’s reasonable to teach the compiler to do the clever nonsense necessary to do the last one (requires enough understanding of x86 addressing modes, for one).
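For anyone who wants to convince themselves that the last form computes the same thing as the first, here is a minimal scalar C sketch. The names sad16, sad256_counted and sad256_negative_index are made up for illustration, and sad16 is a plain-C stand-in for a 16-byte load plus psadbw, so this only checks the indexing, not the codegen. Whether a given compiler keeps the negative-offset form in the emitted loop is not something this sketch guarantees.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

enum { BLOCK = 16, TOTAL = 256 };

/* Scalar stand-in for psadbw over one 16-byte block (assumption: real code
   would use a SIMD load and the psadbw instruction instead). */
static int sad16(const uint8_t *a, const uint8_t *b) {
    int sum = 0;
    for (int k = 0; k < BLOCK; ++k)
        sum += abs((int)a[k] - (int)b[k]);
    return sum;
}

/* Original form: the index counts up from 0 and is compared against the
   trip count, so the loop needs an explicit compare. */
static int sad256_counted(const uint8_t *in1, const uint8_t *in2) {
    int accum = 0;
    for (int i = 0; i < TOTAL / BLOCK; ++i)
        accum += sad16(in1 + i * BLOCK, in2 + i * BLOCK);
    return accum;
}

/* "Ideal" form: both pointers are biased to one past the end, and a single
   negative byte offset counts up toward zero. The loop exit is a sign test
   on the offset, and the offset also serves as the address index. */
static int sad256_negative_index(const uint8_t *in1, const uint8_t *in2) {
    int accum = 0;
    const uint8_t *end1 = in1 + TOTAL;
    const uint8_t *end2 = in2 + TOTAL;
    for (ptrdiff_t i = -TOTAL; i < 0; i += BLOCK)
        accum += sad16(end1 + i, end2 + i);
    return accum;
}
```

Calling both functions on the same pair of 256-byte buffers should give identical sums; the only difference is how the addresses and the exit condition are formed.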
—escha