[llvm-dev] SCEV and LoopStrengthReduction Formulae

Sat Apr 7 08:22:16 PDT 2018

> 
> I realize this is a micro-op saving a single cycle.  But this reduces the instruction count, one less
> instr to decode in a potentially hot path. If this all makes sense, and seems like a reasonable addition
> to llvm, would it make sense to implement this as a supplemental LSR formula, or as a separate pass?

This seems reasonable to me so long as rbx has no other uses that would complicate the problem; I’m not sure how much this occurs in hot code (a loop with an induction variable that isn’t used in the loop), but if it does, I don’t see why not.

As a side note, in a past life, when I used to do x86 SIMD optimization for a living, I did similar tricks pretty much everywhere in DSP functions. It’d be pretty nice if the compiler could do it too.

There is one alternate approach that I recall, which looks like this:

Original code (example, pseudocode):

int add_delta_256(uint8 *in1, uint8 *in2) {
  int accum = 0;
  for (int i = 0; i < 16; ++i) {
   uint8x16 a = load16(in1 + i * 16); // NOTE: needs an extra addressing op; x86 index scaling only goes up to 8
   uint8x16 b = load16(in2 + i * 16); // NOTE: needs an extra addressing op; x86 index scaling only goes up to 8
   accum += psadbw(a, b);
  }
  return accum;
}

end of loop:
inc i
cmp i, 16
jl loop

LSR’d code:

int add_delta_256(uint8 *in1, uint8 *in2) {
  int accum = 0;
  for (int i = 0; i < 16; ++i, in1 += 16, in2 += 16) {
   uint8x16 a = load16(in1);
   uint8x16 b = load16(in2);
   accum += psadbw(a, b);
  }
  return accum;
}

end of loop:
add in1, 16
add in2, 16
inc i
cmp i, 16
jl loop

Your code:

int add_delta_256(uint8 *in1, uint8 *in2) {
  int accum = 0;
  for (int i = -16; i < 0; ++i, in1 += 16, in2 += 16) {
   uint8x16 a = load16(in1);
   uint8x16 b = load16(in2);
   accum += psadbw(a, b);
  }
  return accum;
}

end of loop:
add in1, 16
add in2, 16
inc i
jl loop

Ideal code:

int add_delta_256(uint8 *in1, uint8 *in2) {
  int accum = 0;
  in1 += 256;
  in2 += 256;
  for (int i = -256; i < 0; i += 16) {
   uint8x16 a = load16(in1 + i);
   uint8x16 b = load16(in2 + i);
   accum += psadbw(a, b);
  }
  return accum;
}

end of loop:
add i, 16
jl loop

I don’t know, however, if it’s reasonable to teach the compiler to do the clever nonsense necessary to do the last one (requires enough understanding of x86 addressing modes, for one).
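
For anyone who wants to poke at it, here is a minimal sketch in plain C that checks the last form (pointers pre-advanced past the end, negative byte offset counting up to zero) against the original indexed form. It assumes both buffers are exactly 256 bytes. sad16 is a hypothetical scalar stand-in for load16 + psadbw, and the _indexed / _negidx suffixes are names I made up for the example; none of this is what LSR would actually emit.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical scalar stand-in for load16 + psadbw:
   sum of absolute differences over one 16-byte block. */
static int sad16(const uint8_t *a, const uint8_t *b) {
    int sum = 0;
    for (int k = 0; k < 16; ++k)
        sum += abs((int)a[k] - (int)b[k]);
    return sum;
}

/* Original form: block index counts up from 0 and is scaled by 16. */
static int add_delta_256_indexed(const uint8_t *in1, const uint8_t *in2) {
    int accum = 0;
    for (int i = 0; i < 16; ++i)
        accum += sad16(in1 + i * 16, in2 + i * 16);
    return accum;
}

/* Negative-offset form: both pointers are moved one past the end
   (valid in C), and a single byte offset runs from -256 up to 0,
   so the loop exit test is just "did the offset reach zero". */
static int add_delta_256_negidx(const uint8_t *in1, const uint8_t *in2) {
    int accum = 0;
    in1 += 256;
    in2 += 256;
    for (ptrdiff_t i = -256; i < 0; i += 16)
        accum += sad16(in1 + i, in2 + i);
    return accum;
}
```

Both functions should return the same value for any pair of 256-byte buffers; whether a compiler turns the second one into the two-instruction loop tail above is a separate question.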

—escha