[PATCH] D21560: Relax the clearance calculating for breaking partial register dependency.

Tue Jun 21 13:03:53 PDT 2016

danielcdh added a comment.

Here is a test case extracted from an internal benchmark; the benchmark runs 2% faster on Sandy Bridge when the clearance threshold is changed from 16 to 64. In this extracted test case, with the threshold at 16, the cvtsi2ss for "g" does not get an xor inserted to break the dependency. As a result, the inner loop takes 13 cycles, compared with 11 cycles when the dependency is broken for all of "r", "g" and "b".

int datas[10000];
int datad[10000];
void foo(float r, float g, float b, int s, int d, int h, int w, int *datas, int *datad) __attribute__ ((noinline));
void foo(float r, float g, float b, int s, int d, int h, int w, int *datas, int *datad) {
  int i, j;
  for (i = 0; i < h; i++) {
    int *lines = datas + i * s;
    int *lined = datad + i * d;
    for (j = 0; j < w; j++) {
      int word = *(lines + j);
      int val = (int)(r * ((word >> 8) & 0xff) +
                      g * ((word >> 16) & 0xff) +
                      b * ((word >> 24) & 0xff) + 0.5);
      *((char *)(lined) + j) = val;
    }
  }
}

int main() {
  for (int i = 0; i < 100000; i++) {
    foo(2.0, 3.0, 4.0, 100, 100, 100, 100, datas, datad);
  }
  return 0;
}
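
To make the role of the threshold concrete, here is a minimal sketch of the decision as I understand it. It is a simplified model, not the code in the patch: the function names clearance_of and needs_dep_break are hypothetical, and it assumes that clearance is the distance in instructions from the last write of a register to the instruction that partially updates it, and that an xor is inserted when that distance is below the threshold.

```c
#include <assert.h>

/* Hypothetical model, not LLVM's code. Clearance is taken to be the number
   of instructions between the last write to a register and the current
   instruction that only partially updates it. */
int clearance_of(int cur_index, int last_def_index) {
    assert(cur_index >= last_def_index);
    return cur_index - last_def_index;
}

/* Returns 1 when the model would insert a dependency-breaking xor, that is,
   when the previous write is closer than the threshold. Returns 0 when the
   previous write is considered far enough away to be ignored. */
int needs_dep_break(int clearance, int threshold) {
    return clearance < threshold;
}
```

With an assumed clearance of 20 (a made-up value, not measured from the test case), needs_dep_break(20, 16) is 0 and needs_dep_break(20, 64) is 1. That is the kind of case where raising the threshold from 16 to 64 changes whether the xor is inserted.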

I don't have public benchmark results for this change yet. On internal benchmarks, no noticeable code size change has been observed. The change gives a 2% speedup on the benchmark that motivated this patch and has no performance impact on any other internal benchmark.

http://reviews.llvm.org/D21560