[PATCH] Optimize unrolled reductions in LoopStrengthReduce

Mon Jan 26 15:08:06 PST 2015

In http://reviews.llvm.org/D7128#112583, @hfinkel wrote:

> I agree, this needs a register-pressure threshold. Also, I thought that the loop vectorizer would also perform this transformation as part of its interleaved unrolling capability. Does it not? If not, perhaps it really belongs there (and the vectorizer already has register pressure heuristics)?

Hi Hal,

The loop vectorizer does indeed perform a similar transformation, but it does not break the dependencies between the (already) unrolled iterations of a loop. For instance, consider the following:

  // Original loop.
  for (int i = 0; i < n; i++)
      for (int j = 0; j < 3; j++)
          r += arr[i][j];

  // After unrolling pass.
  for (int i = 0; i < n; i++) {
      r += arr[i][0];
      r += arr[i][1];
      r += arr[i][2];
  }

  // After vectorization pass (n assumed even here; remainder loop omitted).
  for (int i = 0; i < n; i += 2) {
      r += arr[i][0];
      r_0 += arr[i+1][0];
      r += arr[i][1];
      r_0 += arr[i+1][1];
      r += arr[i][2];
      r_0 += arr[i+1][2];
  }
  r += r_0;

  // After strength reduction pass with changes.
  for (int i = 0; i < n; i += 2) {
      r += arr[i][0];
      r_0 += arr[i+1][0];
      r_1 += arr[i][1];
      r_2 += arr[i+1][1];
      r_3 += arr[i][2];
      r_4 += arr[i+1][2];
  }
  r += r_0 + r_1 + r_2 + r_3 + r_4;

The interleaved unrolling in the loop vectorizer seems to be applied on top of the earlier unrolling pass. After vectorization there are two separate dependency chains, but the code runs faster on POWER8 with three chains (and potentially even faster with up to six). By breaking the dependencies later, in strength reduction (while checking register pressure), we can achieve better performance. It's not clear to me whether the loop vectorizer can be changed to get this behavior; I'll have to investigate.
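
To make the chain counting concrete, here is a minimal standalone C sketch of the source-level effect. It is not the LSR patch itself: the function names sum_single_chain and sum_six_chains are made up for illustration, and the odd-n remainder handling is my own assumption about how a complete version would look. I use integer types so that splitting the accumulator is exact; with floating point the same split changes rounding, so I would expect it to be legal only when reassociation is allowed (e.g. under fast-math).

```c
#include <stddef.h>

/* Baseline: a single accumulator, so each add depends on the previous one. */
long sum_single_chain(const int arr[][3], size_t n) {
    long r = 0;
    for (size_t i = 0; i < n; i++) {
        r += arr[i][0];
        r += arr[i][1];
        r += arr[i][2];
    }
    return r;
}

/* Illustrative only: six independent accumulators, one per add in the
 * interleaved loop body, so no add in an iteration waits on another add
 * from the same iteration. */
long sum_six_chains(const int arr[][3], size_t n) {
    long r = 0, r_0 = 0, r_1 = 0, r_2 = 0, r_3 = 0, r_4 = 0;
    size_t i = 0;

    /* Main loop: two rows per iteration, six separate dependency chains. */
    for (; i + 1 < n; i += 2) {
        r   += arr[i][0];
        r_0 += arr[i + 1][0];
        r_1 += arr[i][1];
        r_2 += arr[i + 1][1];
        r_3 += arr[i][2];
        r_4 += arr[i + 1][2];
    }

    /* Remainder (assumption: n may be odd): fold in the last row. */
    if (i < n) {
        r   += arr[i][0];
        r_1 += arr[i][1];
        r_3 += arr[i][2];
    }

    /* Final reduction of the partial sums. */
    return r + r_0 + r_1 + r_2 + r_3 + r_4;
}
```

Both functions return the same value for any n. The second one simply trades five extra live registers for shorter dependency chains, which is why the register-pressure check matters.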

Thanks,

Olivier

http://reviews.llvm.org/D7128
