<html>
    <head>
      <base href="http://llvm.org/bugs/" />
    </head>
    <body><table border="1" cellspacing="0" cellpadding="8">
        <tr>
          <th>Bug ID</th>
          <td><a class="bz_bug_link 
          bz_status_NEW "
   title="NEW --- - Improve instruction selection for addition reductions"
   href="http://llvm.org/bugs/show_bug.cgi?id=21371">21371</a>
          </td>
        </tr>

        <tr>
          <th>Summary</th>
          <td>Improve instruction selection for addition reductions
          </td>
        </tr>

        <tr>
          <th>Product</th>
          <td>libraries
          </td>
        </tr>

        <tr>
          <th>Version</th>
          <td>trunk
          </td>
        </tr>

        <tr>
          <th>Hardware</th>
          <td>PC
          </td>
        </tr>

        <tr>
          <th>OS</th>
          <td>All
          </td>
        </tr>

        <tr>
          <th>Status</th>
          <td>NEW
          </td>
        </tr>

        <tr>
          <th>Severity</th>
          <td>normal
          </td>
        </tr>

        <tr>
          <th>Priority</th>
          <td>P
          </td>
        </tr>

        <tr>
          <th>Component</th>
          <td>Backend: AArch64
          </td>
        </tr>

        <tr>
          <th>Assignee</th>
          <td>unassignedbugs@nondot.org
          </td>
        </tr>

        <tr>
          <th>Reporter</th>
          <td>james.molloy@arm.com
          </td>
        </tr>

        <tr>
          <th>CC</th>
          <td>llvmbugs@cs.uiuc.edu
          </td>
        </tr>

        <tr>
          <th>Classification</th>
          <td>Unclassified
          </td>
        </tr></table>
      <p>
        <div>
        <pre>When doing a simple "add" reduction, the code generated to do the final
"reduce" step is sub-optimal. The cleanup code needs to add all elements of a
vector together and return the result.

The following code:

int arr[5000];

int f() {
  int a = 0;
  for (int i = 0; i < 5000; ++i)
    a += arr[i];
  return a;
}

Generates the following reduction cleanup (this is outside the loop):

// BB#2:                                // %middle.block
    add    v0.4s, v1.4s, v0.4s         // add two reduction vectors together
(UF=2)
    ext    v1.16b, v0.16b, v0.16b, #8  // v1 = high half of $v0 (high half of
v1 and v0 unused after this point)
    add    v0.4s, v0.4s, v1.4s         // v0 = high half of $v0 + low half of
$v0
    dup    v1.4s, v0.s[1]              // v1[0] = 1'st element of v0
    add    v0.4s, v0.4s, v1.4s         // v0[0] = zero'th element of v0 + 1'st
element of v0
    fmov    w0, s0                      // Swap back to GPRs.
    ret

Note that the first ADD is adding two reduction vectors together, as the unroll
factor was 2. I've used the notation "$v0" to refer to the value of "v0" at the
beginning of the sequence.

This could be simplified to:
// BB#2:                                // %middle.block
    add    v0.4s, v0.4s, v1.4s         // Add two reduction vectors together
(UF=2)
    addv    s0, v0.4s                   // Cross-lane reduce.
        fmov    w0, s0                      // Swap back to GPRs
    ret

This reduces 4 instructions to 1, and would have a bigger impact when reducing
sub-integer vectors (vectors with more than 4 elements).
Other instructions that could be useful for reductions:
  * ADDV: integer add reduce across vector.
  * FMAXNMV: Floating point maxnum across vector.
  * FMAXV: Floating point maximum across vector.
  * FMINNMV: Floating point minnum across vector.
  * FMINV: Floating point minimum across vector.</pre>
        </div>
      </p>
      <hr>
      <span>You are receiving this mail because:</span>
      
      <ul>
          <li>You are on the CC list for the bug.</li>
      </ul>
    </body>
</html>