<html>

    <head>

      <base href="https://bugs.llvm.org/">

    </head>

    <body><table border="1" cellspacing="0" cellpadding="8">

        <tr>

          <th>Bug ID</th>

          <td><a class="bz_bug_link 

          bz_status_NEW "

   title="NEW - Excessive loop unrolling"

   href="https://bugs.llvm.org/show_bug.cgi?id=42987">42987</a>

          </td>

        </tr>

        <tr>

          <th>Summary</th>

          <td>Excessive loop unrolling

          </td>

        </tr>

        <tr>

          <th>Product</th>

          <td>clang

          </td>

        </tr>

        <tr>

          <th>Version</th>

          <td>6.0

          </td>

        </tr>

        <tr>

          <th>Hardware</th>

          <td>PC

          </td>

        </tr>

        <tr>

          <th>OS</th>

          <td>All

          </td>

        </tr>

        <tr>

          <th>Status</th>

          <td>NEW

          </td>

        </tr>

        <tr>

          <th>Severity</th>

          <td>enhancement

          </td>

        </tr>

        <tr>

          <th>Priority</th>

          <td>P

          </td>

        </tr>

        <tr>

          <th>Component</th>

          <td>C++

          </td>

        </tr>

        <tr>

          <th>Assignee</th>

          <td>unassignedclangbugs@nondot.org

          </td>

        </tr>

        <tr>

          <th>Reporter</th>

          <td>agner@agner.org

          </td>

        </tr>

        <tr>

          <th>CC</th>

          <td>blitzrakete@gmail.com, dgregor@apple.com, erik.pilkington@gmail.com, llvm-bugs@lists.llvm.org, richard-llvm@metafoo.co.uk

          </td>

        </tr></table>

      <p>

        <div>

        <pre>I think we need to re-evaluate the advantages and disadvantages of loop
unrolling.

Clang often unrolls loops excessively in cases where unrolling gives no
advantage.

Loop unrolling is advantageous when the loop overhead is costly or when
expressions or branches that depend on the loop counter can be simplified. But
loop unrolling gives no advantage when the bottleneck lies elsewhere.

If a loop contains floating point or vector code, the limiting factor is likely
to be the floating point/vector unit in the CPU. The loop overhead is often
reduced to an integer addition and a fused compare/branch instruction.
The integer unit has plenty of resources to run the loop overhead
simultaneously with the floating point or vector code at zero extra cost.

Unrolling is no better if the instruction decoder is the bottleneck, which
is quite often the case. A tiny loop will fit into the micro-op cache or
loopback buffer of modern CPUs so that the loop will run on decoded
instructions only. A large unrolled loop is unlikely to fit into these buffers,
which means that the unrolled loop is slower.

Even if the unrolled loop is not slower when measured in isolation, it can slow
down other parts of the program because it consumes excessive amounts of code
cache.

A simple example:

const int size = 58;
double a[size], b[size], c[size];

void test() {
    for (int i = 0; i < size; i++) {
        a[i] = b[i] + c[i];
    }
}

clang -O2 -m64 will unroll this loop completely up to size = 59
clang -O3 -m64 will unroll this loop completely up to size = 119

Clang is vectorizing the loop, which is a good thing, but there is no advantage
in unrolling further.
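
For comparison, a rolled version of the same loop can be requested per loop.
The following is only a sketch, not a proposed fix. It assumes Clang's loop
hint "#pragma clang loop unroll(disable)"; the namespace unroll_demo and the
function name test_rolled are made up for illustration. The hint does not
control whether the vectorizer interleaves the loop, so the effect should be
checked in the generated assembly (for example with clang -O2 -S).

```cpp
// Sketch under stated assumptions; names in unroll_demo are hypothetical.
namespace unroll_demo {

const int size = 58;
double a[size], b[size], c[size];

// Baseline from the report: left entirely to the optimizer's heuristics.
void test() {
    for (int i = 0; i < size; i++) {
        a[i] = b[i] + c[i];
    }
}

// Same loop with a per-loop hint asking Clang not to unroll it.
// Other compilers may ignore the pragma (possibly with a warning).
void test_rolled() {
#pragma clang loop unroll(disable)
    for (int i = 0; i < size; i++) {
        a[i] = b[i] + c[i];
    }
}

} // namespace unroll_demo
```

Both functions compute the same result, so only the generated code differs.
The -fno-unroll-loops option should turn off the loop unroller for a whole
translation unit rather than for a single loop.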

Literature: I have described the loopback buffer, micro-op cache, and other
details of different CPUs in the manual "The microarchitecture of Intel, AMD
and VIA CPUs" <a href="https://www.agner.org/optimize/microarchitecture.pdf">https://www.agner.org/optimize/microarchitecture.pdf</a></pre>

        </div>

      </p>


    </body>

</html>