[PATCH][LoopVectorizer] Restrict the unroll factor of reductions in loops

Fri Aug 8 10:11:13 PDT 2014

This makes sense to me. 

We should evaluate the impact on Cyclone (by changing James’ patch to also return 4 for Cyclone).

Gerolf, could you run benchmarks?

> On Aug 8, 2014, at 7:37 AM, James Molloy <James.Molloy at arm.com> wrote:
> 
> Hi Arnold,
>  
> Attached are two patches. The first ups the maximum unroll factor on AArch64 from 2 to 4, for C-A57 only at the moment as that’s all I’ve got data for. This gives us significant wins – ~14% on 462.libquantum at least.

Is this from quantum_toffoli? (We saw similar wins there for x86_64).
>  
> However, it also causes some regressions. The second patch makes the loop vectorizer a bit more conservative with its unroll factor. The problem is purely for reductions within loops. The regressions I’ve seen are in loops with small (but runtime-known) trip counts within a loop nest. A loop unroll factor of 2 is fine, but above 2 the reduction variable fixup logic after the loop increases the critical path length and resource usage. For most loops this isn’t a problem, but small loops in a larger loop nest will execute this fixup code many times.
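
To make the fixup cost concrete, here is a rough scalar model of an interleaved reduction. It is only a sketch: sumInterleaved and ReductionResult are hypothetical names, not vectorizer code, and the linear fold at the end is an assumption about how the partial sums get combined. It does show why the post-loop work grows with the unroll factor.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Result of a reduction plus the number of combining operations that run
// after the main loop has finished.
struct ReductionResult {
  float Sum;
  unsigned FixupOps;
};

// Sum Data using UF independent partial accumulators, modelling an
// interleaved (unrolled) reduction. Hypothetical helper, not LLVM code.
ReductionResult sumInterleaved(const std::vector<float> &Data, unsigned UF) {
  assert(UF >= 1 && "unroll factor must be positive");
  std::vector<float> Partial(UF, 0.0f);

  // Main loop: each unrolled step feeds UF independent accumulators.
  std::size_t I = 0;
  for (; I + UF <= Data.size(); I += UF)
    for (unsigned J = 0; J < UF; ++J)
      Partial[J] += Data[I + J];

  // Remainder iterations that do not fill a whole unrolled step.
  for (; I < Data.size(); ++I)
    Partial[0] += Data[I];

  // Fixup: fold the partial accumulators back into a single value.
  // This runs once per execution of the loop, i.e. once per outer iteration
  // when the loop sits inside a loop nest.
  ReductionResult R{Partial[0], 0};
  for (unsigned J = 1; J < UF; ++J) {
    R.Sum += Partial[J];
    ++R.FixupOps;
  }
  return R;
}
```

In this model the fold costs UF - 1 operations: one for UF = 2, three for UF = 4. When the inner trip count is small, that fixed cost is paid on every outer iteration and is no longer amortized.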
>  
> The heuristic is: if this is a (scalar) reduction, and the loop is nested, clamp the UF to a maximum of 2. With 2, we still get wins but we only add one fadd/fmul to the critical path.
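
As I read it, the heuristic amounts to something like the sketch below. I have not copied this from the patch: LoopSketch, its fields, and clampUnrollFactor are illustrative names and the real code may be structured differently.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical summary of a loop; the fields are illustrative only.
struct LoopSketch {
  bool HasScalarReduction; // the loop carries a scalar reduction
  unsigned Depth;          // 1 = outermost loop, >1 = nested in another loop
};

// Clamp the unroll factor to 2 for scalar reductions in nested loops,
// and leave it untouched in every other case.
unsigned clampUnrollFactor(const LoopSketch &L, unsigned UF) {
  assert(UF >= 1 && "unroll factor must be positive");
  const unsigned MaxNestedReductionUF = 2;
  if (L.HasScalarReduction && L.Depth > 1)
    return std::min(UF, MaxNestedReductionUF);
  return UF;
}
```

So a nested reduction that would otherwise get UF = 4 drops to 2, while outermost loops and loops without reductions keep whatever the target's maximum allows.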
>  
> Please take a look.
>  
> Cheers,
>  
> James 
> <up-max-unroll.diff><limit-scalar-reductions.diff>