[PATCH][LoopVectorizer] Restrict the unroll factor of reductions in loops

Sun Aug 10 20:31:36 PDT 2014

On Typhoon the gain for libquantum is almost 2%, and about 1% on hmmer. No regressions on CINT2006. This is with -O3 and LTO, using the ref input.

-Gerolf

On Aug 8, 2014, at 4:57 PM, Gerolf Hoflehner <ghoflehner at apple.com> wrote:

> I second a tuning option at least in the short term. It is usually hard to get it right, though. So longer term this is a case for dynamic versioning that invokes different versions of the code at run-time depending on the trip count. 
> 
> -Gerolf
> 
> On Aug 8, 2014, at 12:30 PM, Hal Finkel <hfinkel at anl.gov> wrote:
> 
>> ----- Original Message -----
>>> From: "James Molloy" <james.molloy at arm.com>
>>> To: "Arnold Schwaighofer" <aschwaighofer at apple.com>
>>> Cc: "llvm-commits" <llvm-commits at cs.uiuc.edu>
>>> Sent: Friday, August 8, 2014 9:37:38 AM
>>> Subject: [PATCH][LoopVectorizer] Restrict the unroll factor of reductions in loops
>>> 
>>> Hi Arnold,
>>> 
>>> Attached are two patches. The first raises the maximum unroll factor
>>> on AArch64 from 2 to 4, for Cortex-A57 only at the moment as that’s
>>> all I’ve got data for. This gives us significant wins: ~14% on
>>> 462.libquantum at least.
>>> 
>>> However, it also causes some regressions. The second patch makes the
>>> loop vectorizer a bit more conservative with its unroll factor. The
>>> problem is specific to reductions within loops. The regressions I’ve
>>> seen are in loops with a small (but known only at runtime) trip count
>>> inside a loop nest. An unroll factor of 2 is fine, but above 2 the
>>> reduction variable fixup logic after the loop increases the critical
>>> path length and resource usage. For most loops this isn’t a problem,
>>> but small loops in a larger loop nest will execute this fixup code
>>> many times.
>> 
>> Can you please add a flag for this? I anticipate needing to tune it.
>> 
>> Also, it seems to me that this is exactly the kind of thing that would benefit from profiling information (so we can determine if the inner loop is likely to have a large trip count). Can the current infrastructure do this? Also, maybe in cases where the inner loop count is not a function of the outer loop, we might 'unswitch' it so that we get the unrolled inner loop only when actually profitable.
>> 
>> Thanks again,
>> Hal
>> 
>>> 
>>> The heuristic is: if this is a (scalar) reduction, and the loop is
>>> nested, clamp the UF to a maximum of 2. With 2, we still get wins
>>> but we only add one fadd/fmul to the critical path.
>>> 
>>> Please take a look.
>>> 
>>> Cheers,
>>> 
>>> James
>>> _______________________________________________
>>> llvm-commits mailing list
>>> llvm-commits at cs.uiuc.edu
>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
>>> 
>> 
>> -- 
>> Hal Finkel
>> Assistant Computational Scientist
>> Leadership Computing Facility
>> Argonne National Laboratory
>> 
>
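
For readers following the thread, the heuristic James describes (clamp the unroll factor to 2 for scalar reductions in nested loops) can be sketched roughly as below. This is only an illustration under stated assumptions, not the code from the attached patches: `LoopSummary`, `clampReductionUF`, `reductionFixupOps`, and the `MaxNestedReductionUF` parameter are hypothetical names, with the last one standing in for the tuning flag Hal asks for. The second function only counts the combining operations needed to merge the partial accumulators after the loop.

```cpp
#include <algorithm>

// Hypothetical summary of the two properties the heuristic looks at.
// These names are illustrative; they are not taken from the patch or LLVM.
struct LoopSummary {
  bool HasScalarReduction; // loop carries a scalar reduction (e.g. a sum)
  bool IsNested;           // loop sits inside another loop
};

// Clamp the unroll factor for scalar reductions in nested loops.
// MaxNestedReductionUF is an assumed tuning knob; its default of 2 is the
// limit proposed in the thread.
unsigned clampReductionUF(unsigned UF, const LoopSummary &L,
                          unsigned MaxNestedReductionUF = 2) {
  if (L.HasScalarReduction && L.IsNested)
    return std::min(UF, MaxNestedReductionUF);
  return UF;
}

// Unrolling a reduction by UF keeps UF partial accumulators, so merging
// them into one value after the loop takes UF - 1 combining operations
// (fadd/fmul in the floating-point case).
unsigned reductionFixupOps(unsigned UF) {
  return UF > 0 ? UF - 1 : 0;
}
```

Under these assumptions, a nested reduction loop that would otherwise be unrolled by 4 is clamped to 2, which cuts the post-loop combining work from three operations to one per execution of the inner loop. A non-nested loop, or one without a scalar reduction, keeps its original factor.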