[LLVMdev] Loop unrolling opportunity in SPEC's libquantum with profile info

Wed Jan 15 16:36:53 PST 2014

----- Original Message -----
> From: "Diego Novillo" <dnovillo at google.com>
> To: "LLVM Developers Mailing List" <llvmdev at cs.uiuc.edu>
> Cc: nadav at apple.com
> Sent: Wednesday, January 15, 2014 6:13:27 PM
> Subject: [LLVMdev] Loop unrolling opportunity in SPEC's libquantum with profile info
> 
> I am starting to use the sample profiler to analyze new performance
> opportunities. The loop unroller has popped up in several of the
> benchmarks I'm running. In particular, libquantum. There is a ~12%
> opportunity when the runtime unroller is triggered.
> 
> This helps functions like quantum_sigma_x
> (http://sourcecodebrowser.com/libquantum/0.2.4/gates_8c_source.html#l00149).
> The function accounts for ~20% of total runtime. By allowing the
> runtime unroller, we can speed up the program by about 12%.
> 
> I have been poking at the unroller a little bit. Currently, the
> runtime unroller is only triggered by a special flag or if the target
> states it in the unrolling preferences. We could also consult the
> block frequency information here. If the loop header has a higher
> relative frequency than the rest of the function, then we'd enable
> runtime unrolling.
> 
> Chandler also pointed me at the vectorizer, which has its own
> unroller. However, the vectorizer only unrolls enough to serve the
> target; it's not as general as the runtime-triggered unroller. From
> what I've seen, it will get a maximum unroll factor of 2 on x86 (4 on
> AVX targets). Additionally, the vectorizer only unrolls to aid
> reduction variables. When I forced the vectorizer to unroll these
> loops, the performance effects were nil.

It may be worth noting that the vectorizer's unrolling is modulo unrolling (in the sense that the iterations are maximally intermixed), and so is bounded by register-pressure considerations. This is especially true in the default configuration, where CodeGen does not make use of alias analysis (AA) and so often cannot 'fix' an expensive unrolling that has increased register pressure too much.

The generic unroller, on the other hand, does concatenation unrolling (the copies of the loop body are placed one after another rather than intermixed), which has different benefits.
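
To make the distinction concrete, here is a hand-written sketch of the two styles on a loop shaped loosely like the one in quantum_sigma_x (flip one bit in every element, trip count known only at run time). The function names, the flat array of 64-bit states, the unroll factor of 4, and the placement of the leftover iterations after the main loop are all assumptions made for illustration; this is not what either pass actually emits.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Rolled reference loop: the trip count (state.size()) is a run-time value.
void flipRolled(std::vector<std::uint64_t> &state, unsigned target) {
  assert(target < 64 && "bit index must fit in a 64-bit word");
  const std::uint64_t mask = std::uint64_t{1} << target;
  for (std::size_t i = 0; i < state.size(); ++i)
    state[i] ^= mask;
}

// Concatenation unrolling by 4: the body is copied back to back, so each
// copy finishes (load, xor, store) before the next one starts. A remainder
// loop handles the iterations left over when the count is not a multiple of 4.
void flipConcatenated(std::vector<std::uint64_t> &state, unsigned target) {
  assert(target < 64 && "bit index must fit in a 64-bit word");
  const std::uint64_t mask = std::uint64_t{1} << target;
  const std::size_t n = state.size();
  std::size_t i = 0;
  for (; i + 4 <= n; i += 4) {
    state[i] ^= mask;
    state[i + 1] ^= mask;
    state[i + 2] ^= mask;
    state[i + 3] ^= mask;
  }
  for (; i < n; ++i)
    state[i] ^= mask;
}

// Interleaved (modulo-style) unrolling by 4: the same four iterations, but
// with their steps intermixed. All four values are live at once, which is
// where the extra register pressure comes from.
void flipInterleaved(std::vector<std::uint64_t> &state, unsigned target) {
  assert(target < 64 && "bit index must fit in a 64-bit word");
  const std::uint64_t mask = std::uint64_t{1} << target;
  const std::size_t n = state.size();
  std::size_t i = 0;
  for (; i + 4 <= n; i += 4) {
    std::uint64_t a = state[i];
    std::uint64_t b = state[i + 1];
    std::uint64_t c = state[i + 2];
    std::uint64_t d = state[i + 3];
    a ^= mask;
    b ^= mask;
    c ^= mask;
    d ^= mask;
    state[i] = a;
    state[i + 1] = b;
    state[i + 2] = c;
    state[i + 3] = d;
  }
  for (; i < n; ++i)
    state[i] ^= mask;
}
```

All three functions compute the same result; only the order of operations inside the unrolled body differs.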

> 
> I'm currently looking at changing LoopUnroll::runOnLoop() to consult
> block frequency information for the loop header to decide whether to
> try runtime triggers for loops that don't have a constant trip count
> but could be partially peeled.
> 
> Does that sound reasonable?

This sounds good to me; I definitely feel that we should better exploit the generic unroller's capabilities.
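
As a rough sketch of the kind of check being proposed, something like the following would capture the idea. Everything here is hypothetical: the LoopProfile struct, the shouldTryRuntimeUnroll name, and the default threshold of 8 are placeholders for illustration and are not LLVM's block frequency API or its unrolling preferences.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical summary of what the unroller would know about one loop.
struct LoopProfile {
  std::uint64_t headerFreq;   // profiled frequency of the loop header block
  std::uint64_t entryFreq;    // profiled frequency of the function entry block
  bool hasConstantTripCount;  // true if the trip count is a compile-time constant
};

// Decide whether to attempt runtime unrolling for a loop.
// hotRatio is an assumed tuning knob: how many times hotter than the
// function entry the loop header must be before the loop counts as hot.
bool shouldTryRuntimeUnroll(const LoopProfile &p, std::uint64_t hotRatio = 8) {
  assert(hotRatio > 0 && "threshold must be positive");
  // Loops with a constant trip count do not need a runtime trip-count check.
  if (p.hasConstantTripCount)
    return false;
  // Without profile data for the function, keep the existing behavior.
  if (p.entryFreq == 0)
    return false;
  // Integer division avoids overflow; for positive integers,
  // headerFreq / entryFreq >= hotRatio iff headerFreq >= hotRatio * entryFreq.
  return p.headerFreq / p.entryFreq >= hotRatio;
}
```

The threshold is the part that would need tuning against the test suite, since it determines how many of the previously observed slowdowns are filtered out.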

The last time that I tried enabling runtime unrolling (and partial unrolling) over the entire test suite on x86, there were many speedups and many slowdowns (although slightly more slowdowns than speedups). You seem to be suggesting that restricting runtime unrolling to known hot loops will eliminate many of the slowdowns. I'm certainly curious to see how that turns out.

 -Hal

> 
> 
> Thanks.  Diego.

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory