[LLVMdev] Loop unrolling opportunity in SPEC's libquantum with profile info

Wed Jan 15 16:41:32 PST 2014

On Wed, Jan 15, 2014 at 4:36 PM, Hal Finkel <hfinkel at anl.gov> wrote:
> ----- Original Message -----
>> From: "Diego Novillo" <dnovillo at google.com>
>> To: "LLVM Developers Mailing List" <llvmdev at cs.uiuc.edu>
>> Cc: nadav at apple.com
>> Sent: Wednesday, January 15, 2014 6:13:27 PM
>> Subject: [LLVMdev] Loop unrolling opportunity in SPEC's libquantum with profile info
>>
>> I am starting to use the sample profiler to analyze new performance
>> opportunities. The loop unroller has popped up in several of the
>> benchmarks I'm running. In particular, libquantum. There is a ~12%
>> opportunity when the runtime unroller is triggered.
>>
>> This helps functions like quantum_sigma_x
>> (http://sourcecodebrowser.com/libquantum/0.2.4/gates_8c_source.html#l00149).
>> The function accounts for ~20% of total runtime. By allowing the
>> runtime unroller, we can speed up the program by about 12%.
>>
>> I have been poking at the unroller a little bit. Currently, the
>> runtime unroller is only triggered by a special flag or if the target
>> states it in the unrolling preferences. We could also consult the
>> block frequency information here. If the loop header has a higher
>> relative frequency than the rest of the function, then we'd enable
>> runtime unrolling.
>>
>> Chandler also pointed me at the vectorizer, which has its own
>> unroller. However, the vectorizer only unrolls enough to serve the
>> target; it's not as general as the runtime-triggered unroller. From
>> what I've seen, it will get a maximum unroll factor of 2 on x86 (4 on
>> AVX targets). Additionally, the vectorizer only unrolls to aid
>> reduction variables. When I forced the vectorizer to unroll these
>> loops, the performance effects were nil.
>
> It may be worth noting that the vectorizer's unrolling is modulo unrolling (in the sense that the iterations are maximally intermixed), and so is bound
> by register pressure considerations (especially in the default configuration, where CodeGen does not make use of AA, and so often cannot 'fix' an
> expensive unrolling that has increased register pressure too much).
>
> The generic unroller, on the other hand, does concatenation unrolling, which has different benefits.

Thanks.
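
To make the distinction concrete, here is a rough hand-written sketch of the
two shapes for an unroll factor of 2. It is not taken from LLVM or from
libquantum; the function names and the xor-with-a-mask loop body are made up
for illustration, and a real compiler would do this on IR, not on source.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Concatenation unrolling (illustrative): whole copies of the loop body are
// placed one after another, so copy 1 starts only after copy 0 has finished.
void flip_concatenated(std::vector<std::uint64_t> &state, std::uint64_t mask) {
  const std::size_t n = state.size();
  std::size_t i = 0;
  for (; i + 1 < n; i += 2) {
    std::uint64_t x0 = state[i];      // copy 0: load, xor, store
    x0 ^= mask;
    state[i] = x0;
    std::uint64_t x1 = state[i + 1];  // copy 1: load, xor, store
    x1 ^= mask;
    state[i + 1] = x1;
  }
  for (; i < n; ++i)                  // remainder when n is odd
    state[i] ^= mask;
}

// Interleaved ("modulo") unrolling (illustrative): the same steps of both
// iterations are grouped together, so x0 and x1 are live at the same time.
void flip_interleaved(std::vector<std::uint64_t> &state, std::uint64_t mask) {
  const std::size_t n = state.size();
  std::size_t i = 0;
  for (; i + 1 < n; i += 2) {
    std::uint64_t x0 = state[i];      // both loads first
    std::uint64_t x1 = state[i + 1];
    x0 ^= mask;                       // then both xors
    x1 ^= mask;
    state[i] = x0;                    // then both stores
    state[i + 1] = x1;
  }
  for (; i < n; ++i)                  // remainder when n is odd
    state[i] ^= mask;
}
```

Both versions compute the same result. The interleaved form keeps more values
live at once, which is where the register pressure concern comes from as the
unroll factor grows. The remainder loop is there because the trip count is
only known at run time.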

> This sounds good to me; I definitely feel that we should better exploit the generic unroller's capabilities.
>
> The last time that I tried enabling runtime unrolling (and partial unrolling) over the entire test suite on x86, there were many speedups and many
> slowdowns (although slightly more slowdowns than speedups). You seem to be suggesting that restricting runtime unrolling to known hot loops will
> eliminate many of the slowdowns. I'm certainly curious to see how that turns out.

Right. If I force the runtime unroller, I get a mixed bag of speedups
and slowdowns. Additionally, code size skyrockets. By using it only on
the functions that have hot loops (as per the profile), we only unroll
those that make a difference. In the case of libquantum, there is a
grand total of 3 loops that need to be runtime unrolled.
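
As a rough illustration of the kind of check I have in mind, here is a
standalone sketch. It does not use LLVM's actual interfaces: the function
name, the plain integer frequencies, and the default hotness factor of 8 are
all assumptions made up for this example.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper, not an LLVM API. Returns true when the loop header's
// profile frequency is at least `hotFactor` times the largest frequency of
// any block outside the loop. The factor of 8 is an arbitrary placeholder.
bool shouldRuntimeUnroll(std::uint64_t headerFreq,
                         const std::vector<std::uint64_t> &outsideFreqs,
                         std::uint64_t hotFactor = 8) {
  if (headerFreq == 0 || hotFactor == 0)
    return false;  // no profile data for the loop, or a meaningless factor

  std::uint64_t maxOutside = 0;
  for (std::uint64_t f : outsideFreqs)
    if (f > maxOutside)
      maxOutside = f;

  // Same as headerFreq >= hotFactor * maxOutside, written as a division so
  // the comparison cannot overflow.
  return headerFreq / hotFactor >= maxOutside;
}
```

With these made-up numbers, a header frequency of 1000 against outside blocks
of {1, 10, 50} would enable runtime unrolling, while a header frequency of 100
against the same blocks would not. The right threshold would have to come from
measurements.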

Diego.