[LLVMdev] Loop unrolling opportunity in SPEC's libquantum with profile info

Tue Jan 21 14:44:55 PST 2014

On Jan 21, 2014, at 2:01 PM, Chandler Carruth <chandlerc at google.com> wrote:

> Just to add a few notes...
> 
> On Tue, Jan 21, 2014 at 1:31 PM, Andrew Trick <atrick at apple.com> wrote:
>> Chandler suggested a way around the problem. I'll work on that first.
> 
>> It is very difficult to deal with the LoopPassManager. The concept doesn’t fit with typical loop passes, which may need to rerun function level analyses, and can affect code outside the current loop. One nice thing about it is that a pass can push a new loop onto the queue and rerun all managed loop passes just on that loop. In the past I’ve seen efforts to consolidate multiple loop passes into the same LoopPassManager, but don’t really understand the benefit other than reducing the one-time overhead of instantiating multiple pass managers and some minor IR access locality improvements. Maybe Chandler’s work will help with the overhead of instantiating pass managers.
> 
> Maybe, but currently I don't really understand the ideal state for loop pass pipeline formation. Maybe I'll sit down with you, Owen, and/or others that are more deeply involved in the loop passes to get a good design in my head. But we're not quite there yet. =]
>  
> 
>> Anyway, I see no reason that the vectorizer shouldn’t run in its own loop pass manager.
> 
> It turns out, this is all moot - the loop vectorizer *already* doesn't run as part of these loop passes. It runs on its own during the late optimization phase.
> 
> If it is a loop pass at all, I would just suggest making it a function pass and being done with it. Then there will be no problems here. Does that make sense to you guys as well?

The LoopVectorizer depends on LCSSA and LoopSimplify, both of which are loop passes. We will also have to make them available as utility functions.
> 
>> I’m sure we could be more aggressive than we are currently at O3. If the vectorizer unroller’s conditions are too strict, we could consider calling the standard partial unroller only on the loops that didn’t get vectorized.
> 
> I think we can be more aggressive across the board. I've got a pretty good conservative benchmark suite and rig for increasing the aggressiveness of these kinds of optimizations (it is very sensitive to code size and consists of macro benchmarks that aren't usually bottlenecked on a single loop). I'm going to try some basic relaxing of the unrolling thresholds in the loop vectorizer to see if there is at least an *obvious* better baseline level. Then we can see if there is more detailed tuning to do.
> 
> However, I'd like to do that after the if-conversion issue is fixed if possible. I have bandwidth to do the evaluation of thresholds, but not really to hack on the if-conversion stuff. Any guesses if you guys will have time to look at that aspect?

I’ll look into the if-conversion issue. It should not be too hard to teach the vectorizer to handle stores and might even be beneficial in cases where we can vectorize (at least plausible for VF=2).
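
For readers following along, here is a rough sketch in C of what if-converting a conditional store means at the source level. It is only an illustration: the function names, parameter names, and the loop itself are hypothetical stand-ins, not the actual libquantum code and not what the vectorizer emits. The first function stores only when a data-dependent test passes. The second always stores and selects between the old and the new value, which leaves a straight-line loop body.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical loop with a store guarded by a data-dependent branch. */
void toggle_branchy(uint64_t *state, size_t n,
                    uint64_t control, uint64_t target)
{
    for (size_t i = 0; i < n; i++) {
        if (state[i] & control)
            state[i] ^= target;   /* conditional store */
    }
}

/* Hand if-converted form: the store is unconditional and the stored
 * value is chosen with a select, so the body has no control flow. */
void toggle_if_converted(uint64_t *state, size_t n,
                         uint64_t control, uint64_t target)
{
    for (size_t i = 0; i < n; i++) {
        uint64_t old = state[i];
        uint64_t flipped = old ^ target;
        state[i] = (old & control) ? flipped : old;
    }
}
```

Both functions leave the array in the same state when run single-threaded. The second one writes every element, including the ones the first would have left untouched, so a compiler doing this automatically has to show that the extra stores are safe (writable memory, no other thread observing the location). That is assumed here rather than checked.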

- Arnold