[LLVMdev] IR Passes and TargetTransformInfo: Straw Man

Tue Jul 30 07:35:40 PDT 2013

On 7/29/2013 6:28 PM, Andrew Trick wrote:
>
> You mean that LICM and Unswitching should be left for later? For the purpose of exposing scalar optimizations, I'm not sure I agree with that but I'd be interested in examples.

Optimizations like LICM, and unswitching can potentially damage perfect 
nesting of loops.  For example, consider this nest:

   for (i) {
     for (j) {
       ... = A[i];
     }
   }

The load of A[i] is invariant in the j-loop (assuming no aliased 
stores), and even if it was not invariant, the address computation code 
is.  A pass of LICM could then generate something like this:

   for (i) {
     t = A[i];
     for (j) {
       ... = t;
     }
   }

This is no longer a perfect nest, and a variety of loop nest 
optimizations will no longer be applicable.  In general, nest 
optimizations have a greater potential for improving code performance 
than scalar optimizations, and should be given priority over them.  In 
most cases (excluding loop interchange, for example), the LICM 
opportunity will remain, and can be taken care of later.

An example of where the target-dependent factors come into the picture 
before the target-specific stage (target-specific optimizations, 
lowering, etc.), may be loop distribution.  The example below does not 
belong to the nest optimizations, but the generic target-independent 
scalar optimizations will still apply after this transformation.
Say that we have a loop that performs several computations, some of 
which may be amenable to more aggressive optimizations:

   for (i) {
     v = 1.0/(1.0 + A[i]*A[i]);
     B[i] = (1.0 - B[i])*v
   }

Suppose that we have a hardware that can perform very fast FMA 
operations (possibly vectorized).  We could transform the loop into

   for (i) {
     t = 1.0 + A[i]*A[i];
     v = 1.0/t;
     B[i] = (1.0 - B[i])*v;
   }

And then expand t into an array, and distribute the loop:

   for (i) {
     t[i] = 1.0 + A[i]*A[i];
   }
   for (i) {
     v = 1.0/t[i];
     B[i] = (1.0 - B[i])*v;
   }

The first loop may then be unrolled and vectorized, while the second one 
may simply be left alone.

This may be difficult to address in a target-independent way (via a 
generic TTI interface), since we are trying to identify a specific 
pattern that, for the specific hardware, may be worth extracting out of 
a more complex loop.  In addition to that, for this example to make 
sense, the vectorization passes would have to run afterwards, which 
would then put a bound on how late this transformation could be done.

I guess the approach of being able to extend/modify what the PMB 
creates, that you mention below, would address this problem.

> I think you're only worried about the impact on loop nest optimizations. Admittedly I'm not making much concessesion for that, because I think of loop nest optimization as a different tool that will probably want fairly substantial changes to the pass pipeline anyway.

Yes, I agree.  My major concern is that we need to be careful that we 
don't accidentally make things harder for ourselves.  I very much agree 
with the canonicalization approach, and loop nests can take a bit of 
preparation (even including loop distribution).  At the same time, there 
are other optimizations that will destroy the canonical structures (of 
loops, loop nests, etc.), that also need to take place at some point. 
There should be a place in the optimization sequence, where the 
"destructive" optimizations have not yet taken place, and where the 
"constructive" (canonical) passes can be executed.  The farther down the 
optimization stream we push the "destructive" ones, the more flexibility 
we will have as to what types of transformations can be done that 
require the code to be in a canonical form.

> Here's a few of ways it might work:
> (1) Loop nest optimizer extends the standard PMB by plugging in its own passes prior to Generic Loop Opts in addition to loading TTI. The loop nest optimizer's passes are free to query TTI:
>
> (2) Loop nest optimizer suppresses generic loop opts through a PMB flag (assuming they are too disruptive). It registers its own loop passes with the Target Loop Opts. It registers instances of generic loop opts to now run after loop nest optimization, and registers new instances of scalar opts to rerun after Target Loop Opts if needed.
>
> (3) If the loop nest optimizer were part of llvm core libs, then we could have a completely separate passmanager builder for it.

All of these approaches would work (even concurrently).  I think (3) 
could potentially be the future goal.

> Are you afraid that LICM and unswitching will obfuscate the loops to the point that you can’t recognize the idiom? The current pass pipeline would have the same problem.

Actually, in this case my concern was the interleaving of the 
target-dependent code with target-independent code (if we were to do all 
idiom recognition in the same pass), or code duplication (if the 
target-independent and target-dependent passes were to be separate). 
The more I think about it, the more I favor the "separate" approach, 
since the target-specific idioms may be very different for different 
targets, and there doesn't seem to be much that can be handled in a 
common code (hence not a lot of actual duplication would happen).

-Krzysztof

-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, 
hosted by The Linux Foundation