[PATCH] Fix PR19657 : SLP vectorization doesn't combine scalar load to vector loads

Arnold Schwaighofer aschwaighofer at apple.com
Tue May 27 11:57:47 PDT 2014


I don’t think there is a general solution using the current algorithm (relying on source order).

I think we run into the limits of what we can do with the existing scheduling check based on source order (inspecting all users and making sure that they are after the last instruction in the vectorized bundle).
The general problem: we have to show that the dependence graph is still cycle-free after vectorizing a set of nodes. Cheaply :).

The cheap part is hard for two reasons:

- By being able to vectorize more trees (the ones that are schedulable but not detected by the current algorithm) you automatically make runtime worse even if the new algorithm has the same complexity.

- Detecting cycles in a dynamically changing DAG has a higher complexity than what we currently do.
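
To make the cost concrete, here is a minimal C++ sketch of the from-scratch version of that check: contract a candidate bundle into a single node and run a topological sort (Kahn's algorithm) over the result. This is not SLPVectorizer code. The names Graph and staysAcyclicAfterFusing, and the def-to-users edge layout, are assumptions made purely for illustration.

```cpp
#include <cassert>
#include <queue>
#include <vector>

// Hypothetical dependence graph: users[i] lists the nodes that use node i.
using Graph = std::vector<std::vector<int>>;

// Returns true if fusing all nodes of `bundle` into one node leaves the
// graph acyclic. Assumes the input graph is itself a DAG.
bool staysAcyclicAfterFusing(const Graph &users, const std::vector<int> &bundle) {
  assert(!bundle.empty());
  const int n = static_cast<int>(users.size());

  // Every bundle member is represented by the first member.
  std::vector<int> rep(n);
  for (int i = 0; i < n; ++i) rep[i] = i;
  for (int b : bundle) rep[b] = bundle.front();

  // Build the contracted graph and count incoming edges.
  Graph fused(n);
  std::vector<int> indegree(n, 0);
  for (int u = 0; u < n; ++u) {
    for (int v : users[u]) {
      if (rep[u] == rep[v]) return false; // bundle members depend on each other
      fused[rep[u]].push_back(rep[v]);
      ++indegree[rep[v]];
    }
  }

  // Kahn's algorithm: a cycle exists iff some live node is never scheduled.
  std::queue<int> ready;
  int live = 0;
  for (int i = 0; i < n; ++i) {
    if (rep[i] != i) continue; // absorbed into the fused node
    ++live;
    if (indegree[i] == 0) ready.push(i);
  }
  int scheduled = 0;
  while (!ready.empty()) {
    int u = ready.front();
    ready.pop();
    ++scheduled;
    for (int v : fused[u])
      if (--indegree[v] == 0) ready.push(v);
  }
  return scheduled == live;
}
```

Each query walks the whole graph, so it costs O(V + E) per candidate bundle. Repeating that for every bundle in a tree is the expensive part.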

Personally, I would like to see how an algorithm based on dynamically maintaining a topological sort would fare in practice. At the very least, once we move the SLPVectorizer out of the inliner that extra cost might not matter that much given the benefit of more code being vectorized. We currently have a hard time starting vectorization mid graph and miss opportunities because of that.
Depending on the optimization flag, we might want to switch between the two.
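
For reference, a rough sketch of what dynamically maintaining a topological sort could look like, in the style of the Pearce-Kelly online algorithm. This is an assumption about one possible approach, not something the patch or the SLPVectorizer implements. OrderedDag and its methods are hypothetical names. The idea: inserting an edge that already agrees with the current order is free, and a conflicting edge only triggers a search and a renumbering inside the affected window of the order.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical DAG that keeps a valid topological order as edges are added.
class OrderedDag {
public:
  explicit OrderedDag(int n) : succ(n), pred(n), ord(n) {
    for (int i = 0; i < n; ++i) ord[i] = i; // any order works with no edges
  }

  // Adds from -> to. Returns false, leaving the graph unchanged, if the
  // edge would close a cycle.
  bool addEdge(int from, int to) {
    const int n = static_cast<int>(succ.size());
    assert(from >= 0 && from < n && to >= 0 && to < n);
    if (from == to) return false;
    if (ord[from] > ord[to]) {
      // Only nodes positioned between `to` and `from` can be affected.
      std::vector<char> seen(succ.size(), 0);
      std::vector<int> fwd, bwd;
      if (!forward(to, ord[from], from, seen, fwd)) return false;
      backward(from, ord[to], seen, bwd);
      reorder(fwd, bwd);
    }
    succ[from].push_back(to);
    pred[to].push_back(from);
    return true;
  }

  int position(int node) const { return ord[node]; }

private:
  // Collects nodes reachable from `node` inside the window; false on cycle.
  bool forward(int node, int upper, int target, std::vector<char> &seen,
               std::vector<int> &out) {
    seen[node] = 1;
    out.push_back(node);
    for (int next : succ[node]) {
      if (next == target) return false;
      if (!seen[next] && ord[next] < upper &&
          !forward(next, upper, target, seen, out))
        return false;
    }
    return true;
  }

  // Collects nodes inside the window that reach `node`.
  void backward(int node, int lower, std::vector<char> &seen,
                std::vector<int> &out) {
    seen[node] = 1;
    out.push_back(node);
    for (int prev : pred[node])
      if (!seen[prev] && ord[prev] > lower) backward(prev, lower, seen, out);
  }

  // Reuses the affected positions: ancestors of `from` first, then
  // descendants of `to`, each group keeping its relative order.
  void reorder(std::vector<int> &fwd, std::vector<int> &bwd) {
    auto byPosition = [this](int a, int b) { return ord[a] < ord[b]; };
    std::sort(fwd.begin(), fwd.end(), byPosition);
    std::sort(bwd.begin(), bwd.end(), byPosition);
    std::vector<int> nodes(bwd);
    nodes.insert(nodes.end(), fwd.begin(), fwd.end());
    std::vector<int> slots;
    for (int v : nodes) slots.push_back(ord[v]);
    std::sort(slots.begin(), slots.end());
    for (std::size_t i = 0; i < nodes.size(); ++i) ord[nodes[i]] = slots[i];
  }

  std::vector<std::vector<int>> succ, pred;
  std::vector<int> ord; // ord[v] = position of v in the topological order
};
```

A scheduling check built on something like this would ask addEdge for each dependence a bundle introduces and give up on the bundle at the first false. The worst case is still linear per insertion, so whether it is cheap enough would have to be measured.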

—

Alternatively, one could treat loads specially and try to push a load and its inputs up instead of down if there is a conflict.
This would only solve problems involving loads like the one below. If we had an internal shared node (not a load) this would not help.

—

On May 23, 2014, at 11:00 AM, Raul Silvera <rsilvera at google.com> wrote:

> I agree this is an improvement, but this approach is still failing to catch cases where the common uses are not on separate paths on the tree. In those cases no matter which order we take there will be a common use that can't be scheduled.
> 
> Here is a sample case where we still fail to schedule a common load:
> 
> void foo(double *x, double C) {
>  x[0] = x[0]*C + x[0] * x[0];
>  x[1] = x[1]*C + x[1] * x[1];
> }
> Any ideas on how to get those cases? It would seem to me we'd need either a prepass or deferral of the decisions until the whole tree is inspected. Thoughts?
> 
> 
> Raúl E Silvera | SWE | rsilvera at google.com | 408-789-2846
> 
> 
> 
> On Fri, May 23, 2014 at 9:32 AM, Nadav Rotem <nrotem at apple.com> wrote:
> Hi Karthik,
> 
> Can you please measure the effects of this patch on the LLVM test suite?  It would be interesting to see if other workloads are affected by this change and if they improve or regress.
> 
> Thanks,
> Nadav
> 
> http://reviews.llvm.org/D3800
> 
> 
> 



More information about the llvm-commits mailing list