[llvm] r180802 - InstCombine: Fold more shuffles of shuffles.

Tue Apr 30 21:49:48 PDT 2013

----- Original Message -----
> From: "Nadav Rotem" <nrotem at apple.com>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: "Jim Grosbach" <grosbach at apple.com>, llvm-commits at cs.uiuc.edu
> Sent: Tuesday, April 30, 2013 11:29:55 PM
> Subject: Re: [llvm] r180802 - InstCombine: Fold more shuffles of shuffles.
> 
> 
> 
> 
> 
> 
> I don't disagree with reverting the change in this case, as there is
> disagreement regarding the correct approach. Nevertheless, I think
> that something about the objections to the change seems misguided:
> it seems that we've traded a demonstrated performance improvement on
> at least one platform for a theoretical lack-of-regression on
> another. If there are cases on x86 where we need to preserve
> carefully-constructed shuffle combinations, then we need to have
> examples of these as regression tests, and we need examples in the
> test suite. Otherwise, we have no real way to insure that we won't
> regress on those kernels (which likely rely on other factors besides
> the shuffle selections). Maybe MI-level CSE will be sufficient to
> cleanup after a DAGCombine-level shuffle combination, but I suspect
> we'll get better results from doing this optimization earlier.
> 
> 
> Hi Hal,
> 
> 
> Reverting this patch was the right thing to do. At the end of the
> day, we can solve this with a simple DAGCombine optimization that
> merges concat_vector and extract_subvector. Notice that all of the
> shuffles that come out of SROA are legal and they are converted into
> the insert/extract/concat_vector SDNodes. We can continue the
> discussion on the shuffle optimization, but it is irrelevant to this
> PR. Regarding the changes to InstCombine, the regressions on x86 are
> not theoretical and I mentioned a specific example[1] (AOS to SOA
> transpose where merging shuffles will hurt the performance due to
> cross-lane shuffles.

I agree with almost all of this, and I'm not saying that the architectural issue is theoretical (it is well known, and clearly not). My point is that this cross-lane shuffle problem cannot be treated in isolation. Any kernel which depends on that behavior will also be sensitive to other factors (i-cache contention, register pressure, scheduling, etc.) and trying to preserve the performance of a class of kernels simply by inhibiting shuffle combination is unlikely to be successful. It might turn out that for most kernels that are sensitive to cross-lane shuffle costs, those costs are offset by gains from these other factors. You seem to be suggesting otherwise, and that is what I was calling theoretical, because we have no explicit test cases. In short, it seems as though this decision should be more data-driven. I'll be curious to know if there is any significant difference in performance on ARM between the InstCombine implementation and the DAGCombine one.

> In theory the codegen could recover, but there
> are a number of problems. One problem is that the search space is
> huge.

Yes, but the search space for cases where this matters might be small. Meaning that combining two cheap shuffles might be a loss, but combining three or four into one cross-lane shuffle might still be a win. In this case, we actually only need to undo the shuffle combination in the backend if it is a combination of two cheaper shuffles, and this might be quite doable.

> Another problem is that we lower bottom-up and we don't have a
> good way to take into account the number of users that different
> shuffle nodes have. InstCombine needs to canonicalize (not to
> optimize) only in cases where it makes sense.

Agreed.

Thanks again,
Hal

> 
> 
> Thanks,
> Nadav
> 
> 
> 
> 
> 
> 
> [1] -
> http://software.intel.com/en-us/articles/3d-vector-normalization-using-256-bit-intel-advanced-vector-extensions-intel-avx
> 
> 
> 
>