[llvm] r180802 - InstCombine: Fold more shuffles of shuffles.

Hal Finkel hfinkel at anl.gov
Tue Apr 30 19:55:13 PDT 2013


----- Original Message -----
> From: "Jim Grosbach" <grosbach at apple.com>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: llvm-commits at cs.uiuc.edu, "Nadav Rotem" <nrotem at apple.com>
> Sent: Tuesday, April 30, 2013 7:26:33 PM
> Subject: Re: [llvm] r180802 - InstCombine: Fold more shuffles of shuffles.
> 
> Note, I reverted the original change in r180834 since we’re actively
> discussing the right approach.

I don't disagree with reverting the change in this case, as there is disagreement regarding the correct approach. Nevertheless, I think something about the objections to the change is misguided: we've traded a demonstrated performance improvement on at least one platform for a theoretical lack of regression on another. If there are cases on x86 where we need to preserve carefully-constructed shuffle combinations, then we need examples of these as regression tests, and we need examples in the test suite. Otherwise, we have no real way to ensure that we won't regress on those kernels (which likely rely on other factors besides the shuffle selections). Maybe MI-level CSE will be sufficient to clean up after a DAGCombine-level shuffle combination, but I suspect we'll get better results from doing this optimization earlier.
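
To make the discussion concrete, the kind of fold in question looks roughly like this (a sketch with made-up masks, not taken from the reverted patch's tests). A shuffle of a shuffle:

  %t = shufflevector <4 x float> %a, <4 x float> %b,
                     <4 x i32> <i32 0, i32 1, i32 4, i32 5>
  %r = shufflevector <4 x float> %t, <4 x float> undef,
                     <4 x i32> <i32 2, i32 3, i32 0, i32 1>

can be folded into a single equivalent shuffle of the original inputs:

  %r = shufflevector <4 x float> %a, <4 x float> %b,
                     <4 x i32> <i32 4, i32 5, i32 0, i32 1>

Whether the one combined shuffle is cheaper than the pair it replaces is exactly the target-dependent question being debated.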

 -Hal

> 
> -j
> 
> On Apr 30, 2013, at 4:48 PM, Jim Grosbach <grosbach at apple.com> wrote:
> 
> On Apr 30, 2013, at 4:07 PM, Hal Finkel <hfinkel at anl.gov> wrote:
> 
> ----- Original Message -----
> From: "Jim Grosbach" <grosbach at apple.com>
> To: "Nadav Rotem" <nrotem at apple.com>
> Cc: llvm-commits at cs.uiuc.edu, "Hal Finkel" <hfinkel at anl.gov>
> Sent: Tuesday, April 30, 2013 5:01:26 PM
> Subject: Re: [llvm] r180802 - InstCombine: Fold more shuffles of shuffles.
> 
> On Apr 30, 2013, at 2:38 PM, Nadav Rotem <nrotem at apple.com> wrote:
> 
> On Apr 30, 2013, at 2:23 PM, Jim Grosbach <grosbach at apple.com> wrote:
> 
> That used to be true. These days, we see lots of shuffles coming from
> SROA and we have to mess with them or things go very badly.
> 
> You can change SROA to reduce the number of shufflevector
> instructions it emits. I scanned SROA and I did not see anything too
> bad. I only saw insert/extract and blend, which should map nicely to
> any vector ISA. But you can make things better by changing
> SROA.cpp::insertVector to generate a select instruction instead of a
> vector shuffle.
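> 
> For instance (an illustrative sketch, not actual SROA output), a
> blend-style shuffle such as
> 
>   %blend = shufflevector <4 x float> %a, <4 x float> %b,
>                          <4 x i32> <i32 0, i32 5, i32 2, i32 7>
> 
> could be emitted as a select on a constant mask instead:
> 
>   %blend = select <4 x i1> <i1 true, i1 false, i1 true, i1 false>,
>                   <4 x float> %a, <4 x float> %b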
> 
> It’s more complicated than that. Have a look at the example on the PR
> that motivates this patch to see what I mean. If you have an
> alternative suggestion, I’d love to hear it.
> 
> The more I think about it, the more I disagree with the current
> design. It creates a deep, tight coupling between what InstCombine
> does and the least common denominator of what target codegen can
> handle. That’s untenable in the long term and artificially
> restrictive.
> 
> The motivation for the current design (which I am happy with) is that
> it is very difficult to undo a canonicalization of shuffles because
> the search space is huge. Also, the clang shuffle builtins are used
> by people who optimize their performance-sensitive applications.
> 
> This code is mostly pretty ancient. I wouldn’t read very much into
> the current implementation with regards to motivations. A lot has
> changed in the last few years. In any case, just because something is
> difficult doesn’t mean it’s not the right thing to do. Don’t get me
> wrong, I’m not unsympathetic to being pragmatic here. We have to be.
> There’s always a balance between the theoretically best approach and
> what we do in practice out of expedience. In this case, however,
> we’re aggressively pessimizing one target to help out another.
> 
> It is more than one target being hurt ;) -- but the current
> pessimization, from what I can tell, is designed to protect against
> two things:
> 
> 1. Creating shuffles that are expensive to materialize (either
> because they are not natively supported, or because they are just
> slower, as with this AVX issue, for example). We have a TTI
> interface for shuffle costs, maybe we should use it?
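> 
> (For illustration, with made-up IR: on AVX, an in-lane shuffle like
> 
>   %s = shufflevector <8 x float> %v, <8 x float> undef,
>        <8 x i32> <i32 1, i32 0, i32 3, i32 2, i32 5, i32 4, i32 7, i32 6>
> 
> maps to a single vpermilps, while a lane-crossing one like the full
> reverse
> 
>   %r = shufflevector <8 x float> %v, <8 x float> undef,
>        <8 x i32> <i32 7, i32 6, i32 5, i32 4, i32 3, i32 2, i32 1, i32 0>
> 
> needs extra cross-lane work. A cost query could distinguish the two.)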
> 
> Right. It’s effectively taking a least-common-denominator approach
> and only doing a very few things that it knows are always safe, for
> some definitions of “always.” The current change is obviously too
> liberal and should be revisited. The question is whether there is a
> sufficiently conservative transformation we can/should make here
> that addresses the original problem or whether we should fold these
> shuffles away somewhere else. I’m on the fence. I like the idea, in
> principle, of doing it here. I’m unsure if that’s best in practice.
> For example, we’d need checks that we’re really doing subvector
> operations here rather than a generic shuffle. That’s complicated by
> not having done type legalization yet, so the whole idea of which
> shuffles will be “safe” is really hard to pin down.
> 
> TTI is a possibility. Though despite the FIXME I added, I’m somewhat
> loath to do that. I want to keep target-dependent logic out of the
> IR-level passes as much as possible, in particular out of ones, like
> InstCombine, that are typically run as part of ‘opt’ rather than
> ‘llc’.
> 
> Nadav and I looked at this a bit more and are considering a
> DAGCombine instead. While we don’t have as much of the existing
> logic to leverage there, we do have a lot more information about the
> target available to us. Tradeoffs abound.
> 
> 2. Creating too many different shuffles. On architectures that
> support arbitrary shuffles with relatively low cost, arbitrary
> shuffles generally require a shuffle-specification input in a
> register, and too many unique shuffles therefore increase register
> pressure. It is not clear how much of a problem this is in practice,
> but I recall a comment in InstCombine somewhere on this point.
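> 
> (To sketch the concern with made-up masks: on a target whose general
> shuffle, e.g. Altivec’s vperm, takes its control vector in a
> register, a pair like
> 
>   %x = shufflevector <4 x float> %a, <4 x float> %b,
>                      <4 x i32> <i32 0, i32 5, i32 1, i32 4>
>   %y = shufflevector <4 x float> %a, <4 x float> %b,
>                      <4 x i32> <i32 2, i32 7, i32 3, i32 6>
> 
> needs two distinct mask constants materialized and kept live, so
> producing many unique masks eats into the register budget.)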
> 
> That makes sense. In a perfect world, I’d like to see a target
> transform to clean that sort of thing up. We’re pretty far from
> there right now, though.
> 
> -Hal
> 
> This patch flips around which target gets pessimized. My contention
> is that we can do better than the current extremely conservative
> heuristic and fix that.
> 
> We obviously shouldn’t immediately remove all the “be careful here or
> targets will probably freak out” safeguards. That would have nasty
> results, to put it mildly. We should, however, be actively working
> to find ways to allow this (and other) code to be as aggressive as
> possible. The question for the moment is what additional conditions
> can we add that will allow the transformation for the situation we
> want to enable and still avoid it when it’s troublesome.
> Suggestions?
> 
> -Jim
> 
> We use this approach in other places in the compiler. We don't always
> canonicalize all the way, and in many cases we prefer not to touch
> the program unless we are certain that the change is beneficial. The
> shuffle ISA is so sparse that merging shuffles is likely to cause
> regressions.
> 
> We do. And it’s a compromise every single time. It’s technical debt,
> and we should be actively working to reduce it, not add to it.
> 
> When I looked at the test case, I noticed that some of your shuffles
> were actually vector blends. It would make sense to convert
> blend-shuffles into select instructions with a constant mask.
> 
> Sounds like a good idea. I don’t know much about the provenance of
> the original IR of these test cases.
> 



