[PATCH] Fold selects when the non-selected elements are undef from shufflevectors

Tue May 6 15:08:37 PDT 2014

----- Original Message -----
> From: "Nadav Rotem" <nrotem at apple.com>
> To: "Philip Reames" <listmail at philipreames.com>
> Cc: reviews+D3561+public+feaea29ac1f7f0ef at reviews.llvm.org, "Filipe Cabecinhas" <filcab+llvm.phabricator at gmail.com>,
> "Benjamin Kramer" <benny.kra at gmail.com>, "LLVM Commits" <llvm-commits at cs.uiuc.edu>, "Eli Friedman"
> <eli.friedman at gmail.com>
> Sent: Tuesday, May 6, 2014 2:31:28 PM
> Subject: Re: [PATCH] Fold selects when the non-selected elements are undef from	shufflevectors
> 
> 
> 
> 
> 
> On May 6, 2014, at 12:17 PM, Philip Reames <
> listmail at philipreames.com > wrote:
> 
> 
> Hi Philip,
> 
> 
> 
> 
> 
> Nadav,
> 
> Could you spell out what you believe to be the right approach w.r.t.
> this topic? I'd also appreciate any links to previous discussion on
> this topic; a quick google search didn't find much.
> 

As a general matter, we try to do these kinds of optimizations in CodeGen. Specifically, DAGCombine would be the natural place, because there you can check legality of the relevant operations and you can call isShuffleMaskLegal on the new potential shuffle mask.

As Nadav has said, we really try not to create new shuffle masks because it is *really hard* to efficiently decompose arbitrary shuffles into high-quality instruction sequences on some targets. Essentially, there is no good canonical form for them. I think that one could make a case that, because there is no canonical form for shuffles, this might be the one circumstance where TTI would be appropriate in InstCombine, but I think that a CodeGen-based solution would need to be proved insufficient first.

 -Hal

> 
> 
> This is one thread:
> http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20120402/140452.html
> 
> 
> This is another:
> http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20130429/173217.html
> 
> 
> A presentation that discusses how difficult it is to lower shuffles:
> http://www.nondot.org/~sabre//2012-04-02-CGOKeynote.pdf
> 
> 
> 
> 
> 
> Reading through Filipe's patch, it seems entirely reasonable to me.
> It's a strict reduction in the number of shufflevectors in the IR.
> 
> 
> The problem is not the number of shuffles in the IR. The problem is
> the number of generated instructions. Shuffle instructions are
> really difficult to lower and the current design decision is not to
> introduce new shuffle instructions.
> 
> 
> 
> Even given the fact we can't lower shuffle vector perfectly in all
> cases, I don't see the harm in this patch. I'm not trying to say
> you're wrong; I'm merely trying to understand what it is I'm not
> seeing here.
> 
> 
> 
> There is lots of code out there that uses llvm shuffles with
> knowledge of the underlying architecture. This is very common in
> OpenCL and OpenGL implementation and introducing new shuffles that
> we can’t lower properly can break people’s hand-optimized matrix
> multiplication code.
> 
> 
> Thanks,
> Nadav
> 
> 
> 
> 
> Philip
> 
> 
> 
> On 05/06/2014 09:22 AM, Nadav Rotem wrote:
> 
> 
> 
> Hi Filipe,
> 
> 
> Thank you for working on this. I appreciate the fact that you are
> eager to improve LLVM and that you went through the trouble of
> testing your code on multiple targets and on considering different
> alternatives to the problem. I think that the patch below is
> unrelated to the blend->select lowering so I will only address this
> patch. We discussed on the possibility of creating new shuffles on
> the llvm mailing lists more than once. We decided not to create new
> shuffles because we may create new shuffle patterns that we can’t
> lower efficiently. I ran into this problem a few times myself in
> different LLVM-based projects, mainly in the domain of graphics
> where people write swizzles themselves. Please do not commit this
> patch because it creates new shuffles. It is not the correct
> approach. You may get lucky in you selection of testcase, but I am
> sure that there are lots of other shuffle inputs on X86 and other
> targets that this patch makes worse.
> 
> 
> Thanks,
> Nadav
> 
> 
> 
> 
> 
> 
> 
> Hi Nadav,
> 
> 
> I'd really like this patch to get in, since it improves code
> generation on many targets (at the very least, I've tried arm and
> arm64 (mcpu=cyclone), ppc32 and ppc64 (mcpu=g5), and x86 and x86-64
> (mcpu=nehalem) and they all improve (they end up with no loads vs. 1
> or 2 loads, and code size is about half of what it was).
> 
> 
> Even if we'd like to make the select instruction be the proper
> canonicalization of this operation, it seems that, for now, a
> shufflevector is better when it comes to lowering it.
> 
> 
> Several targets have the action for the vselect as Expand, which will
> generate a bunch of code which would then have to be
> pattern-matched, instead of lowering the vselect as custom and
> falling-through to Expand if there's no special-case instruction/set
> of instructions that we can emit.
> 
> 
> I ran opt+llc in this way, for the variants I mentioned above
> (./bin/opt is opt with this optimization):
> ./bin/opt -O3 select-from-shuffles.ll | ./bin/llc -march=arm64
> -mcpu=cyclone > arm64-cylone-opt
> 
> opt -O3 select-from-shuffles.ll | ./bin/llc -march=arm64
> -mcpu=cyclone > arm64-cylone-opt
> 
> 
> 
> I can understand if we eventually drop this after we make the backend
> treat the select as the canonical form of this operation and have
> llvm always use that form, but for now it seems that the
> shufflevector is better lowered in some important targets.
> 
> 
> 
> 
> Filipe
> 
> 
> 
> 
> 
> 
> 
> On Fri, May 2, 2014 at 11:55 PM, Nadav Rotem < nrotem at apple.com >
> wrote:
> 
> 
> Hi Filipe,
> 
> We don’t generate new shufflevector instructions during optimizations
> because the lowering of shuffle instructions is really complicated
> and we don’t want to generate a shuffle that we don’t lower by
> accident.
> 
> Thanks,
> Nadav
> 
> 
> 
> On May 2, 2014, at 4:34 PM, Filipe Cabecinhas <
> filcab+llvm.phabricator at gmail.com > wrote:
> 
> > Remove unused argument
> > 
> > http://reviews.llvm.org/D3561
> > 
> > Files:
> > lib/Transforms/InstCombine/InstCombineSelect.cpp
> > test/Transforms/InstCombine/select.ll
> > <D3561.9050.patch>_______________________________________________
> > llvm-commits mailing list
> > llvm-commits at cs.uiuc.edu
> > http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
> 
> 
> 
> 
> 
> _______________________________________________
> llvm-commits mailing list llvm-commits at cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
> 
> 
> _______________________________________________
> llvm-commits mailing list
> llvm-commits at cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
> 

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory