[PATCH] Fold selects when the non-selected elements are undef from shufflevectors

Wed May 7 00:57:00 PDT 2014

----- Original Message -----
> From: "Nick Lewycky" <nicholas at mxc.ca>
> To: "Philip Reames" <listmail at philipreames.com>
> Cc: "Benjamin Kramer" <benny.kra at gmail.com>, "Filipe Cabecinhas" <filcab+llvm.phabricator at gmail.com>, "Eli Friedman"
> <eli.friedman at gmail.com>, "LLVM Commits" <llvm-commits at cs.uiuc.edu>,
> reviews+D3561+public+feaea29ac1f7f0ef at reviews.llvm.org
> Sent: Wednesday, May 7, 2014 2:14:35 AM
> Subject: Re: [PATCH] Fold selects when the non-selected elements are undef	from shufflevectors
> 
> Philip Reames wrote:
> > Nadav,
> >
> > Could you spell out what you believe to be the right approach
> > w.r.t.
> > this topic? I'd also appreciate any links to previous discussion on
> > this
> > topic; a quick google search didn't find much.
> >
> > Reading through Filipe's patch, it seems entirely reasonable to me.
> > It's
> > a strict reduction in the number of shufflevectors in the IR. Even
> > given
> > the fact we can't lower shuffle vector perfectly in all cases, I
> > don't
> > see the harm in this patch. I'm not trying to say you're wrong; I'm
> > merely trying to understand what it is I'm not seeing here.
> 
> I think the situation is that perfect lowering of arbitrary IR
> shufflevectors is hard, because the instructions offered by hardware
> are
> very particular, because the search space is very large, and because
> we
> don't have a good algorithm yet.
> 
> One previous thread is here:
> http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20130429/173544.html
> and it gets into group theory here:
> http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20130429/173601.html
> . Even this thread mention another thread the previous week.
> 
> The correct future design for llvm is to let the IR optimizer
> optimize
> the shufflevectors however it wants, and leave it to the backend to
> solve the problem.
> 
> However, this isn't what we do *today* and that's acceptable today
> people mostly write vectors in tight kernel loops themselves, or else
> the autovectorizer picked the exact right sequence for the hardware
> at hand.
> 
> In the future, those excuses aren't going to work. Code tuned for one
> microarch will find itself inlined all over a codebase where it will
> intersect code that was autovectorized to a different vector width,
> for
> instance. We need the optimizer to be able to optimize and let the
> code
> generator do the instruction selection.

I would like to think that this is true, but I think that given past experience, we'd need to demonstrate an algorithm capable of doing this. Perfect shuffle tables, at least by themselves, don't scale to large vector lengths, and I don't think anyone has yet demonstrated that optimized shuffle sequences can be generated with sufficient speed.

 -Hal

> 
> Nick
> 
> > On 05/06/2014 09:22 AM, Nadav Rotem wrote:
> >> Hi Filipe,
> >>
> >> Thank you for working on this. I appreciate the fact that you are
> >> eager to improve LLVM and that you went through the trouble of
> >> testing
> >> your code on multiple targets and on considering different
> >> alternatives to the problem. I think that the patch below is
> >> unrelated
> >> to the blend->select lowering so I will only address this patch.
> >> We
> >> discussed on the possibility of creating new shuffles on the llvm
> >> mailing lists more than once. We decided not to create new
> >> shuffles
> >> because we may create new shuffle patterns that we can’t lower
> >> efficiently. I ran into this problem a few times myself in
> >> different
> >> LLVM-based projects, mainly in the domain of graphics where people
> >> write swizzles themselves. Please do not commit this patch because
> >> it
> >> creates new shuffles. It is not the correct approach. You may get
> >> lucky in you selection of testcase, but I am sure that there are
> >> lots
> >> of other shuffle inputs on X86 and other targets that this patch
> >> makes
> >> worse.
> >>
> >> Thanks,
> >> Nadav
> >>
> >>
> >>> Hi Nadav,
> >>>
> >>> I'd really like this patch to get in, since it improves code
> >>> generation on many targets (at the very least, I've tried arm and
> >>> arm64 (mcpu=cyclone), ppc32 and ppc64 (mcpu=g5), and x86 and
> >>> x86-64
> >>> (mcpu=nehalem) and they all improve (they end up with no loads
> >>> vs. 1
> >>> or 2 loads, and code size is about half of what it was).
> >>>
> >>> Even if we'd like to make the select instruction be the proper
> >>> canonicalization of this operation, it seems that, for now, a
> >>> shufflevector is better when it comes to lowering it.
> >>>
> >>> Several targets have the action for the vselect as Expand, which
> >>> will
> >>> generate a bunch of code which would then have to be
> >>> pattern-matched,
> >>> instead of lowering the vselect as custom and falling-through to
> >>> Expand if there's no special-case instruction/set of instructions
> >>> that we can emit.
> >>>
> >>> I ran opt+llc in this way, for the variants I mentioned above
> >>> (./bin/opt is opt with this optimization):
> >>> ./bin/opt -O3 select-from-shuffles.ll | ./bin/llc -march=arm64
> >>> -mcpu=cyclone > arm64-cylone-opt
> >>> opt -O3 select-from-shuffles.ll | ./bin/llc -march=arm64
> >>> -mcpu=cyclone > arm64-cylone-opt
> >>>
> >>> I can understand if we eventually drop this after we make the
> >>> backend
> >>> treat the select as the canonical form of this operation and have
> >>> llvm always use that form, but for now it seems that the
> >>> shufflevector is better lowered in some important targets.
> >>>
> >>>
> >>> Filipe
> >>>
> >>>
> >>>
> >>>
> >>> On Fri, May 2, 2014 at 11:55 PM, Nadav Rotem <nrotem at apple.com
> >>> <mailto:nrotem at apple.com>> wrote:
> >>>
> >>>     Hi Filipe,
> >>>
> >>>     We don’t generate new shufflevector instructions during
> >>>     optimizations because the lowering of shuffle instructions is
> >>>     really complicated and we don’t want to generate a shuffle
> >>>     that
> >>>     we don’t lower by accident.
> >>>
> >>>     Thanks,
> >>>     Nadav
> >>>
> >>>     On May 2, 2014, at 4:34 PM, Filipe Cabecinhas
> >>>     <filcab+llvm.phabricator at gmail.com
> >>>     <mailto:filcab%2Bllvm.phabricator at gmail.com>> wrote:
> >>>
> >>>     > Remove unused argument
> >>>     >
> >>>     > http://reviews.llvm.org/D3561
> >>>     >
> >>>     > Files:
> >>>     > lib/Transforms/InstCombine/InstCombineSelect.cpp
> >>>     > test/Transforms/InstCombine/select.ll
> >>>     > <D3561.9050.patch>_______________________________________________
> >>>     > llvm-commits mailing list
> >>>     > llvm-commits at cs.uiuc.edu <mailto:llvm-commits at cs.uiuc.edu>
> >>>     > http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
> >>>
> >>>
> >>
> >>
> >>
> >> _______________________________________________
> >> llvm-commits mailing list
> >> llvm-commits at cs.uiuc.edu  <mailto:llvm-commits at cs.uiuc.edu>
> >> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
> >
> >
> >
> > _______________________________________________
> > llvm-commits mailing list
> > llvm-commits at cs.uiuc.edu
> > http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
> 
> _______________________________________________
> llvm-commits mailing list
> llvm-commits at cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
> 

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory