[PATCH] Fold selects when the non-selected elements are undef from shufflevectors

Nick Lewycky nicholas at mxc.ca
Thu May 8 11:25:21 PDT 2014

Philip Reames wrote:
> On 05/07/2014 12:14 AM, Nick Lewycky wrote:
>> Philip Reames wrote:
>>> Nadav,
>>> Could you spell out what you believe to be the right approach w.r.t.
>>> this topic? I'd also appreciate any links to previous discussion on this
>>> topic; a quick google search didn't find much.
>>> Reading through Filipe's patch, it seems entirely reasonable to me. It's
>>> a strict reduction in the number of shufflevectors in the IR. Even given
>>> the fact we can't lower shuffle vector perfectly in all cases, I don't
>>> see the harm in this patch. I'm not trying to say you're wrong; I'm
>>> merely trying to understand what it is I'm not seeing here.
>> I think the situation is that perfect lowering of arbitrary IR
>> shufflevectors is hard, because the instructions offered by hardware
>> are very particular, because the search space is very large, and
>> because we don't have a good algorithm yet.
>> One previous thread is here:
>> http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20130429/173544.html
>> and it gets into group theory here:
>> http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20130429/173601.html
>> . Even this thread mentions another thread from the previous week.
>> The correct future design for llvm is to let the IR optimizer optimize
>> the shufflevectors however it wants, and leave it to the backend to
>> solve the problem.
> Thanks for the background on the general problem, that was quite helpful.
> After thinking about it a bit, I don't have a good general solution, but
> I find myself really bothered by the fact that we're avoiding
> canonicalization at the IR level (which might expose other non-trivial
> optimizations), because we're not sure how to lower something for some
> targets. This seems less than desirable.
> Do we have a means to determine whether a particular shufflevector is
> "trivially" (for some definition thereof) lowerable? If so, we could use
> that as a profitability check on optimizations like the one Filipe
> originally proposed. (e.g. if the new shufflevector maps to a single
> instruction, it's clearly better than two instructions -- hopefully)
> Given this would be using target information, the optimization might
> have to be moved later in the optimizer.

One "rule of thumb" I proposed is that there are certain shuffles that 
we should be able to lower on any architecture:

  * the no-op, obviously
  * splats
  * reverse all elements
  * any of the above with 'undef' slotted in for any of the elements

I forget whether I got this rule into the instcombine transform I made, 
but I think this is a decent compromise for now.
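Concretely, those "always lowerable" shuffles look like this in IR (a sketch; the <4 x i32> type is chosen arbitrarily):

```llvm
; the no-op (identity) shuffle
%id  = shufflevector <4 x i32> %v, <4 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
; splat of element 0
%spl = shufflevector <4 x i32> %v, <4 x i32> undef, <4 x i32> zeroinitializer
; reverse all elements
%rev = shufflevector <4 x i32> %v, <4 x i32> undef, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
; reverse with 'undef' slotted into one lane -- still fine by the rule above
%ru  = shufflevector <4 x i32> %v, <4 x i32> undef, <4 x i32> <i32 3, i32 undef, i32 1, i32 0>
```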

Also, frankly, we don't do perfect lowering of non-shuffle instructions 
either. If we can do a good enough job, then people who would die if 
they don't get the right machine instructions for vector shuffles can do 
what we tell all the other people who'd die if they don't get the right 
machine instructions: use assembly.
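For context, the fold named in the subject line looks roughly like this in IR (a sketch, not the exact patterns handled in D3561): when each side of a constant-mask select leaves its non-selected lanes undef, the whole thing collapses to one shuffle of the original operands.

```llvm
; two shuffles whose non-selected lanes are undef...
%a = shufflevector <4 x i32> %x, <4 x i32> undef, <4 x i32> <i32 0, i32 undef, i32 2, i32 undef>
%b = shufflevector <4 x i32> %y, <4 x i32> undef, <4 x i32> <i32 undef, i32 1, i32 undef, i32 3>
; ...combined by a constant-mask select:
%s = select <4 x i1> <i1 true, i1 false, i1 true, i1 false>, <4 x i32> %a, <4 x i32> %b
; folds to a single shuffle of the original operands:
%s2 = shufflevector <4 x i32> %x, <4 x i32> %y, <4 x i32> <i32 0, i32 5, i32 2, i32 7>
```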


> I'm not sure this particular idea is the right approach, but my thought
> process here is that maybe we can identify some subset of the original
> optimization which clearly *is* profitable, even given our lowering
> restrictions. That seems like a logical path forward if we can easily
> achieve it.
> Philip
>> However, this isn't what we do *today*, and that's acceptable: today,
>> people mostly write vector code in tight kernel loops themselves, or
>> else the autovectorizer picked the exact right sequence for the
>> hardware at hand.
>> In the future, those excuses aren't going to work. Code tuned for one
>> microarch will find itself inlined all over a codebase where it will
>> intersect code that was autovectorized to a different vector width,
>> for instance. We need the optimizer to be able to optimize and let the
>> code generator do the instruction selection.
>> Nick
>>> On 05/06/2014 09:22 AM, Nadav Rotem wrote:
>>>> Hi Filipe,
>>>> Thank you for working on this. I appreciate the fact that you are
>>>> eager to improve LLVM and that you went through the trouble of testing
>>>> your code on multiple targets and on considering different
>>>> alternatives to the problem. I think that the patch below is unrelated
>>>> to the blend->select lowering so I will only address this patch. We
>>>> discussed the possibility of creating new shuffles on the llvm
>>>> mailing lists more than once. We decided not to create new shuffles
>>>> because we may create new shuffle patterns that we can’t lower
>>>> efficiently. I ran into this problem a few times myself in different
>>>> LLVM-based projects, mainly in the domain of graphics where people
>>>> write swizzles themselves. Please do not commit this patch because it
>>>> creates new shuffles. It is not the correct approach. You may get
>>>> lucky in your selection of testcases, but I am sure that there are lots
>>>> of other shuffle inputs on X86 and other targets that this patch makes
>>>> worse.
>>>> Thanks,
>>>> Nadav
>>>>> Hi Nadav,
>>>>> I'd really like this patch to get in, since it improves code
>>>>> generation on many targets. At the very least, I've tried arm and
>>>>> arm64 (mcpu=cyclone), ppc32 and ppc64 (mcpu=g5), and x86 and x86-64
>>>>> (mcpu=nehalem), and they all improve: they end up with no loads vs. 1
>>>>> or 2 loads, and code size is about half of what it was.
>>>>> Even if we'd like to make the select instruction the canonical
>>>>> form of this operation, it seems that, for now, a shufflevector is
>>>>> better when it comes to lowering it.
>>>>> Several targets mark the action for vselect as Expand, which
>>>>> generates a bunch of code that would then have to be
>>>>> pattern-matched, instead of lowering the vselect as Custom and
>>>>> falling through to Expand if there's no special-case instruction
>>>>> (or set of instructions) that we can emit.
>>>>> I ran opt+llc in this way, for the variants I mentioned above
>>>>> (./bin/opt is opt with this optimization):
>>>>> ./bin/opt -O3 select-from-shuffles.ll | ./bin/llc -march=arm64
>>>>> -mcpu=cyclone > arm64-cylone-opt
>>>>> opt -O3 select-from-shuffles.ll | ./bin/llc -march=arm64
>>>>> -mcpu=cyclone > arm64-cylone-opt
>>>>> I can understand if we eventually drop this after we make the backend
>>>>> treat the select as the canonical form of this operation and have
>>>>> llvm always use that form, but for now it seems that the
>>>>> shufflevector is lowered better on some important targets.
>>>>> Filipe
>>>>> On Fri, May 2, 2014 at 11:55 PM, Nadav Rotem <nrotem at apple.com
>>>>> <mailto:nrotem at apple.com>> wrote:
>>>>> Hi Filipe,
>>>>> We don’t generate new shufflevector instructions during
>>>>> optimizations because the lowering of shuffle instructions is
>>>>> really complicated and we don’t want to generate a shuffle that
>>>>> we don’t lower by accident.
>>>>> Thanks,
>>>>> Nadav
>>>>> On May 2, 2014, at 4:34 PM, Filipe Cabecinhas
>>>>> <filcab+llvm.phabricator at gmail.com
>>>>> <mailto:filcab%2Bllvm.phabricator at gmail.com>> wrote:
>>>>> > Remove unused argument
>>>>> >
>>>>> > http://reviews.llvm.org/D3561
>>>>> >
>>>>> > Files:
>>>>> > lib/Transforms/InstCombine/InstCombineSelect.cpp
>>>>> > test/Transforms/InstCombine/select.ll
>>>>> > <D3561.9050.patch>_______________________________________________
>>>>> > llvm-commits mailing list
>>>>> > llvm-commits at cs.uiuc.edu <mailto:llvm-commits at cs.uiuc.edu>
>>>>> > http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
