[PATCH] Optimize insertqi when we copy all the lower 64 bits.

Wed Apr 23 17:14:44 PDT 2014

Agreed. If we expressed this as target-neutral IR today, we'd end up with problems with LICM and similar passes hoisting the logical ops into a different block than the mask instruction, which breaks our ability to match them to the instruction. We have similar problems with other operations. When it's easy, we either teach CodeGenPrepare to sink the ops back down to enable matching (ugly), or we use an intrinsic to represent the whole operation, as this patch does, so that the splitting can't happen (also ugly).

Eventually this is exactly the sort of problem global ISel will help us solve, but that day is not today, unfortunately.

-Jim

On Apr 23, 2014, at 5:00 PM, Nadav Rotem <nrotem at apple.com> wrote:

> 
> On Apr 23, 2014, at 2:19 PM, Filipe Cabecinhas <filcab+llvm.phabricator at gmail.com> wrote:
> 
>> How could we use insertelement here?
>> Should we just (on the clang side) bitcast the vectors to <128 x i1> and use extractelement + insertelement?
>> That seems… hard to match on the output side.
> 
> After reading your comment I realized that insertqi does not only insert elements. It actually copies n bits from the source into the destination at bit location m. I think that we can represent this as a sequence of logical operations with a constant mask. Later we would need to pattern-match this instruction in isel.
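As a rough illustration of the "logical operations with a constant mask" idea, the low 64 bits of an insertqi can be modelled as a masked merge. This is a minimal sketch, not the patch's code: `insertq_low64` is a hypothetical helper, and it assumes the AMD-documented behaviour that a length of 0 means 64 and that `len + idx` must not exceed 64 (the hardware result is undefined otherwise). The upper 64 bits of the result are not modelled.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not an LLVM or intrinsic API): models the low 64 bits
 * of the SSE4a insertqi result with plain logical ops and a constant mask.
 * Assumptions: a length of 0 means 64 (as in AMD's documentation), and the
 * caller passes len <= 64, idx < 64, and len + idx <= 64, since the hardware
 * result is undefined otherwise. The upper 64 bits are not modelled. */
static uint64_t insertq_low64(uint64_t dst, uint64_t src,
                              unsigned len, unsigned idx) {
    if (len == 0)
        len = 64;
    assert(len <= 64 && idx < 64 && len + idx <= 64);

    /* `len` low ones; shifting 1 by 64 is undefined in C, so special-case it. */
    uint64_t field = (len == 64) ? ~UINT64_C(0) : (UINT64_C(1) << len) - 1;
    uint64_t mask = field << idx; /* a constant once len and idx are immediates */

    /* Keep dst outside the field; take the low `len` bits of src inside it. */
    return (dst & ~mask) | ((src << idx) & mask);
}
```

With `len == 64` (or 0) and `idx == 0`, `mask` is all ones and the expression reduces to `src`, which is the "copy all the lower 64 bits" case from the patch title. For any other field, the AND/OR/shift sequence is exactly the kind of pattern that LICM could split apart before isel gets a chance to match it.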
> 
> I think that eventually we should lower this intrinsic to generic IR, but until we do I am okay with the current patch.
> 
> Thanks,
> Nadav
> 
>> 
>> The instruction copies _bits_ from one vector to another. We can special-case it when we're copying from an 8/16/32/64-bit boundary, for a multiple of that many bits, but expressing the generic instruction in IR doesn't seem that easy.
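The special case mentioned here can be detected with a small predicate: when both the field length and the bit index are multiples of an element width w, insertqi only moves whole w-bit elements (destination lanes `idx/w` through `idx/w + len/w - 1` receive source lanes `0` through `len/w - 1`), so it could in principle be written with ordinary element operations such as a shuffle. This is a hypothetical helper for illustration, not part of the patch; it assumes the length has already been normalized (0 treated as 64) and that `len + idx <= 64`.

```c
#include <assert.h>

/* Hypothetical helper: returns the widest element width w in {64, 32, 16, 8}
 * such that an insertqi with this (len, idx) moves whole w-bit elements,
 * i.e. both len and idx are multiples of w. Returns 0 when the field is not
 * byte-aligned, which is the generic bit-level case that is hard to express
 * with element operations. Assumes len == 0 has already been treated as 64
 * and that len + idx <= 64. */
static unsigned insertq_element_width(unsigned len, unsigned idx) {
    assert(len >= 1 && len <= 64 && idx < 64 && len + idx <= 64);
    static const unsigned widths[] = {64, 32, 16, 8};
    for (unsigned i = 0; i < sizeof widths / sizeof widths[0]; ++i) {
        unsigned w = widths[i];
        if (len % w == 0 && idx % w == 0)
            return w;
    }
    return 0;
}
```

For example, `(len, idx) = (32, 16)` yields 16, so that insertqi moves two 16-bit lanes, while `(12, 4)` yields 0 and would still need the masked-merge form.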
>> 
>> I've also looked a bit more at SelectionDAG and I think it might be worth moving this optimization there. But I don't know whether it could fold a sequence of insertqi operations into a copy of the source vector when their ranges together cover bits [0,64).
>> 
>> Which of these would be better?
>> 
>> Regards,
>>   Filipe
>> 
>> 
>> On Wed, Apr 23, 2014 at 9:16 AM, Rafael Avila de Espindola <rafael.espindola at gmail.com> wrote:
>> 
>> > On Apr 23, 2014, at 12:08, Nadav Rotem <nrotem at apple.com> wrote:
>> >
>> >
>> >> On Apr 23, 2014, at 6:56 AM, Rafael Espíndola <rafael.espindola at gmail.com> wrote:
>> >>
>> >>> On 15 April 2014 14:04, Nadav Rotem <nrotem at apple.com> wrote:
>> >>> Hi Filipe,
>> >>>
>> >>> Why is this an IR-level transform? Could you implement this in SelectionDAG ?
>> >>
>> >> What is the advantage of doing this at SelectionDAG? Since this is an
>> >> intrinsic, we know all that we need at the IR level already. IR also
>> >> has the advantage of opening the potential for further optimizations
>> > It is not clear to me why we represent this intrinsic as an IR-level intrinsic and not as a regular insertelement instruction. We already have IR-level optimizations on insertelement and I prefer not to duplicate all of them.
>> >
>> 
>> That I fully agree with. If the operation can be represented with generic IR, that is by far the best solution.
>> 
>> 
>> >> and has much better testing than SelectionDAG.
>> >>
>> >> Cheers,
>> >> Rafael
>> >
>> 
> 
