[llvm] r223360 - [X86] Improve a dag-combine that handles a vector extract -> zext sequence.

Quentin Colombet qcolombet at apple.com
Mon Dec 8 13:10:24 PST 2014


On Dec 4, 2014, at 11:10 AM, Quentin Colombet <qcolombet at apple.com> wrote:

> Hi,
> 
> On Dec 4, 2014, at 9:19 AM, Chandler Carruth <chandlerc at google.com> wrote:
> 
>> 
>> On Thu, Dec 4, 2014 at 9:05 AM, Kuperstein, Michael M <michael.m.kuperstein at intel.com> wrote:
>> It looks like in the cited PR it was the best sequence, but I agree with you, it may not be the case globally.
>> 
>> Which stalls are you talking about? I think domain crossing shouldn’t be a problem in this case, as the zexts would imply you want to be in the integer domain.
>> 
>> 
>> The domain crossing, as I understand it (and feel free to shed more detailed light on this aspect of Intel chips if you can, but I've failed to get any better clarification from Intel folks in the past), is more problematic than that.
>> 
>> It stems from separate execution units of some form (which form, and whether the "ports" described in modern Intel manuals attach to them or are fixed to them, isn't really important). Moving data in a register from one unit to the other stalls. This is just as true (if not more so) when moving data from an integer xmm register into a GPR as it is when moving data produced in the floating-point vector unit to an input of an integer vector unit instruction.
>> 
>> Previously, the *primary* cause of vector shuffle performance problems in the x86 backend was that it heavily relied on pextr and pinsr sequences to manually extract and insert elements into the desired positions. But the slowdowns were vastly out of proportion to the difference in instruction count. The best explanation, and one supported by various timing figures in Agner's tables and elsewhere, is that there is a rather massive penalty incurred by sequences of these instructions. In my benchmarking, I routinely saw this penalty be much higher than that of domain crossing between the integer and floating-point units on Intel chips. On AMD chips, the penalties were more even, but both were also significantly higher than on Intel chips.
>>  
>> 
>>  
>> 
>> Regarding systematic testing – no, since this is a fairly specific pattern.
>> 
>> Do you have any examples in mind that will match this, but be negatively impacted?
>> 
>> 
>> I would start off with checking LNT, maybe SPEC (although I'm loath to trust SPEC numbers for this kind of change).
>>  
>> 
>>  
>> 
>> Regarding patterns impacted by this - if I understand correctly, the pattern that this was introduced to catch was precisely the one the LIT test checks – 64-bit GEPs that use indices extracted from a <4 x i32> vector. There’s an rdar linked to the test. Quentin, do you think it’s worth checking what the impact of this is on the original issue?
>> 
>> 
> 
> I’ll have a look at the original radar and I’ll let you know.

This transformation introduces a 1% regression there.
I am not sure whether this comes from the transformation itself or whether it uncovers poor codegen for some shuffle.
Indeed, in this specific example, instead of a vpextr plus 1 mov from XMM to GPR, we now get a vpshuf plus 2 movs from XMM to GPR.
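
For reference, here is a minimal sketch of the kind of IR in play (illustrative only; the function and value names are made up and this is not the actual radar test case): indices are extracted from a <4 x i32> vector, zero-extended, and used as i64 GEP indices.

  ; Each zext(extractelement ...) pair below is the vector extract -> zext
  ; sequence the combine handles.
  define float @sum_two_lanes(float* %base, <4 x i32> %idx) {
  entry:
    %i0 = extractelement <4 x i32> %idx, i32 0
    %i1 = extractelement <4 x i32> %idx, i32 1
    %z0 = zext i32 %i0 to i64
    %z1 = zext i32 %i1 to i64
    ; The zero-extended lanes feed 64-bit address computations.
    %p0 = getelementptr inbounds float* %base, i64 %z0
    %p1 = getelementptr inbounds float* %base, i64 %z1
    %v0 = load float* %p0
    %v1 = load float* %p1
    %s = fadd float %v0, %v1
    ret float %s
  }

The codegen difference mentioned above is in how these extract + zext pairs get lowered into XMM-to-GPR moves.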

Trying to produce a reduced test case.

Cheers,
Q.

> 
>> 
>> This also might be uncovered by checking the LNT results.
>> 
>> 
>> All this said, I'm not certain of anything here. Maybe this is a strict win. I just think it needs more broad measurements than the PR shows.
> 
> I agree with Chandler, and in fact I thought that had been done. Therefore, by all means, please do performance measurements.
> 
> Thanks,
> Q.
