[PATCH] Fix bug in x86's peephole optimization

Andrea Di Biagio andrea.dibiagio at gmail.com
Fri Sep 12 15:39:32 PDT 2014


On Fri, Sep 12, 2014 at 9:41 PM, Akira Hatanaka <ahatanak at gmail.com> wrote:
> I thought about this more, and I think it is incorrect in general (depending
> on the rounding mode) to replace the following two instructions with addsd.
>
> movsd (%rsp), %xmm0
> addpd  %xmm0, %xmm1
>
> If the higher part of %xmm1 is -0.0, the result of the addpd will be (0.0,
> %xmm1_low + %xmm0_low), since -0.0 + 0.0 = 0.0. However, if we use addsd,
> the contents of %xmm1 will be (-0.0, %xmm1_low + %xmm0_low).
>
> We can check the rounding mode to see if this optimization is legal, but I'm
> not sure if it's worthwhile.

Ah, I see. That makes sense.
I hadn't thought about what could happen in the presence of negative
zeros in the destination. Sorry for the wrong suggestion.
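
Just to convince myself, I tried a small standalone C++ test of the
signed-zero behaviour. It is purely illustrative and has nothing to do
with the patch itself; the volatile operands only keep the compiler
from folding the additions away (strictly speaking one would also want
#pragma STDC FENV_ACCESS ON):

  #include <cfenv>
  #include <cmath>
  #include <cstdio>

  int main() {
    // volatile so the additions really happen at run time, under the
    // rounding mode selected with fesetround.
    volatile double negZero = -0.0, posZero = 0.0;

    std::fesetround(FE_TONEAREST);    // default rounding
    double r1 = negZero + posZero;    // +0.0: addpd would lose the sign

    std::fesetround(FE_DOWNWARD);     // round toward -infinity
    double r2 = negZero + posZero;    // -0.0: what addsd would preserve

    std::printf("signbit to-nearest: %d, downward: %d\n",
                std::signbit(r1), std::signbit(r2));  // prints 0 and 1
    return 0;
  }

So the sum is only -0.0 under round-toward-negative-infinity, which
lines up with your point that the rewrite is only legal for some
rounding modes.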

I agree that it is not worthwhile to check for the rounding mode.
Your patch makes sense to me (although I don't think I am
knowledgeable enough to give the final approval).
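
For the record, the shape of the check I have in mind (the names below
are made up; this is only a sketch of the idea, not the code in your
patch) is to refuse the fold whenever the load writes fewer bytes than
the destination register holds:

  // Illustrative only: a narrower-than-register load zero-extends the
  // destination, so folding it into a full-width packed instruction
  // would read extra bytes from memory instead of zeroing the upper
  // lanes.
  static bool isSafeToFoldLoad(unsigned LoadSizeInBytes,
                               unsigned RegSizeInBytes) {
    // e.g. MOVSDrm writes 8 bytes of a 16-byte XMM register and zeroes
    // the rest; ADDPDrm would instead read all 16 bytes from the stack
    // slot, which is the miscompile described in the original report
    // below.
    return LoadSizeInBytes >= RegSizeInBytes;
  }

and, as you note, that check has to fire for stack-slot loads as well
as the other cases.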

Thanks,
Andrea

>
> On Fri, Sep 12, 2014 at 7:36 AM, Akira Hatanaka <ahatanak at gmail.com> wrote:
>>
>> Hi Andrea,
>>
>> I agree with your analysis.
>>
>> I think I can add an optimization in x86 custom DAGCombine. If I transform
>> (add buildvector(a,b), buildvector(c,0)) into (buildvector(add(a,c),b)),
>> code-gen emits ADDSDrm instead of ADDPDrm. If this optimization is
>> profitable for other targets too, I can probably move it to
>> target-independent DAGCombine or implement it in InstCombine.
>>
>> We still need the check in X86InstrInfo to make sure we aren't incorrectly
>> folding MOVSDrm into ADDPDrr.
>>
>> On Thu, Sep 11, 2014 at 2:08 PM, Andrea Di Biagio
>> <andrea.dibiagio at gmail.com> wrote:
>>>
>>> Hi Akira,
>>>
>>> I agree that it is wrong to fold the sequence (MOVSDrm, ADDPDrr) into
>>> ADDPDrm because MOVSDrm would zero the upper 64-bits in the
>>> destination register.
>>> However, I think it would be valid to fold the sequence (MOVSDrm,
>>> ADDPDrr) into a single ADDSDrm. Both MOVSDrm and ADDSDrm are subject
>>> to the same alignment restrictions for the memory operand; therefore,
>>> in your example, it should be safe to replace MOVSDrm with ADDSDrm.
>>>
>>>   Example:
>>>      movsd (%rsp), %xmm0
>>>      addpd  %xmm0, %xmm1
>>>  -->
>>>      addsd  (%rsp), %xmm1
>>>
>>> The same can be done for all SSE/AVX packed fp arithmetic instructions
>>> (example: movss/d + subps/d --> subss/d; etc.).
>>>
>>> So, to me the problem in the example is that we are using the wrong
>>> opcode for the folded instruction. By disabling the folding of a
>>> zero-extending load we would fix the problem with the miscompile,
>>> however we would also miss an opportunity to further simplify the
>>> code.
>>> If you agree with this analysis, do you think it would be possible to
>>> modify the current logic to emit the correct opcode? (example: ADDSDrm
>>> instead of ADDPDrm).
>>>
>>> I hope this makes sense (I am not very familiar with the method
>>> 'foldMemoryOperandImpl').
>>>
>>> Thanks,
>>> Andrea
>>>
>>> On Thu, Sep 11, 2014 at 6:54 PM, Akira Hatanaka <ahatanak at gmail.com>
>>> wrote:
>>> > I found a bug in X86InstrInfo::foldMemoryOperandImpl which results in
>>> > incorrectly folding a zero-extending 64-bit load from the stack into an
>>> > addpd, which is a SIMD add of two packed double precision floating
>>> > point values.
>>> >
>>> > The machine IR looks like this before peephole optimization:
>>> >
>>> > %vreg21<def> = MOVSDrm <fi#0>, 1, %noreg, 0, %noreg;
>>> > mem:LD8[%7](align=16)(tbaa=<badref>) VR128:%vreg21
>>> > %vreg23<def,tied1> = ADDPDrr %vreg20<tied0>, %vreg21;
>>> > VR128:%vreg23,%vreg20,%vreg21
>>> >
>>> > After the optimization, MOVSDrm is folded into ADDPDrm:
>>> >
>>> > %vreg23<def,tied1> = ADDPDrm %vreg20<tied0>, <fi#0>, 1, %noreg, 0,
>>> > %noreg;
>>> > mem:LD8[%7](align=16)(tbaa=<badref>) VR128:%vreg23,%vreg20
>>> >
>>> > ADDPDrm loads a 128-bit value from memory and adds it to %vreg20.
>>> >
>>> > X86InstrInfo::foldMemoryOperandImpl already has the logic that
>>> > prevents this from happening (near line 4544); however, the check
>>> > isn't performed for loads from stack objects. The attached patch
>>> > factors out this logic into a new function and uses it to check that
>>> > loads from stack slots are not zero-extending loads.
>>> >
>>> > This fixes rdar://problem/18236850.
>>> >
>>> > _______________________________________________
>>> > llvm-commits mailing list
>>> > llvm-commits at cs.uiuc.edu
>>> > http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
>>> >
>>
>>
>


