[LLVMdev] Proposal to improve vzeroupper optimization strategy

Fri Sep 20 18:23:04 PDT 2013

Hey Sean,

On Fri, Sep 20, 2013 at 8:07 PM, Sean Silva <chisophugis at gmail.com> wrote:
> Is it realistic to worry about performance of vectorized code that does PIC
> calls into a non-vectorized sin() in libc? Maybe there's an example other
> than sin() that is more realistic?
>
> -- Sean Silva
>
>
> On Fri, Sep 20, 2013 at 7:11 PM, Eli Friedman <eli.friedman at gmail.com>
> wrote:
>>
>> On Fri, Sep 20, 2013 at 2:58 PM, Gao, Yunzhong
>> <yunzhong_gao at playstation.sony.com> wrote:
>>>
>>> Hi Eli,
>>>
>>> Thanks for the feedback. Please see below.
>>> - Gao.
>>>
>>>
>>>
>>> From: Eli Friedman [mailto:eli.friedman at gmail.com]
>>>
>>> Sent: Thursday, September 19, 2013 12:31 PM
>>>
>>> To: Gao, Yunzhong
>>>
>>> Cc: llvmdev at cs.uiuc.edu
>>>
>>> Subject: Re: [LLVMdev] Proposal to improve vzeroupper optimization
>>> strategy
>>>
>>>
>>>
>>> > This is essentially equivalent to "don't insert vzeroupper anywhere",
>>> > as
>>>
>>> > far as I can tell. (The case of SSE instructions without a v- prefixed
>>>
>>> > equivalent is rare enough we can separate it from this discussion.)
>>>
>>>
>>>
>>> So will you be interested in a patch that disables vzeroupper by default?
>>
>>
>> A patch which adds a switch/LLVM IR function attribute to disable
>> vzeroupper would be fine.  A patch that disables vzeroupper on your platform
>> would be fine (assuming the target triple is distinguishable).  Turning off
>> vzeroupper by default on all platforms is not fine.
>>
>>>
>>> I implemented this possibly over-engineering solution in our local tree
>>> to work
>>>
>>> around some bad instruction selection issues in LLVM backend. When
>>> benchmarking
>>>
>>> on our game codes, I noticed that sometimes legacy SSE instructions were
>>>
>>> selected despite existence of AVX equivalent, in which case the
>>> vzeroupper
>>>
>>> instruction was needed. And it is much easier to detect existence of
>>> vzeroupper
>>>
>>> instruction than to detect each single legacy SSE instructions.
>>>
>>>
>>>
>>> The instruction selection issues were later fixed in our tree (patches to
>>> be
>>>
>>> submitted later), at least for the handful of games I tested on. So a
>>> simple
>>>
>>> change to just disable vzeroupper by default will be acceptable to us as
>>> well.
>>>
>>>
>>>
>>> > The reason we need vzeroupper in the first place is because we can't
>>> > assume
>>>
>>> > other functions won't use legacy SSE instructions; for example, on most
>>>
>>> > systems, calling sin() will use legacy SSE instructions.  I mean, if
>>> > you can
>>>
>>> > make some unusual guarantee about your platform, it might make sense to
>>>
>>> > disable vzeroupper generation in general, but it simply doesn't make
>>> > sense
>>>
>>> > on most platforms.
>>>
>>>
>>>
>>> I am confused by this point. By "most systems," do you have in mind a
>>> platform
>>>
>>> where the sin() function was compiled by gcc but the application codes
>>> were
>>>
>>> compiled by clang?
>>
>>
>> On, for example, OS X, AVX is not enabled by default, so the sin()
>> function uses legacy SSE instructions.  Users can still turn on AVX in their
>> applications.
>>
>> -Eli

On our systems, there are several libraries that are not compiled for
a particular target by default. The reason is that we support many
targets and choose the lowest common denominator for packaging
reasons. Also, we have no control over how user libraries are compiled
(which compiler or which target).

Let's also note that the offending legacy SSE call does not need to be
found within vectorized code. It just has to occur after vectorized
code to incur the transition penalty. For example:

void kung() {
  ... vectorized VEX.256 code ...

  ... lots of scalar VEX.128 code ...

  while(x < y) {
    ... vzeroupper ...
    ... call to legacy SSE function ...
    x++;
  }
}

On a side note, in such a situation, we found it most profitable to
hoist the vzeroupper out of the loop, so that it is executed only once
instead of on every iteration. Beyond that, we spend a good amount of
compile time finding near-optimal vzeroupper placement, which has a
noticeable impact on performance.
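
To make the hoisting point concrete, here is a toy C model of kung()
above. It is only a sketch, not our actual pass and not a measurement
of real hardware: it just counts how many vzeroupper instructions
execute and how many legacy SSE calls are entered with dirty upper
halves. The names (simulate_kung, placement_t, counts_t) are made up
for illustration, and the model assumes the legacy SSE callee never
dirties the upper halves itself.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical placement strategies for the vzeroupper in kung(). */
typedef enum { PLACE_NONE, PLACE_IN_LOOP, PLACE_HOISTED } placement_t;

typedef struct {
    int vzeroupper_executed; /* dynamic count of vzeroupper instructions */
    int dirty_sse_calls;     /* legacy SSE calls entered with dirty upper halves */
} counts_t;

/* Toy model: 'iterations' stands in for the trip count of the while loop. */
counts_t simulate_kung(placement_t placement, int iterations)
{
    assert(iterations >= 0);

    counts_t c = {0, 0};

    /* Assumption: the VEX.256 code left the upper halves dirty, and the
       scalar VEX.128 code that follows does not clean them. */
    bool upper_dirty = true;

    if (placement == PLACE_HOISTED) {
        /* One vzeroupper in the loop preheader. */
        c.vzeroupper_executed++;
        upper_dirty = false;
    }

    for (int x = 0; x < iterations; x++) {
        if (placement == PLACE_IN_LOOP) {
            /* One vzeroupper per iteration, right before the call. */
            c.vzeroupper_executed++;
            upper_dirty = false;
        }
        if (upper_dirty) {
            /* The call would pay the transition penalty in this model. */
            c.dirty_sse_calls++;
        }
        /* Assumption: the callee uses only legacy SSE, so upper_dirty
           is unchanged by the call. */
    }
    return c;
}
```

In this model, both PLACE_IN_LOOP and PLACE_HOISTED avoid every dirty
call, but the hoisted version executes a single vzeroupper regardless
of the trip count. The hoisted version does execute its vzeroupper
even when the loop runs zero times, which is harmless but not free.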

Hope that helps,
Cameron