[llvm-dev] RFC: [X86] Introducing command line options to prefer narrower vector instructions even when wider instructions are available

Eric Christopher via llvm-dev llvm-dev at lists.llvm.org
Mon Nov 13 15:49:37 PST 2017


On Mon, Nov 13, 2017 at 2:15 PM Craig Topper via llvm-dev <
llvm-dev at lists.llvm.org> wrote:

> On Sat, Nov 11, 2017 at 8:52 PM, Hal Finkel via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>>
>> On 11/11/2017 09:52 PM, UE US via llvm-dev wrote:
>>
>> If Skylake is that bad at AVX2
>>
>>
>> I don't think this says anything negative about AVX2, but rather about AVX-512.
>>
>
Right. I think where we've landed is that AVX/AVX2 is "bad" on Haswell/Broadwell
and AVX-512 is "bad" on Skylake, at least as far as random autovectorization
spread throughout the code is concerned.


>
>>
>> it belongs in -mcpu / -march IMO.
>>
>>
>> No. We'd still want to enable the architectural features for vector
>> intrinsics and the like.
>>
>
> I took this to mean that the feature should be enabled by default for
> -march=skylake-avx512.
>


Agreed.

-eric


>
>
>
>>
>>
>> Based on the current performance data we're seeing, we think we need to
>> ultimately default skylake-avx512 to -mprefer-vector-width=256.
>>
>>
>> Craig, is this for both integer and floating-point code?
>>
>
> I believe so, but I'll try to get confirmation from the people with more
> data.
>
>
>>
>>
>>  -Hal
>>
>>    Most people will build for the standard x86_64-pc-linux or whatever
>> anyway, and completely ignore the change. This will mainly affect those
>> who build their own software and optimize for their system, and many of
>> them have probably caught on to this already. I always thought that's what
>> -march was made for, really.
>>
>> GNOMETOYS
>>
>> On Sat, Nov 11, 2017 at 10:25 AM, Sanjay Patel via llvm-dev <
>> llvm-dev at lists.llvm.org> wrote:
>>
>>> Yes - I was thinking of FeatureFastScalarFSQRT / FeatureFastVectorFSQRT
>>> which are used by isFsqrtCheap(). These were added to override the default
>>> x86 sqrt estimate codegen with:
>>> https://reviews.llvm.org/D21379
>>>
>>> But I'm not sure we really need that kind of hack. Can we adjust the
>>> attribute in clang based on the target CPU? I.e., if you have something like:
>>> $ clang -O2 -march=skylake-avx512 foo.c
>>>
>>> Then you can detect that in the clang driver and pass
>>> -mprefer-vector-width=256 to clang codegen as an option? Clang codegen then
>>> adds that function attribute to everything it outputs. Then, the
>>> vectorizers and/or backend detect that attribute and adjust their behavior
>>> based on it.
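
As a rough sketch of that last step (purely illustrative; the
"prefer-vector-width" string attribute name and the helper below are
assumptions for the example, not existing LLVM code), a vectorizer or
backend hook could consume the per-function attribute like this:

    #include "llvm/IR/Function.h"

    // Sketch: read a hypothetical "prefer-vector-width" string attribute
    // from the function, falling back to a default width when absent.
    static unsigned getPreferredVectorWidth(const llvm::Function &F,
                                            unsigned DefaultWidth) {
      if (!F.hasFnAttribute("prefer-vector-width"))
        return DefaultWidth;
      unsigned Width = DefaultWidth;
      // The attribute value is a string such as "256"; parse it.
      F.getFnAttribute("prefer-vector-width")
          .getValueAsString()
          .getAsInteger(/*Radix=*/10, Width);
      return Width;
    }

TTI or the vectorizers could then clamp the register width they report to
this value rather than keying solely off a subtarget feature.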
>>>
>>
> Do we have a precedent for setting a target-independent flag from a
> target-specific CPU string in the clang driver? I want to make sure I
> understand what the processing for such a thing would look like,
> particularly how to get the ordering right so the user can override it.
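
To make the ordering question concrete, one possible shape for the driver
logic (purely illustrative; the option id and helper name are assumptions,
not existing clang code) would be to check for an explicit user flag before
injecting the CPU-based default:

    // Sketch: somewhere in clang's driver, when building -cc1 arguments.
    static void addVectorWidthDefault(const llvm::opt::ArgList &Args,
                                      llvm::opt::ArgStringList &CmdArgs,
                                      llvm::StringRef CPU) {
      // Hypothetical option id for -mprefer-vector-width=<N>.
      if (const llvm::opt::Arg *A =
              Args.getLastArg(options::OPT_mprefer_vector_width_EQ)) {
        A->render(Args, CmdArgs); // the user was explicit; forward as-is
        return;
      }
      if (CPU == "skylake-avx512")
        CmdArgs.push_back("-mprefer-vector-width=256"); // proposed default
    }

Because the explicit argument is checked first, a user-supplied
-mprefer-vector-width=512 would still override the CPU-derived default.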
>
>
>>
>>> So I don't think we should be messing with any kind of type legality
>>> checking because that stuff should all be correct already. We're just
>>> choosing a vector size based on a pref. I think we should even allow the
>>> pref to go bigger than a legal type. This came up somewhere on llvm-dev or
>>> in a bug recently in the context of vector reductions.
>>>
>>>
>>>
>>> On Fri, Nov 10, 2017 at 6:04 PM, Craig Topper <craig.topper at gmail.com>
>>> wrote:
>>>
>>>> Are you referring to the X86TargetLowering::isFsqrtCheap hook?
>>>>
>>>> ~Craig
>>>>
>>>> On Fri, Nov 10, 2017 at 7:39 AM, Sanjay Patel <spatel at rotateright.com>
>>>> wrote:
>>>>
>>>>> We can tie a user preference / override to a CPU model. We do
>>>>> something like that for square root estimates already (although it does use
>>>>> a SubtargetFeature currently for x86; ideally, we'd key that off of
>>>>> something in the CPU scheduler model).
>>>>>
>>>>>
>>>>> On Thu, Nov 9, 2017 at 4:21 PM, Craig Topper <craig.topper at gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I agree that a less x86-specific command line option makes sense. I've been
>>>>>> having internal discussions with gcc folks, and they're evaluating
>>>>>> switching to something like -mprefer-vector-width=128/256/512/none.
>>>>>>
>>>>>> Based on the current performance data we're seeing, we think we need
>>>>>> to ultimately default skylake-avx512 to -mprefer-vector-width=256. If we go
>>>>>> with a target-independent option/implementation, is there some way we could
>>>>>> still affect the default behavior in a target-specific way?
>>>>>>
>>>>>> ~Craig
>>>>>>
>>>>>> On Tue, Nov 7, 2017 at 9:06 AM, Sanjay Patel <spatel at rotateright.com>
>>>>>> wrote:
>>>>>>
>>>>>>> It's clear from the Intel docs how this has evolved, but from a
>>>>>>> compiler perspective, this isn't a Skylake "feature" :) ... nor an Intel
>>>>>>> feature, nor an x86 feature.
>>>>>>>
>>>>>>> It's a generic programmer hint for any target with multiple
>>>>>>> potential vector lengths.
>>>>>>>
>>>>>>> On x86, there's already a potential use case for this hint with a
>>>>>>> different starting motivation: re-vectorization. That's where we take C
>>>>>>> code that uses 128-bit vector intrinsics and selectively widen it to 256-
>>>>>>> or 512-bit vector ops based on a newer CPU target than the code was
>>>>>>> originally written for.
>>>>>>>
>>>>>>> I think it's just a matter of time before a customer requests the
>>>>>>> same ability for another target (maybe they already have and I don't know
>>>>>>> about it). So we should have a solution that recognizes that possibility.
>>>>>>>
>>>>>>> Note that having a target-independent implementation in the
>>>>>>> optimizer doesn't preclude a flag alias in clang to maintain compatibility
>>>>>>> with gcc.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Nov 7, 2017 at 2:02 AM, Tobias Grosser via llvm-dev <
>>>>>>> llvm-dev at lists.llvm.org> wrote:
>>>>>>>
>>>>>>>> On Fri, Nov 3, 2017, at 05:47, Craig Topper via llvm-dev wrote:
>>>>>>>> > That's a very good point about the ordering of the command line
>>>>>>>> options.
>>>>>>>> > gcc's current implementation treats -mprefer-avx256 as "prefer
>>>>>>>> 256 over
>>>>>>>> > 512" and -mprefer-avx128 as "prefer 128 over 256", which feels
>>>>>>>> weird for
>>>>>>>> > other reasons but has less of an ordering ambiguity.
>>>>>>>> >
>>>>>>>> > -mprefer-avx128 has been in gcc for many years and predates the
>>>>>>>> creation
>>>>>>>> > of
>>>>>>>> > avx512. -mprefer-avx256 was added a couple months ago.
>>>>>>>> >
>>>>>>>> > We've had an internal conversation with the implementor of
>>>>>>>> > -mprefer-avx256
>>>>>>>> > in gcc about making -mprefer-avx128 affect 512-bit vectors as
>>>>>>>> well. I'll
>>>>>>>> > bring up the ambiguity issue with them.
>>>>>>>> >
>>>>>>>> > Do we want to be compatible with gcc here?
>>>>>>>>
>>>>>>>> I certainly believe we would want to be compatible with gcc (if we
>>>>>>>> use
>>>>>>>> the same names).
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Tobias
>>>>>>>>
>>>>>>>> >
>>>>>>>> > ~Craig
>>>>>>>> >
>>>>>>>> > On Thu, Nov 2, 2017 at 7:18 PM, Eric Christopher <
>>>>>>>> echristo at gmail.com>
>>>>>>>> > wrote:
>>>>>>>> >
>>>>>>>> > >
>>>>>>>> > >
>>>>>>>> > > On Thu, Nov 2, 2017 at 7:05 PM James Y Knight via llvm-dev <
>>>>>>>> > > llvm-dev at lists.llvm.org> wrote:
>>>>>>>> > >
>>>>>>>> > >> On Wed, Nov 1, 2017 at 7:35 PM, Craig Topper via llvm-dev <
>>>>>>>> > >> llvm-dev at lists.llvm.org> wrote:
>>>>>>>> > >>
>>>>>>>> > >>> Hello all,
>>>>>>>> > >>>
>>>>>>>> > >>>
>>>>>>>> > >>>
>>>>>>>> > >>> I would like to propose adding the -mprefer-avx256 and
>>>>>>>> -mprefer-avx128
>>>>>>>> > >>> command line flags supported by latest GCC to clang. These
>>>>>>>> flags will be
>>>>>>>> > >>> used to limit the vector register size presented by TTI to
>>>>>>>> the vectorizers.
>>>>>>>> > >>> The backend will still be able to use wider registers for
>>>>>>>> code written
>>>>>>>> > >>> using the intrinsics in x86intrin.h. And the backend will
>>>>>>>> still be able to
>>>>>>>> > >>> use AVX512VL instructions and the additional XMM16-31 and
>>>>>>>> YMM16-31
>>>>>>>> > >>> registers.
>>>>>>>> > >>>
>>>>>>>> > >>>
>>>>>>>> > >>>
>>>>>>>> > >>> Motivation:
>>>>>>>> > >>>
>>>>>>>> > >>> -Using 512-bit operations on some Intel CPUs may cause a
>>>>>>>> decrease in CPU
>>>>>>>> > >>> frequency that may offset the gains from using the wider
>>>>>>>> register size. See
>>>>>>>> > >>> section 15.26 of Intel® 64 and IA-32 Architectures
>>>>>>>> Optimization Reference
>>>>>>>> > >>> Manual published October 2017.
>>>>>>>> > >>>
>>>>>>>> > >>
>>>>>>>> > >> I note the doc mentions that 256-bit AVX operations also have
>>>>>>>> the same
>>>>>>>> > >> issue with reducing the CPU frequency, which is nice to see
>>>>>>>> documented!
>>>>>>>> > >>
>>>>>>>> > >> There are also the issues discussed here <http://www.agner.org/
>>>>>>>> > >> optimize/blog/read.php?i=165> (and elsewhere) related to
>>>>>>>> warm-up time
>>>>>>>> > >> for the 256-bit execution pipeline, which is another concern
>>>>>>>> with using
>>>>>>>> > >> wide-vector ops.
>>>>>>>> > >>
>>>>>>>> > >>
>>>>>>>> > >> -The vector ALUs on ports 0 and 1 of the Skylake Server
>>>>>>>> microarchitecture
>>>>>>>> > >>> are only 256-bits wide. 512-bit instructions using these ALUs
>>>>>>>> must use both
>>>>>>>> > >>> ports. See section 2.1 of Intel® 64 and IA-32 Architectures
>>>>>>>> Optimization
>>>>>>>> > >>> Reference Manual published October 2017.
>>>>>>>> > >>>
>>>>>>>> > >>
>>>>>>>> > >>
>>>>>>>> > >>>  Implementation Plan:
>>>>>>>> > >>>
>>>>>>>> > >>> -Add prefer-avx256 and prefer-avx128 as SubtargetFeatures in
>>>>>>>> X86.td not
>>>>>>>> > >>> mapped to any CPU.
>>>>>>>> > >>>
>>>>>>>> > >>> -Add -mprefer-avx256 and -mprefer-avx128 and the corresponding
>>>>>>>> > >>> -mno-prefer-avx128/256 options to clang's driver Options.td
>>>>>>>> file. I believe
>>>>>>>> > >>> this will allow clang to pass these straight through to the
>>>>>>>> -target-feature
>>>>>>>> > >>> attribute in IR.
>>>>>>>> > >>>
>>>>>>>> > >>> -Modify X86TTIImpl::getRegisterBitWidth to only return 512 if
>>>>>>>> AVX512 is
>>>>>>>> > >>> enabled and neither prefer-avx256 nor prefer-avx128 is set.
>>>>>>>> Similarly, return
>>>>>>>> > >>> 256 if AVX is enabled and prefer-avx128 is not set.
>>>>>>>> > >>>
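
As a concrete (but purely illustrative) sketch of that TTI change - the
preferAVX256()/preferAVX128() subtarget accessors are assumed names for the
proposed features, and the rest only loosely mirrors the existing hook:

    // Sketch: X86TTIImpl::getRegisterBitWidth honoring the proposed
    // prefer-avx256 / prefer-avx128 subtarget features.
    unsigned X86TTIImpl::getRegisterBitWidth(bool Vector) const {
      if (Vector) {
        if (ST->hasAVX512() && !ST->preferAVX256() && !ST->preferAVX128())
          return 512;
        if (ST->hasAVX() && !ST->preferAVX128())
          return 256;
        if (ST->hasSSE1())
          return 128;
        return 32; // no vector registers available
      }
      return ST->is64Bit() ? 64 : 32;
    }

The vectorizers would then naturally stop forming 512-bit (or 256-bit)
operations, while intrinsics and explicit vector code in the source would
still be lowered using the full register width.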
>>>>>>>> > >>
>>>>>>>> > >> Instead of multiple flags that have difficult-to-understand
>>>>>>>> intersecting
>>>>>>>> > >> behavior, one flag with a value would be better. E.g., what
>>>>>>>> should
>>>>>>>> > >> "-mprefer-avx256 -mprefer-avx128 -mno-prefer-avx256" do? No
>>>>>>>> matter the
>>>>>>>> > >> answer, it's confusing. (Similarly with other such
>>>>>>>> combinations). Just a
>>>>>>>> > >> single arg "-mprefer-avx={128/256/512}" (with no "no" version)
>>>>>>>> seems easier
>>>>>>>> > >> to understand to me (keeping the same behavior as you mention:
>>>>>>>> asking to
>>>>>>>> > >> prefer a larger width than is supported by your architecture
>>>>>>>> should be fine
>>>>>>>> > >> but ignored).
>>>>>>>> > >>
>>>>>>>> > >>
>>>>>>>> > > I agree with this. It's a little more plumbing as far as
>>>>>>>> subtarget
>>>>>>>> > > features etc. go (represented via an optional value or just various
>>>>>>>> "set the AVX
>>>>>>>> > > width" features - the latter being easier, but uglier);
>>>>>>>> however, it's
>>>>>>>> > > probably the right thing to do.
>>>>>>>> > >
>>>>>>>> > > I was looking at this myself just a couple of weeks ago and think
>>>>>>>> this is the
>>>>>>>> > > right direction (when and how to turn things off) - and it
>>>>>>>> probably makes
>>>>>>>> > > sense as a default for these architectures? We might end up
>>>>>>>> needing to
>>>>>>>> > > check a couple of additional TTI places, but it sounds like
>>>>>>>> you're on top
>>>>>>>> > > of it. :)
>>>>>>>> > >
>>>>>>>> > > Thanks very much for doing this work.
>>>>>>>> > >
>>>>>>>> > > -eric
>>>>>>>> > >
>>>>>>>> > >
>>>>>>>> > >>
>>>>>>>> > >>
>>>>>>>> > >> There may be some other backend changes needed, but I plan to
>>>>>>>> address
>>>>>>>> > >>> those as we find them.
>>>>>>>> > >>>
>>>>>>>> > >>>
>>>>>>>> > >>> At a later point, consider making -mprefer-avx256 the default
>>>>>>>> for
>>>>>>>> > >>> Skylake Server due to the above-mentioned performance
>>>>>>>> considerations.
>>>>>>>> > >>>
>>>>>>>> > >>
>>>>>>>> > >>
>>>>>>>> > >>
>>>>>>>> > >>
>>>>>>>> > >>
>>>>>>>> > >>>
>>>>>>>> > >>> Does this sound reasonable?
>>>>>>>> > >>>
>>>>>>>> > >>>
>>>>>>>> > >>>
>>>>>>>> > >>> *Latest Intel Optimization manual available here:
>>>>>>>> > >>>
>>>>>>>> https://software.intel.com/en-us/articles/intel-sdm#optimization
>>>>>>>> > >>>
>>>>>>>> > >>>
>>>>>>>> > >>> -Craig Topper
>>>>>>>> > >>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>
>>
>>
>>
>> --
>> Hal Finkel
>> Lead, Compiler Technology and Programming Languages
>> Leadership Computing Facility
>> Argonne National Laboratory
>>
>>