[llvm-dev] RFC: [X86] Introducing command line options to prefer narrower vector instructions even when wider instructions are available

Craig Topper via llvm-dev llvm-dev at lists.llvm.org
Tue Nov 14 10:58:31 PST 2017


Erich Keane just brought up an extra complication with clang trying to
detect -march=skylake-avx512 and set this new attribute. It's not just the
command line that we need to worry about: we also need to support it when
arch=skylake-avx512 appears in a target function attribute. I need to see
whether gcc supports prefer-avx128 in the target attribute too, because you
might want to override this on a per-function basis. I'm not even sure I
know how command line options and target attributes interact today.
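To pin down the per-function override question, here is a minimal sketch of the precedence rule I'd expect (illustrative names only, not clang's actual option-processing code): a width from the target function attribute, when present, wins over the module-wide value derived from the command line.

```cpp
#include <optional>

// Illustrative model only: the per-function attribute value, when present,
// overrides the preference derived from the command line / -march default.
struct FunctionInfo {
  std::optional<unsigned> PreferVectorWidthAttr; // from a target attribute
};

unsigned effectivePreferredWidth(const FunctionInfo &F,
                                 unsigned CommandLineWidth) {
  // Function attribute beats command line; otherwise fall back.
  return F.PreferVectorWidthAttr.value_or(CommandLineWidth);
}
```

Whether gcc (or clang) actually accepts such a preference inside the target attribute is exactly the open question above; the sketch only fixes the override semantics we'd want.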

~Craig

On Tue, Nov 14, 2017 at 10:10 AM, Sanjay Patel <spatel at rotateright.com>
wrote:

> I haven't looked into actually implementing revectorization, so we may
> just want to ignore that possibility for now.
>
> But I imagined that revectorization could hit the same problem that we're
> trying to avoid here: if the cost models say that wider vectors are legal
> and cheaper, but the reality is that perf will suffer when using those
> wider vectors, then we want to avoid using the wider ops. The user
> pref/override will be taken into account when deciding if we should go
> wider.
>
> In either scenario, we're not actually removing or limiting vector widths,
> right? They're still legal as far as the ISA is concerned. We're just
> avoiding those ops because the programmer and/or the CPU model says we'll
> do better with narrower ops.
>
>
> On Tue, Nov 14, 2017 at 10:26 AM, Craig Topper via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>> The re-vectorization case mentioned by Sanjay seems like a different type
>> of limit than what's being proposed here. For revectorization you want to
>> remove smaller vector widths; this is removing larger vector widths. I
>> don't think we want the -mprefer-vector-width=256 being proposed here to
>> say we can't use 128-bit vectors alongside the 256-bit ones. Maybe this
>> should be called -mlimit-vector-width?
>>
>> It's not clear to me why revectorization would need a preference at all.
>> Shouldn't we be able to decide from the cost models? We go from scalar to
>> vector today based on cost models, so why couldn't we go from vector to a
>> wider vector?
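That cost-model decision could be framed the same way as today's scalar-to-vector one. A hedged sketch (made-up per-lane costs, not real TTI queries), with the user preference acting only as a cap on the width considered:

```cpp
// Pick the wider width only when the cost model says it's cheaper per lane
// and the user preference doesn't cap it. Costs are illustrative placeholders.
struct Candidate {
  unsigned Width;       // vector width in bits
  unsigned CostPerLane; // model cost, lower is better
};

unsigned pickWidth(const Candidate &Narrow, const Candidate &Wide,
                   unsigned PreferredMaxWidth) {
  if (Wide.Width > PreferredMaxWidth) // user pref/override caps the width
    return Narrow.Width;
  return (Wide.CostPerLane < Narrow.CostPerLane) ? Wide.Width : Narrow.Width;
}
```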
>>
>> ~Craig
>>
>> On Mon, Nov 13, 2017 at 3:54 PM, Hal Finkel <hfinkel at anl.gov> wrote:
>>
>>>
>>>
>>> On 11/13/2017 05:49 PM, Eric Christopher wrote:
>>>
>>>
>>>
>>> On Mon, Nov 13, 2017 at 2:15 PM Craig Topper via llvm-dev <
>>> llvm-dev at lists.llvm.org> wrote:
>>>
>>>> On Sat, Nov 11, 2017 at 8:52 PM, Hal Finkel via llvm-dev <
>>>> llvm-dev at lists.llvm.org> wrote:
>>>>
>>>>>
>>>>> On 11/11/2017 09:52 PM, UE US via llvm-dev wrote:
>>>>>
>>>>> If skylake is that bad at AVX2
>>>>>
>>>>>
>>>>> I don't think this says anything negative about AVX2, but AVX-512.
>>>>>
>>>>
>>> Right. I think where we're at is that AVX/AVX2 is "bad" on Haswell/Broadwell
>>> and AVX512 is "bad" on Skylake, at least in the "random autovectorization
>>> spread out" aspect.
>>>
>>>
>>>>
>>>>>
>>>>> it belongs in -mcpu / -march IMO.
>>>>>
>>>>>
>>>>> No. We'd still want to enable the architectural features for vector
>>>>> intrinsics and the like.
>>>>>
>>>>
>>>> I took this to mean that the feature should be enabled by default for
>>>> -march=skylake-avx512.
>>>>
>>>
>>>
>>> Agreed.
>>>
>>>
>>> Yes. Also, GNOMETOYS clarified to me (off list) that this is what he meant.
>>>
>>>  -Hal
>>>
>>>
>>>
>>> -eric
>>>
>>>
>>>>
>>>>
>>>>
>>>>>
>>>>>
>>>>> Based on the current performance data we're seeing, we think we need
>>>>> to ultimately default skylake-avx512 to -mprefer-vector-width=256.
>>>>>
>>>>>
>>>>> Craig, is this for both integer and floating-point code?
>>>>>
>>>>
>>>> I believe so, but I'll try to get confirmation from the people with
>>>> more data.
>>>>
>>>>
>>>>>
>>>>>
>>>>>  -Hal
>>>>>
>>>>> Most people will build for the standard x86_64-pc-linux or whatever
>>>>> anyway, and will completely ignore the change. This will mainly affect
>>>>> those who build their own software and optimize for their system, and
>>>>> many of them have probably caught on to this already. I always thought
>>>>> that's what -march was made for, really.
>>>>>
>>>>> GNOMETOYS
>>>>>
>>>>> On Sat, Nov 11, 2017 at 10:25 AM, Sanjay Patel via llvm-dev <
>>>>> llvm-dev at lists.llvm.org> wrote:
>>>>>
>>>>>> Yes - I was thinking of FeatureFastScalarFSQRT /
>>>>>> FeatureFastVectorFSQRT which are used by isFsqrtCheap(). These were added
>>>>>> to override the default x86 sqrt estimate codegen with:
>>>>>> https://reviews.llvm.org/D21379
>>>>>>
>>>>>> But I'm not sure we really need that kind of hack. Can we adjust the
>>>>>> attribute in clang based on the target cpu? I.e., if you have something
>>>>>> like:
>>>>>> $ clang -O2 -march=skylake-avx512 foo.c
>>>>>>
>>>>>> then you could detect that in the clang driver and pass
>>>>>> -mprefer-vector-width=256 to clang codegen as an option. Clang codegen
>>>>>> then adds that function attribute to everything it outputs, and the
>>>>>> vectorizers and/or backend detect that attribute and adjust their
>>>>>> behavior based on it.
>>>>>>
>>>>>
>>>> Do we have a precedent for setting a target-independent flag from a
>>>> target-specific cpu string in the clang driver? I want to make sure I
>>>> understand what the processing for such a thing would look like,
>>>> particularly to get the order right so the user can override it.
>>>>
>>>>
>>>>>
>>>>>> So I don't think we should be messing with any kind of type legality
>>>>>> checking because that stuff should all be correct already. We're just
>>>>>> choosing a vector size based on a pref. I think we should even allow the
>>>>>> pref to go bigger than a legal type. This came up somewhere on llvm-dev or
>>>>>> in a bug recently in the context of vector reductions.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Nov 10, 2017 at 6:04 PM, Craig Topper <craig.topper at gmail.com
>>>>>> > wrote:
>>>>>>
>>>>>>> Are you referring to the X86TargetLowering::isFsqrtCheap hook?
>>>>>>>
>>>>>>> ~Craig
>>>>>>>
>>>>>>> On Fri, Nov 10, 2017 at 7:39 AM, Sanjay Patel <
>>>>>>> spatel at rotateright.com> wrote:
>>>>>>>
>>>>>>>> We can tie a user preference / override to a CPU model. We do
>>>>>>>> something like that for square root estimates already (although it does use
>>>>>>>> a SubtargetFeature currently for x86; ideally, we'd key that off of
>>>>>>>> something in the CPU scheduler model).
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Nov 9, 2017 at 4:21 PM, Craig Topper <
>>>>>>>> craig.topper at gmail.com> wrote:
>>>>>>>>
>>>>>>>>> I agree that a less x86-specific command line makes sense. I've been
>>>>>>>>> having internal discussions with gcc folks, and they're evaluating
>>>>>>>>> switching to something like -mprefer-vector-width=128/256/512/none.
>>>>>>>>>
>>>>>>>>> Based on the current performance data we're seeing, we think we need
>>>>>>>>> to ultimately default skylake-avx512 to -mprefer-vector-width=256. If
>>>>>>>>> we go with a target-independent option/implementation, is there some
>>>>>>>>> way we could still affect the default behavior in a target-specific
>>>>>>>>> way?
>>>>>>>>>
>>>>>>>>> ~Craig
>>>>>>>>>
>>>>>>>>> On Tue, Nov 7, 2017 at 9:06 AM, Sanjay Patel <
>>>>>>>>> spatel at rotateright.com> wrote:
>>>>>>>>>
>>>>>>>>>> It's clear from the Intel docs how this has evolved, but from a
>>>>>>>>>> compiler perspective, this isn't a Skylake "feature" :) ... nor an Intel
>>>>>>>>>> feature, nor an x86 feature.
>>>>>>>>>>
>>>>>>>>>> It's a generic programmer hint for any target with multiple
>>>>>>>>>> potential vector lengths.
>>>>>>>>>>
>>>>>>>>>> On x86, there's already a potential use case for this hint with a
>>>>>>>>>> different starting motivation: re-vectorization. That's where we take C
>>>>>>>>>> code that uses 128-bit vector intrinsics and selectively widen it to 256-
>>>>>>>>>> or 512-bit vector ops based on a newer CPU target than the code was
>>>>>>>>>> originally written for.
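As a scalar model of that widening (no real intrinsics; the lane counts 4 and 8 stand in for 128-bit and 256-bit ops), the rewrite must preserve semantics while halving the number of steps:

```cpp
#include <cstddef>
#include <vector>

// "128-bit" form: four float lanes per step (stands in for one XMM op).
void addBy4(const std::vector<float> &A, const std::vector<float> &B,
            std::vector<float> &C) {
  for (std::size_t I = 0; I + 4 <= A.size(); I += 4)
    for (std::size_t J = 0; J < 4; ++J)
      C[I + J] = A[I + J] + B[I + J];
}

// Re-vectorized "256-bit" form: eight lanes per step (stands in for one
// YMM op). Same results, half as many steps, for a newer CPU target.
void addBy8(const std::vector<float> &A, const std::vector<float> &B,
            std::vector<float> &C) {
  for (std::size_t I = 0; I + 8 <= A.size(); I += 8)
    for (std::size_t J = 0; J < 8; ++J)
      C[I + J] = A[I + J] + B[I + J];
}
```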
>>>>>>>>>>
>>>>>>>>>> I think it's just a matter of time before a customer requests the
>>>>>>>>>> same ability for another target (maybe they already have and I don't know
>>>>>>>>>> about it). So we should have a solution that recognizes that possibility.
>>>>>>>>>>
>>>>>>>>>> Note that having a target-independent implementation in the
>>>>>>>>>> optimizer doesn't preclude a flag alias in clang to maintain compatibility
>>>>>>>>>> with gcc.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Nov 7, 2017 at 2:02 AM, Tobias Grosser via llvm-dev <
>>>>>>>>>> llvm-dev at lists.llvm.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> On Fri, Nov 3, 2017, at 05:47, Craig Topper via llvm-dev wrote:
>>>>>>>>>>> > That's a very good point about the ordering of the command line
>>>>>>>>>>> > options. gcc's current implementation treats -mprefer-avx256 as
>>>>>>>>>>> > "prefer 256 over 512" and -mprefer-avx128 as "prefer 128 over
>>>>>>>>>>> > 256", which feels weird for other reasons but has less of an
>>>>>>>>>>> > ordering ambiguity.
>>>>>>>>>>> >
>>>>>>>>>>> > -mprefer-avx128 has been in gcc for many years and predates the
>>>>>>>>>>> > creation of avx512. -mprefer-avx256 was added a couple of months
>>>>>>>>>>> > ago.
>>>>>>>>>>> >
>>>>>>>>>>> > We've had an internal conversation with the implementor of
>>>>>>>>>>> > -mprefer-avx256 in gcc about making -mprefer-avx128 affect 512-bit
>>>>>>>>>>> > vectors as well. I'll bring up the ambiguity issue with them.
>>>>>>>>>>> >
>>>>>>>>>>> > Do we want to be compatible with gcc here?
>>>>>>>>>>>
>>>>>>>>>>> I certainly believe we would want to be compatible with gcc (if we
>>>>>>>>>>> use the same names).
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Tobias
>>>>>>>>>>>
>>>>>>>>>>> >
>>>>>>>>>>> > ~Craig
>>>>>>>>>>> >
>>>>>>>>>>> > On Thu, Nov 2, 2017 at 7:18 PM, Eric Christopher <
>>>>>>>>>>> echristo at gmail.com>
>>>>>>>>>>> > wrote:
>>>>>>>>>>> >
>>>>>>>>>>> > >
>>>>>>>>>>> > >
>>>>>>>>>>> > > On Thu, Nov 2, 2017 at 7:05 PM James Y Knight via llvm-dev <
>>>>>>>>>>> > > llvm-dev at lists.llvm.org> wrote:
>>>>>>>>>>> > >
>>>>>>>>>>> > >> On Wed, Nov 1, 2017 at 7:35 PM, Craig Topper via llvm-dev <
>>>>>>>>>>> > >> llvm-dev at lists.llvm.org> wrote:
>>>>>>>>>>> > >>
>>>>>>>>>>> > >>> Hello all,
>>>>>>>>>>> > >>>
>>>>>>>>>>> > >>>
>>>>>>>>>>> > >>>
>>>>>>>>>>> > >>> I would like to propose adding the -mprefer-avx256 and
>>>>>>>>>>> > >>> -mprefer-avx128 command line flags supported by the latest GCC
>>>>>>>>>>> > >>> to clang. These flags will be used to limit the vector register
>>>>>>>>>>> > >>> size presented by TTI to the vectorizers. The backend will
>>>>>>>>>>> > >>> still be able to use wider registers for code written using
>>>>>>>>>>> > >>> the intrinsics in x86intrin.h, and the backend will still be
>>>>>>>>>>> > >>> able to use AVX512VL instructions and the additional XMM16-31
>>>>>>>>>>> > >>> and YMM16-31 registers.
>>>>>>>>>>> > >>>
>>>>>>>>>>> > >>>
>>>>>>>>>>> > >>>
>>>>>>>>>>> > >>> Motivation:
>>>>>>>>>>> > >>>
>>>>>>>>>>> > >>> -Using 512-bit operations on some Intel CPUs may cause a
>>>>>>>>>>> > >>> decrease in CPU frequency that may offset the gains from using
>>>>>>>>>>> > >>> the wider register size. See section 15.26 of the Intel® 64
>>>>>>>>>>> > >>> and IA-32 Architectures Optimization Reference Manual
>>>>>>>>>>> > >>> published October 2017.
>>>>>>>>>>> > >>>
>>>>>>>>>>> > >>
>>>>>>>>>>> > >> I note the doc mentions that 256-bit AVX operations also have
>>>>>>>>>>> > >> the same issue with reducing the CPU frequency, which is nice
>>>>>>>>>>> > >> to see documented!
>>>>>>>>>>> > >>
>>>>>>>>>>> > >> There are also the issues discussed here
>>>>>>>>>>> > >> <http://www.agner.org/optimize/blog/read.php?i=165> (and
>>>>>>>>>>> > >> elsewhere) related to warm-up time for the 256-bit execution
>>>>>>>>>>> > >> pipeline, which is another issue with using wide-vector ops.
>>>>>>>>>>> > >>
>>>>>>>>>>> > >>
>>>>>>>>>>> > >>> -The vector ALUs on ports 0 and 1 of the Skylake Server
>>>>>>>>>>> > >>> microarchitecture are only 256 bits wide. 512-bit instructions
>>>>>>>>>>> > >>> using these ALUs must use both ports. See section 2.1 of the
>>>>>>>>>>> > >>> Intel® 64 and IA-32 Architectures Optimization Reference
>>>>>>>>>>> > >>> Manual published October 2017.
>>>>>>>>>>> > >>>
>>>>>>>>>>> > >>
>>>>>>>>>>> > >>
>>>>>>>>>>> > >>> Implementation Plan:
>>>>>>>>>>> > >>>
>>>>>>>>>>> > >>> -Add prefer-avx256 and prefer-avx128 as SubtargetFeatures in
>>>>>>>>>>> > >>> X86.td, not mapped to any CPU.
>>>>>>>>>>> > >>>
>>>>>>>>>>> > >>> -Add mprefer-avx256 and mprefer-avx128 and the corresponding
>>>>>>>>>>> > >>> -mno-prefer-avx128/256 options to clang's driver Options.td
>>>>>>>>>>> > >>> file. I believe this will allow clang to pass these straight
>>>>>>>>>>> > >>> through to the -target-feature attribute in IR.
>>>>>>>>>>> > >>>
>>>>>>>>>>> > >>> -Modify X86TTIImpl::getRegisterBitWidth to only return 512 if
>>>>>>>>>>> > >>> AVX512 is enabled and prefer-avx256 and prefer-avx128 are not
>>>>>>>>>>> > >>> set. Similarly, return 256 if AVX is enabled and
>>>>>>>>>>> > >>> prefer-avx128 is not set.
>>>>>>>>>>> > >>>
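A minimal sketch of the getRegisterBitWidth behavior described in that last bullet (plain booleans standing in for subtarget feature queries; not the actual X86TTIImpl code):

```cpp
// prefer-avx128 caps the reported width at 128, prefer-avx256 caps it at
// 256; otherwise the widest enabled feature wins. 128 is the SSE baseline.
unsigned getRegisterBitWidth(bool HasAVX512, bool HasAVX,
                             bool PreferAVX256, bool PreferAVX128) {
  if (HasAVX512 && !PreferAVX256 && !PreferAVX128)
    return 512;
  if (HasAVX && !PreferAVX128)
    return 256;
  return 128;
}
```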
>>>>>>>>>>> > >>
>>>>>>>>>>> > >> Instead of multiple flags that have difficult-to-understand
>>>>>>>>>>> > >> intersecting behavior, one flag with a value would be better.
>>>>>>>>>>> > >> E.g., what should "-mprefer-avx256 -mprefer-avx128
>>>>>>>>>>> > >> -mno-prefer-avx256" do? No matter the answer, it's confusing.
>>>>>>>>>>> > >> (Similarly with other such combinations.) Just a single arg
>>>>>>>>>>> > >> "-mprefer-avx={128/256/512}" (with no "no" version) seems
>>>>>>>>>>> > >> easier to understand to me (keeping the same behavior as you
>>>>>>>>>>> > >> mention: asking to prefer a larger width than is supported by
>>>>>>>>>>> > >> your architecture should be fine but ignored).
>>>>>>>>>>> > >>
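The single-valued form also makes the ordering semantics trivial: the last occurrence wins, as in this sketch of the parsing (illustrative helper only, not clang's actual driver code):

```cpp
#include <optional>
#include <string>
#include <vector>

// Last-occurrence-wins parsing for a single valued flag. With independent
// -mprefer-avx128 / -mno-prefer-avx256 / ... booleans, the combined state
// under mixed orderings is much harder to specify.
std::optional<unsigned> parsePreferAvx(const std::vector<std::string> &Args) {
  const std::string Prefix = "-mprefer-avx=";
  std::optional<unsigned> Width;
  for (const std::string &A : Args)
    if (A.rfind(Prefix, 0) == 0) // starts-with check
      Width = static_cast<unsigned>(std::stoul(A.substr(Prefix.size())));
  return Width;
}
```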
>>>>>>>>>>> > >>
>>>>>>>>>>> > > I agree with this. It's a little more plumbing as far as
>>>>>>>>>>> > > subtarget features etc. (represent it via an optional value, or
>>>>>>>>>>> > > just various "set the avx width" features, the latter being
>>>>>>>>>>> > > easier but uglier); however, it's probably the right thing to
>>>>>>>>>>> > > do.
>>>>>>>>>>> > >
>>>>>>>>>>> > > I was looking at this myself just a couple of weeks ago and
>>>>>>>>>>> > > think this is the right direction (when and how to turn things
>>>>>>>>>>> > > off), and it probably makes sense to be the default for these
>>>>>>>>>>> > > architectures. We might end up needing to check a couple of
>>>>>>>>>>> > > additional TTI places, but it sounds like you're on top of
>>>>>>>>>>> > > it. :)
>>>>>>>>>>> > >
>>>>>>>>>>> > > Thanks very much for doing this work.
>>>>>>>>>>> > >
>>>>>>>>>>> > > -eric
>>>>>>>>>>> > >
>>>>>>>>>>> > >
>>>>>>>>>>> > >>
>>>>>>>>>>> > >>
>>>>>>>>>>> > >>> There may be some other backend changes needed, but I plan to
>>>>>>>>>>> > >>> address those as we find them.
>>>>>>>>>>> > >>>
>>>>>>>>>>> > >>>
>>>>>>>>>>> > >>> At a later point, consider making -mprefer-avx256 the default
>>>>>>>>>>> > >>> for Skylake Server due to the above-mentioned performance
>>>>>>>>>>> > >>> considerations.
>>>>>>>>>>> > >>>
>>>>>>>>>>> > >>
>>>>>>>>>>> > >>
>>>>>>>>>>> > >>
>>>>>>>>>>> > >>
>>>>>>>>>>> > >>
>>>>>>>>>>> > >>>
>>>>>>>>>>> > >>> Does this sound reasonable?
>>>>>>>>>>> > >>>
>>>>>>>>>>> > >>>
>>>>>>>>>>> > >>>
>>>>>>>>>>> > >>> *Latest Intel Optimization Manual available here:
>>>>>>>>>>> > >>> https://software.intel.com/en-us/articles/intel-sdm#optimization
>>>>>>>>>>> > >>>
>>>>>>>>>>> > >>>
>>>>>>>>>>> > >>> -Craig Topper
>>>>>>>>>>> > >>>
>>>>>>>>>>> > >>> _______________________________________________
>>>>>>>>>>> > >>> LLVM Developers mailing list
>>>>>>>>>>> > >>> llvm-dev at lists.llvm.org
>>>>>>>>>>> > >>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>>>>>>>> > >>>
>>>>>>>>>>> > >>
>>>>>>>>>>> > >
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Hal Finkel
>>>>> Lead, Compiler Technology and Programming Languages
>>>>> Leadership Computing Facility
>>>>> Argonne National Laboratory
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>
>>
>>
>

