[llvm-dev] RFC: [X86] Introducing command line options to prefer narrower vector instructions even when wider instructions are available

Hal Finkel via llvm-dev llvm-dev at lists.llvm.org
Mon Nov 13 15:54:36 PST 2017



On 11/13/2017 05:49 PM, Eric Christopher wrote:
>
>
> On Mon, Nov 13, 2017 at 2:15 PM Craig Topper via llvm-dev 
> <llvm-dev at lists.llvm.org> wrote:
>
>     On Sat, Nov 11, 2017 at 8:52 PM, Hal Finkel via llvm-dev
>     <llvm-dev at lists.llvm.org> wrote:
>
>
>         On 11/11/2017 09:52 PM, UE US via llvm-dev wrote:
>>         If skylake is that bad at AVX2
>
>         I don't think this says anything negative about AVX2, but rather about AVX-512.
>
>
> Right. I think we're at AVX/AVX2 is "bad" on Haswell/Broadwell and 
> AVX512 is "bad" on Skylake. At least in the "random autovectorization 
> spread out" aspect.
>
>
>
>>         it belongs in -mcpu / -march IMO.
>
>         No. We'd still want to enable the architectural features for
>         vector intrinsics and the like.
>
>
>     I took this to mean that the feature should be enabled by default
>     for -march=skylake-avx512.
>
>
>
> Agreed.

Yes. Also, GNOMETOYS clarified to me (off-list) that this is what he meant.

  -Hal

>
> -eric
>
>
>
>
>>         Based on the current performance data we're seeing, we think
>>         we need to ultimately default skylake-avx512 to
>>         -mprefer-vector-width=256.
>
>         Craig, is this for both integer and floating-point code?
>
>
>     I believe so, but I'll try to get confirmation from the people
>     with more data.
>
>
>
>          -Hal
>
>>            Most people will build for the standard x86_64-pc-linux or
>>         whatever anyway,  and completely ignore the change. This will
>>         mainly affect those who build their own software and optimize
>>         for their system, and lots there have probably caught on to
>>         this already.  I always thought that's what -march was made
>>         for, really.
>>
>>         GNOMETOYS
>>
>>         On Sat, Nov 11, 2017 at 10:25 AM, Sanjay Patel via llvm-dev
>>         <llvm-dev at lists.llvm.org> wrote:
>>
>>             Yes - I was thinking of FeatureFastScalarFSQRT /
>>             FeatureFastVectorFSQRT which are used by isFsqrtCheap().
>>             These were added to override the default x86 sqrt
>>             estimate codegen with:
>>             https://reviews.llvm.org/D21379
>>
>>             But I'm not sure we really need that kind of hack. Can we
>>             adjust the attribute in clang based on the target cpu?
>>             Ie, if you have something like:
>>             $ clang -O2 -march=skylake-avx512 foo.c
>>
>>             Then you can detect that in the clang driver and pass
>>             -mprefer-vector-width=256 to clang codegen as an option?
>>             Clang codegen then adds that function attribute to
>>             everything it outputs. Then, the vectorizers and/or
>>             backend detect that attribute and adjust their behavior
>>             based on it.
>>
>
>     Do we have a precedent for setting a target independent flag from
>     a target specific cpu string in the clang driver? Want to make
>     sure I understand what the processing on such a thing would look
>     like. Particularly to get the order right so the user can override it.
>
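
One way to picture the ordering question above: compute a default from
the CPU string, then let an explicit user option replace it. This is
only a sketch under stated assumptions, not actual clang driver code.
The function name and its parameters are hypothetical, and the only
CPU default shown is the one proposed in this thread.

```cpp
#include <optional>
#include <string>

// Hypothetical driver-side helper, not actual clang code.
// UserWidth is the value of an explicit -mprefer-vector-width=N, if any.
std::optional<unsigned>
preferVectorWidthSketch(const std::string &CPU,
                        std::optional<unsigned> UserWidth) {
  // An explicit user option always overrides the CPU-based default.
  if (UserWidth)
    return UserWidth;
  // Default proposed in this thread for skylake-avx512.
  if (CPU == "skylake-avx512")
    return 256u;
  // Assumption: no preference is attached for any other CPU.
  return std::nullopt;
}
```

With this shape, "-march=skylake-avx512" alone yields 256, while adding
an explicit width replaces it regardless of where the default came from.
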
>>
>>             So I don't think we should be messing with any kind of
>>             type legality checking because that stuff should all be
>>             correct already. We're just choosing a vector size based
>>             on a pref. I think we should even allow the pref to go
>>             bigger than a legal type. This came up somewhere on
>>             llvm-dev or in a bug recently in the context of vector
>>             reductions.
>>
>>
>>
>>             On Fri, Nov 10, 2017 at 6:04 PM, Craig Topper
>>             <craig.topper at gmail.com>
>>             wrote:
>>
>>                 Are you referring to
>>                 the X86TargetLowering::isFsqrtCheap hook?
>>
>>                 ~Craig
>>
>>                 On Fri, Nov 10, 2017 at 7:39 AM, Sanjay Patel
>>                 <spatel at rotateright.com> wrote:
>>
>>                     We can tie a user preference / override to a CPU
>>                     model. We do something like that for square root
>>                     estimates already (although it does use a
>>                     SubtargetFeature currently for x86; ideally, we'd
>>                     key that off of something in the CPU scheduler
>>                     model).
>>
>>
>>                     On Thu, Nov 9, 2017 at 4:21 PM, Craig Topper
>>                     <craig.topper at gmail.com> wrote:
>>
>>                         I agree that a less x86 specific command line
>>                         makes sense. I've been having internal
>>                         discussions with gcc folks and they're
>>                         evaluating switching to something like
>>                         -mprefer-vector-width=128/256/512/none
>>
>>                         Based on the current performance data we're
>>                         seeing, we think we need to ultimately
>>                         default skylake-avx512 to
>>                         -mprefer-vector-width=256. If we go with a
>>                         target independent option/implementation, is
>>                         there some way we could still affect the
>>                         default behavior in a target specific way?
>>
>>                         ~Craig
>>
>>                         On Tue, Nov 7, 2017 at 9:06 AM, Sanjay Patel
>>                         <spatel at rotateright.com> wrote:
>>
>>                             It's clear from the Intel docs how this
>>                             has evolved, but from a compiler
>>                             perspective, this isn't a Skylake
>>                             "feature" :) ... nor an Intel feature,
>>                             nor an x86 feature.
>>
>>                             It's a generic programmer hint for any
>>                             target with multiple potential vector
>>                             lengths.
>>
>>                             On x86, there's already a potential use
>>                             case for this hint with a different
>>                             starting motivation: re-vectorization.
>>                             That's where we take C code that uses
>>                             128-bit vector intrinsics and selectively
>>                             widen it to 256- or 512-bit vector ops
>>                             based on a newer CPU target than the code
>>                             was originally written for.
>>
>>                             I think it's just a matter of time before
>>                             a customer requests the same ability for
>>                             another target (maybe they already have
>>                             and I don't know about it). So we should
>>                             have a solution that recognizes that
>>                             possibility.
>>
>>                             Note that having a target-independent
>>                             implementation in the optimizer doesn't
>>                             preclude a flag alias in clang to
>>                             maintain compatibility with gcc.
>>
>>
>>
>>                             On Tue, Nov 7, 2017 at 2:02 AM, Tobias
>>                             Grosser via llvm-dev
>>                             <llvm-dev at lists.llvm.org> wrote:
>>
>>                                 On Fri, Nov 3, 2017, at 05:47, Craig
>>                                 Topper via llvm-dev wrote:
>>                                 > That's a very good point about the
>>                                 ordering of the command line options.
>>                                 > gcc's current implementation treats
>>                                 -mprefer-avx256 as "prefer 256 over
>>                                 > 512" and -mprefer-avx128 as "prefer
>>                                 128 over 256". Which feels weird for
>>                                 > other reasons, but has less of an
>>                                 ordering ambiguity.
>>                                 >
>>                                 > -mprefer-avx128 has been in gcc for
>>                                 many years and predates the creation
>>                                 > of
>>                                 > avx512. -mprefer-avx256 was added a
>>                                 couple months ago.
>>                                 >
>>                                 > We've had an internal conversation
>>                                 with the implementor of
>>                                 > -mprefer-avx256
>>                                 > in gcc about making -mprefer-avx128
>>                                 affect 512-bit vectors as well. I'll
>>                                 > bring up the ambiguity issue with them.
>>                                 >
>>                                 > Do we want to be compatible with
>>                                 gcc here?
>>
>>                                 I certainly believe we would want to
>>                                 be compatible with gcc (if we use
>>                                 the same names).
>>
>>                                 Best,
>>                                 Tobias
>>
>>                                 >
>>                                 > ~Craig
>>                                 >
>>                                 > On Thu, Nov 2, 2017 at 7:18 PM,
>>                                 Eric Christopher <echristo at gmail.com>
>>                                 > wrote:
>>                                 >
>>                                 > >
>>                                 > >
>>                                 > > On Thu, Nov 2, 2017 at 7:05 PM
>>                                 James Y Knight via llvm-dev <
>>                                 > > llvm-dev at lists.llvm.org> wrote:
>>                                 > >
>>                                 > >> On Wed, Nov 1, 2017 at 7:35 PM,
>>                                 Craig Topper via llvm-dev <
>>                                 > >> llvm-dev at lists.llvm.org> wrote:
>>                                 > >>
>>                                 > >>> Hello all,
>>                                 > >>>
>>                                 > >>>
>>                                 > >>>
>>                                 > >>> I would like to propose adding
>>                                 the -mprefer-avx256 and -mprefer-avx128
>>                                 > >>> command line flags supported by
>>                                 latest GCC to clang. These flags will be
>>                                 > >>> used to limit the vector
>>                                 register size presented by TTI to the
>>                                 vectorizers.
>>                                 > >>> The backend will still be able
>>                                 to use wider registers for code written
>>                                 > >>> using the intrinsics in
>>                                 x86intrin.h. And the backend will
>>                                 still be able to
>>                                 > >>> use AVX512VL instructions and
>>                                 the additional XMM16-31 and YMM16-31
>>                                 > >>> registers.
>>                                 > >>>
>>                                 > >>>
>>                                 > >>>
>>                                 > >>> Motivation:
>>                                 > >>>
>>                                 > >>> -Using 512-bit operations on
>>                                 some Intel CPUs may cause a decrease
>>                                 in CPU
>>                                 > >>> frequency that may offset the
>>                                 gains from using the wider register
>>                                 size. See
>>                                 > >>> section 15.26 of Intel® 64 and
>>                                 IA-32 Architectures Optimization
>>                                 Reference
>>                                 > >>> Manual published October 2017.
>>                                 > >>>
>>                                 > >>
>>                                 > >> I note the doc mentions that
>>                                 256-bit AVX operations also have the same
>>                                 > >> issue with reducing the CPU
>>                                 frequency, which is nice to see
>>                                 documented!
>>                                 > >>
>>                                 > >> There are also the issues discussed here
>>                                 > >> <http://www.agner.org/optimize/blog/read.php?i=165>
>>                                 (and elsewhere) related to warm-up time
>>                                 > >> for the 256-bit execution
>>                                 pipeline, which is another issue with
>>                                 using
>>                                 > >> wide-vector ops.
>>                                 > >>
>>                                 > >>
>>                                 > >> -The vector ALUs on ports 0 and
>>                                 1 of the Skylake Server microarchitecture
>>                                 > >>> are only 256-bits wide. 512-bit
>>                                 instructions using these ALUs must
>>                                 use both
>>                                 > >>> ports. See section 2.1 of
>>                                 Intel® 64 and IA-32 Architectures
>>                                 Optimization
>>                                 > >>> Reference Manual published
>>                                 October 2017.
>>                                 > >>>
>>                                 > >>
>>                                 > >>
>>                                 > >>> Implementation Plan:
>>                                 > >>>
>>                                 > >>> -Add prefer-avx256 and
>>                                 prefer-avx128 as SubtargetFeatures in
>>                                 X86.td not
>>                                 > >>> mapped to any CPU.
>>                                 > >>>
>>                                 > >>> -Add mprefer-avx256 and
>>                                 mprefer-avx128 and the corresponding
>>                                 > >>> -mno-prefer-avx128/256 options
>>                                 to clang's driver Options.td file. I
>>                                 believe
>>                                 > >>> this will allow clang to pass
>>                                 these straight through to the
>>                                 -target-feature
>>                                 > >>> attribute in IR.
>>                                 > >>>
>>                                 > >>> -Modify
>>                                 X86TTIImpl::getRegisterBitWidth to
>>                                 only return 512 if AVX512 is
>>                                 > >>> enabled and neither prefer-avx256 nor
>>                                 prefer-avx128 is set. Similarly
>>                                 return
>>                                 > >>> 256 if AVX is enabled and
>>                                 prefer-avx128 is not set.
>>                                 > >>>
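
The width selection in the plan quoted above could look roughly like
the following. This is a minimal standalone sketch, not the real
X86TTIImpl::getRegisterBitWidth. The struct and its field names are
hypothetical stand-ins for the subtarget state, and the 128-bit
fallback is an assumption the plan does not spell out.

```cpp
// Hypothetical stand-in for the subtarget state the plan refers to.
// These field names are illustrative, not the real X86Subtarget API.
struct SubtargetSketch {
  bool HasAVX512 = false;
  bool HasAVX = false;
  bool PreferAVX256 = false;
  bool PreferAVX128 = false;
};

// Vector register width in bits that would be reported to the vectorizers.
unsigned registerBitWidthSketch(const SubtargetSketch &ST) {
  // 512 only if AVX512 is enabled and neither preference is set.
  if (ST.HasAVX512 && !ST.PreferAVX256 && !ST.PreferAVX128)
    return 512;
  // 256 if AVX is enabled and prefer-avx128 is not set.
  if (ST.HasAVX && !ST.PreferAVX128)
    return 256;
  // Assumption: fall back to 128-bit vectors otherwise.
  return 128;
}
```

Only the width reported to the vectorizers changes here; nothing in
the sketch restricts which instructions the backend may select.
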
>>                                 > >>
>>                                 > >> Instead of multiple flags that
>>                                 have difficult to understand intersecting
>>                                 > >> behavior, one flag with a value
>>                                 would be better. E.g., what should
>>                                 > >> "-mprefer-avx256 -mprefer-avx128
>>                                 -mno-prefer-avx256" do? No matter the
>>                                 > >> answer, it's confusing.
>>                                 (Similarly with other such
>>                                 combinations). Just a
>>                                 > >> single arg
>>                                 "-mprefer-avx={128/256/512}" (with no
>>                                 "no" version) seems easier
>>                                 > >> to understand to me (keeping the
>>                                 same behavior as you mention: asking to
>>                                 > >> prefer a larger width than is
>>                                 supported by your architecture should
>>                                 be fine
>>                                 > >> but ignored).
>>                                 > >>
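
For comparison, the single-valued form suggested above reduces to one
number plus a clamp. A rough sketch follows, with a hypothetical
helper name. That the last occurrence on the command line wins is an
assumption on my part, and the suggestion above does not state it.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical helper: Requests holds the value of each
// -mprefer-avx=N occurrence in command-line order; MaxLegalWidth
// is the widest vector width the target supports.
unsigned effectiveWidthSketch(const std::vector<unsigned> &Requests,
                              unsigned MaxLegalWidth) {
  // No preference given: use the widest legal width.
  if (Requests.empty())
    return MaxLegalWidth;
  // Assumption: the last occurrence wins.
  unsigned Preferred = Requests.back();
  // A preference wider than the hardware supports is accepted but ignored.
  return std::min(Preferred, MaxLegalWidth);
}
```

There is no "no" form to reason about: every combination of options
collapses to a single preferred width.
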
>>                                 > >>
>>                                 > > I agree with this. It's a little
>>                                 more plumbing as far as subtarget
>>                                 > > features etc (represent via an
>>                                 optional value or just various "set
>>                                 the avx
>>                                 > > width" features - the latter
>>                                 being easier, but uglier), however, it's
>>                                 > > probably the right thing to do.
>>                                 > >
>>                                 > > I was looking at this myself just
>>                                 a couple weeks ago and think this is the
>>                                 > > right direction (when and how to
>>                                 turn things off) - and probably makes
>>                                 > > sense to be a default for these
>>                                 architectures? We might end up needing to
>>                                 > > check a couple of additional TTI
>>                                 places, but it sounds like you're on top
>>                                 > > of it. :)
>>                                 > >
>>                                 > > Thanks very much for doing this work.
>>                                 > >
>>                                 > > -eric
>>                                 > >
>>                                 > >
>>                                 > >>
>>                                 > >>
>>                                 > >> There may be some other backend
>>                                 changes needed, but I plan to address
>>                                 > >>> those as we find them.
>>                                 > >>>
>>                                 > >>>
>>                                 > >>> At a later point, consider
>>                                 making -mprefer-avx256 the default for
>>                                 > >>> Skylake Server due to the above
>>                                 mentioned performance considerations.
>>                                 > >>>
>>                                 > >>
>>                                 > >>
>>                                 > >>
>>                                 > >>
>>                                 > >>
>>                                 > >>>
>>                                 > >> Does this sound reasonable?
>>                                 > >>>
>>                                 > >>>
>>                                 > >>>
>>                                 > >>> *Latest Intel Optimization
>>                                 manual available here:
>>                                 > >>>
>>                                 https://software.intel.com/en-us/articles/intel-sdm#optimization
>>                                 > >>>
>>                                 > >>>
>>                                 > >>> -Craig Topper
>>                                 > >>>
>>                                 > >>>
>>                                 _______________________________________________
>>                                 > >>> LLVM Developers mailing list
>>                                 > >>> llvm-dev at lists.llvm.org
>>                                 > >>>
>>                                 http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>                                 > >>>
>>                                 > >>>
>>                                 > >>
>>                                 > >
>>                                 >
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>
>         -- 
>         Hal Finkel
>         Lead, Compiler Technology and Programming Languages
>         Leadership Computing Facility
>         Argonne National Laboratory
>
>
>
>

-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory
