[cfe-dev] The intrinsics headers (especially avx512) are too big. What to do about it?

Thu May 12 16:32:55 PDT 2016

> -----Original Message-----
> From: cfe-dev [mailto:cfe-dev-bounces at lists.llvm.org] On Behalf Of Hal
> Finkel via cfe-dev
> Sent: Thursday, May 12, 2016 4:14 PM
> To: Reid Kleckner
> Cc: asaf badouh; David Majnemer; Michael zuckerman; cfe-dev; Elena
> Demikhovsky
> Subject: Re: [cfe-dev] The intrinsics headers (especially avx512) are too
> big. What to do about it?
> 
> ----- Original Message -----
> > From: "Reid Kleckner via cfe-dev" <cfe-dev at lists.llvm.org>
> > To: "Nico Weber" <thakis at chromium.org>, "David Majnemer"
> <majnemer at google.com>
> > Cc: "Elena Demikhovsky" <elena.demikhovsky at intel.com>, "cfe-dev" <cfe-
> dev at lists.llvm.org>, "asaf badouh"
> > <asaf.badouh at intel.com>, "Michael zuckerman"
> <Michael.zuckerman at intel.com>
> > Sent: Thursday, May 12, 2016 6:10:06 PM
> > Subject: Re: [cfe-dev] The intrinsics headers (especially avx512) are
> too big. What to do about it?
> >
> > I think our approach to the mmintrin headers doesn't scale. We're
> > creating the windows.h of intel intrinsics in immintrin.h.
> >
> > When they were first created, a large percentage of the intrinsics
> > were mapping from hyper-specific instruction names to generic vector
> > math operations like this:
> >
> > static __inline__ __m128 __DEFAULT_FN_ATTRS
> > _mm_add_ps(__m128 __a, __m128 __b) { return __a + __b; }
> >
> > This made a lot of sense at the time, because we could just write
> > some C and not worry about teaching clang and LLVM about every Intel
> > intrinsic under the sun.
> >
> > From looking at the avx512 headers, it seems this is no longer the
> > case. Now we are mostly mapping each _mm_* intrinsic to a
> > __builtin_ia32_* function.
> >
> > If this continues to be the case going forward,
> 
> Indeed. It is not clear to me, however, that this situation is desirable.
> We had a general policy that our intrinsics headers should generate
> generic IR whenever possible, and if we've strayed from that, we should
> discuss that first.

If you look at the history of some of the headers, they used to map the
intrinsic function names to builtins.  As codegen got smarter over time,
many of these were converted to generic C, and the builtins could go away.
A *lot* of the intrinsics didn't start out as generic C.

(I personally spent probably months of my life merging the evolution of
the intrinsics and builtins and tablegen instruction definitions for a
variety of X86 instruction subsets into our local tree.)
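
To make the "generic C" style concrete, here is a minimal sketch using the
GCC/Clang vector_size extension. The names sketch_v4sf, sketch_add_ps and
sketch_lane are hypothetical stand-ins, not taken from clang's headers; the
point is only that the body is plain vector arithmetic rather than a call
to a target-specific builtin.

```c
/* 128-bit vector of four floats via the GCC/Clang vector_size extension.
   sketch_v4sf is a hypothetical stand-in for a type like __m128. */
typedef float sketch_v4sf __attribute__((vector_size(16)));

/* Generic-C style: ordinary vector arithmetic, with no target-specific
   builtin in the body. A compiler such as clang can lower this to a
   generic vector add. */
static inline sketch_v4sf sketch_add_ps(sketch_v4sf a, sketch_v4sf b) {
    return a + b;
}

/* Reads one lane; vector types from this extension can be subscripted. */
static inline float sketch_lane(sketch_v4sf v, int i) {
    return v[i];
}
```

An intrinsic written this way needs no matching builtin or instruction
pattern in the compiler, which is the property under discussion here.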

> 
>  -Hal
> 
> > then I think we should make the _mm* intrinsics into compiler
> > builtins like the __builtin_ia32 functions. It also avoids the need
> > for those ugly forwarding macros for intrinsics that take arguments
> > that must be constant.

If you do that, then there is less motivation to make codegen smarter.
--paulr

> >
> > The _mm_* builtins should only be available if the user includes
> > <immintrin.h>. We can replace the contents of that file with a pragma
> > that just says "enable all intel intrinsics".
> >
> > On Thu, May 12, 2016 at 9:16 AM, Nico Weber via cfe-dev
> > <cfe-dev at lists.llvm.org> wrote:
> > > Hi,
> > >
> > > > on Windows, C++ system headers such as <string> end up pulling in
> > > > intrin.h. clang's intrinsic headers are very large.
> > > >
> > > > If you take a cc file containing just `#include <string>` and run
> > > > that through the preprocessor with `cl /P test.cc` and
> > > > `clang-cl /P test.cc`, the test.I file generated by clang-cl is
> > > > 1.7MB while the one created by cl.exe is 0.7MB. This is solely due
> > > > to clang's intrin.h expanding to way more stuff.
> > >
> > > > The biggest offenders are avx512vlintrin.h, avx512fintrin.h, and
> > > > avx512vlbwintrin.h, which add up to 657kB already. Before r239883,
> > > > we only included avx headers if __AVX512F__ etc. were defined.
> > > > This is currently never the case in practice. Later (r243394,
> > > > r243402, r243406 and more), the avx headers got much bigger.
> > >
> > > > Parsing all this code takes time -- removing the avx512 includes
> > > > from immintrin.h locally makes compiling a file containing just
> > > > the <string> header 0.25s faster (!), and building all of v8 gets
> > > > 6% faster, just from not including the avx512 headers.
> > >
> > > > What can we do about this? Since avx512 is new, maybe these
> > > > headers could be left out of immintrin.h? Or we could re-introduce
> > >
> > >   #if !__has_feature(modules) && defined(__AVX512BW__)
> > >
> > > > include guards in immintrin.h. This would give us a speed win
> > > > immediately without drawbacks as far as I can see, but in a few
> > > > years when people start compiling with /arch:avx512 that'd go away
> > > > again. (Then again, by then, modules are hopefully commonly
> > > > available. cl.exe doesn't have an /arch:avx512 switch yet, so this
> > > > is probably several years away from happening.)
> > >
> > > > Comments? Is it feasible to require that people who want to use
> > > > avx512 include a new header instead of immintrin.h? Else, does
> > > > anyone have a better idea other than reintroducing the #ifdefs,
> > > > augmented with the module check?
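
A minimal sketch of the guard idea described above, not the actual contents
of immintrin.h. The condition is the one quoted in the message; the
SKETCH_AVX512BW_INCLUDED macro and the sketch_avx512bw_included() helper are
hypothetical and exist only so the chosen branch can be observed. The
__has_feature fallback is an assumption for compilers that do not provide
that macro.

```c
/* Portable fallback: compilers without __has_feature report no features. */
#ifndef __has_feature
#define __has_feature(x) 0
#endif

/* Guard condition as quoted in the message above. */
#if !__has_feature(modules) && defined(__AVX512BW__)
/* A real header would pull in the large AVX-512 header at this point. */
#define SKETCH_AVX512BW_INCLUDED 1
#else
/* Guard not satisfied: the expensive header is never parsed. */
#define SKETCH_AVX512BW_INCLUDED 0
#endif

/* Lets ordinary code observe which branch the preprocessor took. */
static int sketch_avx512bw_included(void) {
    return SKETCH_AVX512BW_INCLUDED;
}
```

When the condition is false, the guarded text is skipped by the
preprocessor, so its declarations are never parsed; that is where the
compile-time saving would come from.
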
> > >
> > > Thanks,
> > > Nico
> > >
> > > _______________________________________________
> > > cfe-dev mailing list
> > > cfe-dev at lists.llvm.org
> > > http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
> > >
> >
> 
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory