[llvm-dev] New x86-64 micro-architecture levels

Fri Jul 10 10:30:09 PDT 2020

Most Linux distributions still compile against the original x86-64
baseline that was based on the AMD K8 (minus the 3DNow! parts, for Intel
EM64T compatibility).

There has been an attempt to use the existing AT_PLATFORM-based loading
mechanism in the glibc dynamic linker to enable a selection of optimized
libraries.  But the general selection mechanism in glibc is problematic:

  hwcaps subdirectory selection in the dynamic loader
  <https://sourceware.org/pipermail/libc-alpha/2020-May/113757.html>

We also have the problem that the glibc definition of "haswell" is
distinct from GCC's -march=haswell (and presumably from that of other
compilers):

  Definition of "haswell" platform is inconsistent with GCC
  <https://sourceware.org/bugzilla/show_bug.cgi?id=24080>

And that the selection criteria are not what people expect:

  Epyc and other current AMD CPUs do not select the "haswell" platform
  subdirectory
  <https://sourceware.org/bugzilla/show_bug.cgi?id=23249>

Since the hwcaps-based selection does not work well regardless of
architecture (even in cases where the kernel provides glibc with data),
I worked on a new mechanism that does not have the problems associated
with the old mechanism:

  [PATCH 00/30] RFC: elf: glibc-hwcaps support
  <https://sourceware.org/pipermail/libc-alpha/2020-June/115250.html>

(Don't be concerned that these patches have not been reviewed; we are
busy preparing the glibc 2.32 release, and these changes do not alter
the glibc ABI itself, so they do not have immediate priority.  I'm
fairly confident that a version of these changes will make it into glibc
2.33, and I hope to backport them into Fedora 33, Fedora 32, and Red Hat
Enterprise Linux 8.4.  Debian as well, but I have never done anything
like it there, so I don't know if the patches will be accepted.)

Out of the box, this should work fairly well for IBM POWER and Z, where
there is a clear progression of silicon versions (at least on paper;
virtualization may blur the picture somewhat).

However, for x86, we do not have such a clear progression of
micro-architecture versions.  This is not just a result of the
AMD/Intel competition, but also due to ongoing product differentiation
within one chip vendor.  I think we need these levels broadly for the
following reasons:

* Selecting on individual CPU features (similar to the old hwcaps
  mechanism) in glibc has scalability issues, particularly for
  LD_LIBRARY_PATH processing.

* Developers need guidance about useful targets for optimization.  I
  think there is value in limiting the choices, in the sense that “if
  you are able to test three builds in total, these are the things you
  should build”.

* glibc and the compilers should align in their definition of the
  levels, so that developers can use an -march= option to build for a
  particular level that is recognized by glibc.  This is why I think the
  description of the levels should go into the psABI supplement.

* A preference order for these levels avoids falling back to the K8
  baseline if the platform progresses to a new version due to
  glibc/kernel/hypervisor/hardware upgrades.

I'm including a proposal for the levels below.  I use single letters for
them, but I expect that the concrete implementation of this proposal
will use names like “x86-100”, “x86-101”, like in the glibc patch
referenced above.  (But we can discuss other approaches.)

I looked at various machines in the Red Hat labs and talked to Intel and
AMD engineers about this, but this concrete proposal is based on my own
analysis of the situation.  I excluded CPU features related to
cryptography and cache management, including hardware transactional
memory, and CPU timing.  I assume that we will see some of these
features being disabled by the firmware or the kernel over time.  That
would eliminate entire levels from selection, which is not desirable.
For cryptographic code, I expect that localized selection of an
optimized implementation works because such code tends to be isolated
blocks, running for dozens of cycles each time, not something that gets
scattered all over the place by the compiler.

We previously discussed not emitting VZEROUPPER at later levels, but I
don't think this is beneficial because the ABI does not have
callee-saved vector registers, so it can only be useful with local
functions (or whatever LTO considers local), where there is no ABI
impact anyway.

I did not include FSGSBASE because the FS base is already available at
%fs:0.  Changing the FS base in userspace breaks too much, so the main
benefit is the tighter encoding of rdfsbase, which seems very slim.

Tuning decisions are not covered by this proposal.  I think we can
benefit from some variance in this area between implementations; it
should not affect correctness.  32-bit support is also a separate
matter.

* Level A

CMPXCHG16B, LAHF/SAHF, POPCNT, SSE3, SSE4.1, SSE4.2, SSSE3

This is one step above the K8 baseline and corresponds to a mainline CPU
model ca. 2008 to 2011.  It is also implemented by recent-ish
generations of Intel Atom server CPUs (although I haven't tested the
latest version).  A 32-bit variant would have to list many additional
CPU features here.

* Level B

AVX, plus everything in level A.

This step is so small that it can probably be dropped, unless the
benefits from using VEX encoding are truly significant.

For AVX and some of the following features, it is assumed that the
run-time selection takes full support coverage (from silicon to the
kernel) into account.
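
As an illustration of what this means for AVX: the CPU has to advertise
the feature, and the kernel has to have enabled saving of the YMM
register state.  The following is a minimal sketch in C, not code from
the glibc patches.  It assumes the caller has already obtained ECX from
CPUID leaf 1 and the XCR0 value (for example via XGETBV); the function
and macro names are made up for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* CPUID leaf 1, ECX: bit 27 = OSXSAVE, bit 28 = AVX. */
#define CPUID1_ECX_OSXSAVE (UINT32_C(1) << 27)
#define CPUID1_ECX_AVX     (UINT32_C(1) << 28)

/* XCR0: bit 0 = x87 state, bit 1 = SSE state, bit 2 = AVX (YMM) state. */
#define XCR0_X87 (UINT64_C(1) << 0)
#define XCR0_SSE (UINT64_C(1) << 1)
#define XCR0_AVX (UINT64_C(1) << 2)

/* Hypothetical helper: returns true only if AVX is supported by the
   CPU *and* enabled by the operating system. */
bool avx_usable(uint32_t cpuid1_ecx, uint64_t xcr0)
{
  /* XCR0 bit 0 is architecturally always set, so a value without it
     cannot be a real XGETBV result. */
  assert(xcr0 & XCR0_X87);

  /* Silicon (or hypervisor) does not advertise AVX. */
  if (!(cpuid1_ecx & CPUID1_ECX_AVX))
    return false;

  /* The OS has not enabled XSAVE, so XCR0 is not meaningful. */
  if (!(cpuid1_ecx & CPUID1_ECX_OSXSAVE))
    return false;

  /* The OS must save and restore both the SSE and the AVX state. */
  const uint64_t need = XCR0_SSE | XCR0_AVX;
  return (xcr0 & need) == need;
}
```

A hypervisor that masks AVX would show up here as a cleared CPUID bit,
and a kernel that does not manage the YMM state as a cleared XCR0 bit;
either one makes the level unavailable.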

* Level C

AVX2, BMI1, BMI2, F16C, FMA, LZCNT, MOVBE, plus everything in level B.

This is close to what glibc currently calls "haswell".

* Level D

AVX512F, AVX512BW, AVX512CD, AVX512DQ, AVX512VL, plus everything in
level C.

This is the AVX-512 level implemented by Xeon Scalable Processors, not
the Xeon Phi variant.
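
To make the cumulative structure of the levels concrete, here is a
minimal sketch in C.  It is only an illustration: the feature bits are
hypothetical values chosen for this example (not CPUID bit positions),
and the function names are not taken from glibc or any compiler.

```c
#include <assert.h>
#include <stdint.h>

/* Illustration-only feature bits, one per feature named in the
   level definitions.  The values are hypothetical. */
enum {
  F_CMPXCHG16B = 1 << 0,  F_LAHF_SAHF = 1 << 1,  F_POPCNT   = 1 << 2,
  F_SSE3       = 1 << 3,  F_SSE4_1    = 1 << 4,  F_SSE4_2   = 1 << 5,
  F_SSSE3      = 1 << 6,  F_AVX       = 1 << 7,  F_AVX2     = 1 << 8,
  F_BMI1       = 1 << 9,  F_BMI2      = 1 << 10, F_F16C     = 1 << 11,
  F_FMA        = 1 << 12, F_LZCNT     = 1 << 13, F_MOVBE    = 1 << 14,
  F_AVX512F    = 1 << 15, F_AVX512BW  = 1 << 16, F_AVX512CD = 1 << 17,
  F_AVX512DQ   = 1 << 18, F_AVX512VL  = 1 << 19
};

/* Features that each level adds on top of the previous one. */
#define ADDED_BY_A (F_CMPXCHG16B | F_LAHF_SAHF | F_POPCNT | F_SSE3 \
                    | F_SSE4_1 | F_SSE4_2 | F_SSSE3)
#define ADDED_BY_B (F_AVX)
#define ADDED_BY_C (F_AVX2 | F_BMI1 | F_BMI2 | F_F16C | F_FMA \
                    | F_LZCNT | F_MOVBE)
#define ADDED_BY_D (F_AVX512F | F_AVX512BW | F_AVX512CD | F_AVX512DQ \
                    | F_AVX512VL)

/* Full requirement mask for a level: its own features plus those of
   every lower level. */
uint32_t level_mask(char level)
{
  assert(level >= 'A' && level <= 'D');
  uint32_t mask = ADDED_BY_A;
  if (level >= 'B') mask |= ADDED_BY_B;
  if (level >= 'C') mask |= ADDED_BY_C;
  if (level >= 'D') mask |= ADDED_BY_D;
  return mask;
}

/* Highest level whose requirements are all present in `features`,
   or 0 if only the K8 baseline can be assumed. */
char highest_level(uint32_t features)
{
  for (char level = 'D'; level >= 'A'; --level) {
    uint32_t need = level_mask(level);
    if ((features & need) == need)
      return level;
  }
  return 0;
}
```

Because each level includes the one below it, a single missing feature
caps the result at the level just below the one that requires it: a CPU
with every level D feature except AVX is still only a level A CPU.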

glibc (or an alternative loader implementation) would search for
libraries starting at level D, going back to level A, and finally the
baseline implementation in the default library location.
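
The search order could be sketched as follows in C.  This is an
assumption-laden illustration, not the loader code from the patch
series: the "glibc-hwcaps/level-X" directory layout and the function
name are hypothetical, and the sketch starts at the highest level the
running system supports (0 meaning baseline only) rather than
unconditionally at level D.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define MAX_CANDIDATES 5   /* levels D, C, B, A plus the default */
#define PATH_LEN 256

/* Fill `out` with the directories to probe, most specific first, and
   return how many entries were written.  `highest` is 'A'..'D', or 0
   if the system only supports the baseline. */
size_t build_search_order(const char *libdir, char highest,
                          char out[MAX_CANDIDATES][PATH_LEN])
{
  assert(highest == 0 || (highest >= 'A' && highest <= 'D'));
  /* Make sure the longest candidate path fits into one entry. */
  assert(strlen(libdir) + sizeof "/glibc-hwcaps/level-X" <= PATH_LEN);

  size_t n = 0;
  /* Walk down from the highest supported level to level A.  The loop
     body is skipped entirely when `highest` is 0. */
  for (char level = highest; level >= 'A'; --level)
    snprintf(out[n++], PATH_LEN, "%s/glibc-hwcaps/level-%c",
             libdir, level);

  /* The baseline build in the default library location comes last. */
  snprintf(out[n++], PATH_LEN, "%s", libdir);
  return n;
}
```

For a level C system and a library directory of /usr/lib64, this yields
the level-C, level-B and level-A subdirectories followed by /usr/lib64
itself; the loader would use the first location that contains the
requested library.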

I expect that some distributions will also use these levels to set a
baseline for the entire distribution (i.e., everything would be built to
level A or maybe even level C), and these libraries would then be
installed in the default location.

I'll be glad if I can get any feedback on this proposal.  I plan to turn
it into a merge request for the x86-64 psABI document eventually.

Thanks,
Florian