[llvm-dev] Proposal to remove MMX support.

Sun Aug 30 16:10:56 PDT 2020

I recently diagnosed a bug in someone else's software, which turned out to
be due to incorrect MMX intrinsics usage: if you use any of the x86
intrinsics that accept or return __m64 values, then you, the *programmer*,
are required to call _mm_empty() before using any x87 floating point
instructions or leaving the function. I was aware that this was required at
the assembly level, but not that the compiler forced users to deal with
this when using intrinsics.

This is a real nasty footgun -- if you get this wrong, your program
doesn't crash -- no, that would be too easy! Instead, every subsequent x87
instruction will simply result in a NaN value.
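
To make the requirement concrete, here is a minimal sketch of the pattern
the intrinsics expect. The helper name sum_lanes_scaled is hypothetical, and
the sketch assumes a target where long double arithmetic is done on the x87
unit (typical for x86 System V). The #else branch is only an assumption to
let the sketch build on non-x86 machines.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__MMX__)
#include <mmintrin.h>   /* __m64, _mm_add_pi16, _mm_empty */
#endif

/* Hypothetical helper (not from any real codebase): adds four 16-bit lanes
   with an MMX intrinsic, then scales the lane total in long double, which
   is assumed here to be x87 arithmetic. */
long double sum_lanes_scaled(const int16_t a[4], const int16_t b[4],
                             long double scale)
{
    int16_t out[4];

    assert(a != NULL && b != NULL);

#if defined(__MMX__)
    __m64 va, vb, vsum;
    memcpy(&va, a, sizeof va);
    memcpy(&vb, b, sizeof vb);
    vsum = _mm_add_pi16(va, vb);      /* accepts and returns __m64 */
    memcpy(out, &vsum, sizeof out);

    /* The programmer's responsibility: clear the MMX state before any x87
       instruction runs and before returning. */
    _mm_empty();
#else
    /* Assumption: scalar fallback so the sketch still builds off x86. */
    for (int i = 0; i < 4; i++)
        out[i] = (int16_t)(a[i] + b[i]);
#endif

    return scale * (long double)(out[0] + out[1] + out[2] + out[3]);
}
```

Dropping the _mm_empty() call is exactly the mistake described above: the
code still compiles and runs, but the long double result may silently come
back as NaN.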

Worse still, it is currently impossible to use _mm_empty() reliably to
resolve the problem, because nothing prevents the compiler from placing
x87 FPU operations between an MMX instruction and the EMMS instruction.

Of course, I didn't discover any of this -- it was already
well-known...just not to me. But let's actually fix it.

*Existing bugs*:
- llvm.org/PR35982 <https://bugs.llvm.org/show_bug.cgi?id=35982> --
  POSTRAScheduler disarrange emms and mmx instruction
- llvm.org/PR41029 <https://bugs.llvm.org/show_bug.cgi?id=41029> --
  The __m64 not passed according to i386 ABI
- llvm.org/PR42319 <https://bugs.llvm.org/show_bug.cgi?id=42319> --
  Add pass to insert EMMS/FEMMS instructions to separate MMX and X87 states
- llvm.org/PR42320 <https://bugs.llvm.org/show_bug.cgi?id=42320> --
  Implement MMX intrinsics with SSE equivalents

*Proposal*
We should re-implement all the currently-MMX intrinsics in Clang's
*mmintrin.h headers by using the existing SSE/SSE2 compiler builtins, on
both x86-32 and x86-64, and then *delete the MMX implementation of these
intrinsics*. We would thus stop supporting the use of these intrinsics
unless SSE2 is also enabled. I've created a preliminary patch for these
header changes: https://reviews.llvm.org/D86855.
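
As a rough illustration of the idea (this is not the code in the patch, and
the function names are hypothetical), an 8-byte vector operation can be
computed in the low half of an xmm register using only SSE2 intrinsics. The
sketch below models the 8-byte vector as a uint64_t and checks the SSE2
path against a lane-by-lane scalar reference. The fallback for targets
without SSE2 is an assumption so that the sketch builds everywhere.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>  /* _mm_loadl_epi64, _mm_add_epi16, _mm_storel_epi64 */
#endif

/* Lane-by-lane reference: four wrapping 16-bit adds packed in 64 bits. */
uint64_t add_4x16_scalar(uint64_t a, uint64_t b)
{
    uint64_t out = 0;
    for (int i = 0; i < 4; i++) {
        uint16_t la = (uint16_t)(a >> (16 * i));
        uint16_t lb = (uint16_t)(b >> (16 * i));
        out |= (uint64_t)(uint16_t)(la + lb) << (16 * i);
    }
    return out;
}

/* Hypothetical stand-in for a 4 x 16-bit vector add done without touching
   any MMX register: load 64 bits into the low half of an xmm register, add
   with the SSE2 instruction, and store the low 64 bits back. */
uint64_t add_4x16_sse2(uint64_t a, uint64_t b)
{
#if defined(__SSE2__)
    __m128i va = _mm_loadl_epi64((const __m128i *)&a);
    __m128i vb = _mm_loadl_epi64((const __m128i *)&b);
    __m128i sum = _mm_add_epi16(va, vb);   /* upper four lanes are unused */
    uint64_t out;
    _mm_storel_epi64((__m128i *)&out, sum);
    return out;
#else
    /* Assumption: no SSE2 available, fall back to the scalar reference. */
    return add_4x16_scalar(a, b);
#endif
}

/* Small self-check: both paths must agree, including on lane wrap-around. */
void check_add_4x16(void)
{
    uint64_t a = 0x0001FFFF00030004ULL;
    uint64_t b = 0x0001000100010001ULL;
    assert(add_4x16_sse2(a, b) == add_4x16_scalar(a, b));
}
```

Because no MMX state is ever touched on the SSE2 path, there is nothing for
_mm_empty() to clean up afterwards.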

Sometime later, we should then remove the MMX intrinsics in LLVM IR. (Only
the intrinsics -- the machine-instruction and register definitions for MMX
should be kept indefinitely for use by assembly code.) That raises the
question of bitcode compat. Maybe we do something to prevent new use of the
intrinsics, but keep the implementations around for bitcode compatibility
for a while longer?

We might also consider defaulting to -mno-mmx for new compilations on
x86-64, which would have the additional effect of disabling the "y"
constraint in inline-asm. (MMX instructions could still exist in the
binary, but they'd need to be entirely contained within an inline-asm blob.)

Unfortunately, given the ABI requirement in x86-32 to use MMX registers for
8-byte-vector arguments and returns -- which we've been violating for 7
years -- we probably cannot simply use -mno-mmx by default on x86-32.
Unless, of course, we decide that we might as well just continue violating
the ABI indefinitely. (Why bother to be correct, after the better part of a
decade being incorrect...)

*Impact*
- No more %mm* register usage on x86-64, other than via inline-asm. No more
%mm* register usage on x86-32, other than inline-asm and when calling a
function that takes/returns 8-byte vectors (assuming we fix the
ABI-compliance issue).
- Since the default compiler flags include SSE2, most code will switch to
using SSE2 instructions instead of MMX instructions when using intrinsics,
and continue to compile fine. It'll also likely be faster, since MMX
instructions are legacy, and not optimized in CPUs anymore.
- Code explicitly disabling SSE2 (e.g. -mno-sse2 or -march=pentium2) will
stop compiling if it requires MMX intrinsics.
- Code using the intrinsics will run faster, especially on x86-64, where
the vectors are passed around in xmm registers and are copied to mm
registers just to run a legacy instruction. But even without that, the MMX
instructions also just have less throughput than the SSE2 variants on
modern CPUs.

*Alternatives*
We could keep both implementations of the functions in mmintrin.h, in order
to preserve the ability to use the intrinsics when compiling for a CPU
without SSE2.

However, this doesn't seem worthwhile to me -- we're talking about dropping
the ability to generate vectorized code using compiler intrinsics for Intel
Pentium MMX, Pentium II, and Pentium III (released 1997-1999), as well as
AMD K6 and K7 series chips of around the same timeframe.

We could also keep the clang headers mostly-unmodified, and make the llvm
IR builtins themselves expand to SSE2 instructions. I believe GCC has
effectively chosen this option. That seems less desirable; it'll be more
complex to implement at that level, versus in the headers, and doesn't
leave a path towards eliminating the builtins in the future.