Transitioning SSE/AVX code to NEON with SIMDe
Now that Apple has announced that they will be moving away from x86 to their own ARM-based CPUs, lots of people will be stuck with SIMD code targeting x86 ISA extensions like SSE, SSE2, AVX, etc., which won’t run on Apple’s new machines.
Arm CPUs do have support for SIMD, but instead of Intel technologies like SSE and AVX, Arm has NEON. NEON is an improvement over the x86 APIs in a lot of ways, and a regression in others, but it is undeniably different and you can’t just recompile your application on Arm and expect it to work.
Or can you? SIMD Everywhere (SIMDe) provides fast, portable, permissively-licensed (MIT) implementations of the x86 APIs which allow you to run code designed for x86/x86_64 CPUs pretty much anywhere, including on Arm (using NEON if available). With almost no source code changes, you can recompile your x86 SIMD code for Arm (or POWER, or WebAssembly, etc.).
If NEON is available, SIMDe will even use it to provide the x86 functions. For example, _mm_add_ps from SSE can be implemented using NEON’s vaddq_f32 function, so that’s exactly what SIMDe does.
For more complicated functions without direct analogs in NEON, SIMDe will use the fastest implementation we can. Hopefully that means calling multiple NEON functions, but in the worst-case scenario SIMDe has completely portable C99 fallbacks.
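The tiered approach described above can be pictured roughly as follows. This is a minimal sketch and not SIMDe's actual source: the type my_vec4f and the function my_add_ps are hypothetical names invented for illustration, and the real library covers many more targets and cases.

```c
#include <stddef.h>

#if defined(__ARM_NEON)
#  include <arm_neon.h>
#elif defined(__SSE__)
#  include <xmmintrin.h>
#endif

/* Hypothetical 4-float vector type for this sketch; not SIMDe's real type. */
typedef struct { float f32[4]; } my_vec4f;

/* Add two 4-float vectors using the best option known at compile time. */
static my_vec4f my_add_ps(my_vec4f a, my_vec4f b) {
  my_vec4f r;
#if defined(__ARM_NEON)
  /* NEON has a direct analog, so one vaddq_f32 does the work. */
  vst1q_f32(r.f32, vaddq_f32(vld1q_f32(a.f32), vld1q_f32(b.f32)));
#elif defined(__SSE__)
  /* On x86, defer to the native SSE intrinsic. */
  _mm_storeu_ps(r.f32, _mm_add_ps(_mm_loadu_ps(a.f32), _mm_loadu_ps(b.f32)));
#else
  /* Portable C99 fallback: a plain loop the compiler may auto-vectorize. */
  for (size_t i = 0; i < 4; i++) {
    r.f32[i] = a.f32[i] + b.f32[i];
  }
#endif
  return r;
}
```

Calling my_add_ps on two my_vec4f values gives the element-wise sum on any target; only the instructions chosen underneath differ.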
If you’d like to take SIMDe for a test drive, it is available on Compiler Explorer. Compilation is a bit slow, due to having to transfer large files, but it’s quite usable.
Before I continue, it’s worth noting that SIMDe has an active chat room, a less-active mailing list, and a very active issue tracker where questions are welcome. If you have any questions, problems, concerns, etc., please get in touch!
Getting SIMDe
I mentioned earlier that “almost no source code changes” are required. Many of you are probably worried about the word “almost”, so let’s discuss that for a bit.
First, you’ll need to get SIMDe. If you’re on Debian there is a libsimde-dev package, or on Fedora/Red Hat/etc. there is a simde package. Both are pretty new, though, so they may not be available to you yet.
If that doesn’t work for you, you can drop a copy of SIMDe into your project. If you want to use a git submodule that will work, but the main repository is pretty big thanks to all the tests. If you want something a bit smaller we also have a simde-no-tests repository which is basically a mirror of only the implementations that is updated automatically whenever SIMDe is updated.
SIMDe is a header-only library, and doesn’t require any build system integration; simply including the relevant headers is enough. That said, we do recommend aggressive optimizations (like -O3), and enabling OpenMP SIMD (which does not introduce a run-time dependency on OpenMP) with -fopenmp-simd on GCC and clang, or -qopenmp-simd on ICC. If you do enable OpenMP SIMD, please let SIMDe know by also passing -DSIMDE_ENABLE_OPENMP (not necessary if you enable full OpenMP, i.e. -fopenmp instead of -fopenmp-simd).
Source-level changes
As far as source-level changes are concerned, all you need to do is define the SIMDE_ENABLE_NATIVE_ALIASES macro, and include a SIMDe header instead of *mmintrin.h. SIMDe headers are named according to the ISA extension they supply, so you don’t need to remember which letter corresponds to which ISA extension when writing your code, but if you already have *mmintrin.h includes scattered around, here is how they map to SIMDe headers:
- mmintrin.h → simde/x86/mmx.h
- xmmintrin.h → simde/x86/sse.h
- emmintrin.h → simde/x86/sse2.h
- pmmintrin.h → simde/x86/sse3.h
- tmmintrin.h → simde/x86/ssse3.h
- smmintrin.h → simde/x86/sse4.1.h
- nmmintrin.h → simde/x86/sse4.2.h
Starting with AVX, Intel began using immintrin.h to include everything, so if you’re using immintrin.h just include the header for the “greatest” ISA extension you use; for example, if you want AVX-512F, include simde/x86/avx512f.h.
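To make the header swap concrete, here is a minimal sketch of what a ported SSE source file might look like. The helper add4 is a hypothetical name for illustration. The __has_include guard is an addition of this sketch so that it still builds on an x86 machine where SIMDe has not been installed yet; a real port would normally include the SIMDe header unconditionally, and it assumes SIMDe is on the include path.

```c
/* Detect SIMDe if the compiler supports __has_include (sketch-only guard). */
#if defined(__has_include)
#  if __has_include("simde/x86/sse.h")
#    define SKETCH_HAVE_SIMDE 1
#  endif
#endif

#if defined(SKETCH_HAVE_SIMDE)
   /* Ask SIMDe to also provide the original _mm_* names. */
#  define SIMDE_ENABLE_NATIVE_ALIASES
#  include "simde/x86/sse.h"   /* replaces xmmintrin.h */
#else
#  include <xmmintrin.h>       /* x86-only: what the code used before */
#endif

/* Hypothetical helper: out[i] = a[i] + b[i] for 4 floats.
 * The body is ordinary SSE code and needs no changes for the port. */
static void add4(const float a[4], const float b[4], float out[4]) {
  __m128 va = _mm_loadu_ps(a);
  __m128 vb = _mm_loadu_ps(b);
  _mm_storeu_ps(out, _mm_add_ps(va, vb));
}
```

The function body is identical in both branches, which is the point: only the include and the one macro definition change.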
Let’s take a look at that SIMDE_ENABLE_NATIVE_ALIASES macro. If you don’t define it, SIMDe will only define functions in its own simde_* namespace. For example, instead of _mm_add_ps you would need to use simde_mm_add_ps. If you do define SIMDE_ENABLE_NATIVE_ALIASES, SIMDe will also use a function-like macro to create an alias:
#define _mm_add_ps(a, b) simde_mm_add_ps(a, b)
While that works most of the time, there are a few things you’ll want to be aware of. Perhaps the biggest problem is that Intel doesn’t use fixed-width types (int8_t, int32_t, uint16_t, etc.) in their APIs; they instead assume specific characteristics of standard types which are true on x86 but may not be true on other platforms. For example, on many Arm platforms, char is unsigned, but Intel uses char to represent a signed 8-bit integer.
SIMDe deals with this by using fixed-width types in our implementations so they work everywhere, but if your code is using char to mean signed 8-bit integer you may encounter problems when attempting to use SIMDe functions on some platforms. The good news is that you can generally just change your code to use int8_t instead of char; it will work exactly the same on x86 (int8_t is likely just a typedef to signed char, which has the same range as char there), and it will also work on other platforms.
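The char pitfall is easy to reproduce without any SIMD at all. The following is a small sketch in plain C; all three function names are hypothetical and exist only to illustrate the difference between the two types.

```c
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Fragile: the result depends on whether plain char is signed on the target. */
static int count_negative_char(const char *lanes, size_t n) {
  int count = 0;
  for (size_t i = 0; i < n; i++) {
    count += (lanes[i] < 0);  /* never true where char is unsigned */
  }
  return count;
}

/* Portable: int8_t is a signed 8-bit integer on every platform that has it. */
static int count_negative_i8(const int8_t *lanes, size_t n) {
  int count = 0;
  for (size_t i = 0; i < n; i++) {
    count += (lanes[i] < 0);
  }
  return count;
}

/* 1 if plain char is signed on this target (as on x86), 0 otherwise. */
static int plain_char_is_signed(void) {
  return CHAR_MIN < 0;
}
```

Given the same bytes, count_negative_i8 reports the same count everywhere, while count_negative_char reports zero on targets where plain char is unsigned.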
Completeness
SIMD APIs are big. Very big. x86/x86_64 alone currently has a bit over 6,000 functions, of which SIMDe has implemented around 2,000.
That said, most of those are AVX-512F extensions which aren’t widely used yet. Odds are quite good that you’re only using extensions for which SIMDe already has complete support (as of v0.5.0, released 2020-06-22):
We also have a very good start on many other extensions, including AVX2, AVX-512F, AVX-512BW, AVX-512VL, and NEON (portable implementations of NEON that can run on x86, or anywhere else). Also, it’s not really a CPU extension, but our implementation of SVML is coming along nicely.
If SIMDe is missing a particular function you need, please file an issue and we may be able to prioritize an implementation. We’re planning to implement all functions anyway, and doing so in a slightly different order doesn’t generally create any extra work, so if it would help your project we’re generally happy to oblige. Of course, if you’re interested in implementing something yourself instead of waiting for us, patches are always welcome!
Debugging
SIMDe can be a fantastic tool for debugging. Not only can you see inside of the function to understand how it really works, you can also run the code on your development machine in your native environment without an emulator. Obviously you’ll eventually want to check everything at least in an emulator, or preferably on real hardware, but during development SIMDe can be immensely helpful.
Performance
Honestly, it’s pretty good. When there is a NEON function that implements exactly the same functionality there is no cost for using SIMDe instead of calling the NEON function directly; the compiler translates it to exactly the same code.
Even when we hit a portable fallback, the compiler is often smart enough to auto-vectorize the code, especially if you have aggressive optimizations (think -O3) enabled. We use compiler-specific functionality like GCC-style vector extensions (supported by pretty much every compiler except for MSVC) and builtins like __builtin_shufflevector, __builtin_shuffle, __builtin_convertvector, etc. wherever possible, which generally results in optimal implementations. The plain-loop fallbacks are decorated with pragmas from OpenMP 4 SIMD, Cilk+, or compiler-specific hints like GCC loop-specific pragmas or clang pragma loop hint directives.
We try very hard to make sure that even the fallbacks are fast.
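To give a feel for the two fallback styles mentioned above, here is a minimal sketch, not SIMDe's actual code. The names sketch_f32x4, sketch_mul4 and the SKETCH_ENABLE_OPENMP macro are hypothetical; the macro merely plays the role that SIMDE_ENABLE_OPENMP plays for SIMDe, since -fopenmp-simd alone does not define _OPENMP.

```c
#include <stddef.h>
#include <string.h>

#if defined(__GNUC__)  /* GCC, clang, and other compilers with GCC-style extensions */
#  define SKETCH_HAVE_VECTOR_EXT 1
/* GCC-style vector extension: four floats treated as one 16-byte value. */
typedef float sketch_f32x4 __attribute__((vector_size(16)));
#endif

/* Hypothetical helper: out[i] = a[i] * b[i] for 4 floats, no ISA-specific intrinsics. */
static void sketch_mul4(const float a[4], const float b[4], float out[4]) {
#if defined(SKETCH_HAVE_VECTOR_EXT)
  sketch_f32x4 va, vb, vr;
  memcpy(&va, a, sizeof va);
  memcpy(&vb, b, sizeof vb);
  vr = va * vb;                 /* element-wise; the compiler picks the instruction */
  memcpy(out, &vr, sizeof vr);
#else
  /* Plain loop, with an OpenMP SIMD hint if the build opted in. */
#  if defined(SKETCH_ENABLE_OPENMP)
#    pragma omp simd
#  endif
  for (size_t i = 0; i < 4; i++) {
    out[i] = a[i] * b[i];
  }
#endif
}
```

Neither branch names a specific instruction set, so the same source can compile to SSE, NEON, AltiVec, or scalar code depending on the target and flags.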
SIMDe will never make your project slower, only more portable. Performance likely won’t be as good as a manual rewrite by someone who knows NEON well, but you can get a port up and running at almost no cost in terms of developer resources, and once it’s done you’re free to mix, for example, SSE and NEON code at will. That means you can gradually port specific portions of your code which are particularly hot, or where SIMDe doesn’t do a good job (though in that case please file an issue too), while leaving areas where SIMDe performance is adequate alone instead of wasting development time and resources.
Still have questions?
The F.A.Q. has some information which may help.
If that doesn’t answer your question please feel free to ask in our chat room, on our mailing list, or on our issue tracker; if you have questions our documentation hasn’t answered, it’s a bug in our documentation, so don’t worry about using the issue tracker!