I’m pleased to announce the availability of the latest releases of SIMD Everywhere (SIMDe), version 0.8.0 and version 0.8.2, representing another year of work by over 20 contributors since version 0.7.6.

Request for help: SIMDe has only one maintainer (@mr-c)! Please inquire about assisting in new work, code review, and more.

SIMDe is a permissively-licensed (MIT) header-only library which provides fast, portable implementations of SIMD intrinsics for platforms which aren’t natively supported by the API in question.

For example, with SIMDe you can use SSE, SSE2, SSE3, SSE4.1 and 4.2, AVX, AVX2, and many AVX-512 intrinsics on ARM, POWER, WebAssembly, or almost any platform with a C compiler. That includes, of course, x86 CPUs which don’t support the ISA extension in question (e.g., calling AVX-512F functions on a CPU which doesn’t natively support them).

If the target natively supports the SIMD extension in question, there is no performance penalty for using SIMDe. Otherwise, accelerated implementations (NEON on ARM, AltiVec on POWER, WASM SIMD on WebAssembly, and so on) are used when available to provide good performance.
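The idea of mapping one API onto whatever the target offers can be sketched in a few lines of C. This is only an illustration under stated assumptions, not SIMDe's actual code: the type `vec_i32x4` and the function `vec_add_i32x4` are hypothetical names invented for this example. A native SSE2 or NEON instruction is used when the compiler advertises support, and a plain portable loop is used otherwise.

```c
#include <stdint.h>

#if defined(__SSE2__)
#  include <emmintrin.h>   /* native SSE2 intrinsics */
#elif defined(__ARM_NEON)
#  include <arm_neon.h>    /* native NEON intrinsics */
#endif

/* Hypothetical 4 x int32 vector type, for illustration only. */
typedef struct { int32_t lane[4]; } vec_i32x4;

/* Lane-wise wrapping addition of two 4 x int32 vectors. */
static vec_i32x4 vec_add_i32x4(vec_i32x4 a, vec_i32x4 b) {
  vec_i32x4 r;
#if defined(__SSE2__)
  /* x86 with SSE2: a single native vector add. */
  __m128i va = _mm_loadu_si128((const __m128i *) a.lane);
  __m128i vb = _mm_loadu_si128((const __m128i *) b.lane);
  _mm_storeu_si128((__m128i *) r.lane, _mm_add_epi32(va, vb));
#elif defined(__ARM_NEON)
  /* ARM with NEON: the equivalent native vector add. */
  vst1q_s32(r.lane, vaddq_s32(vld1q_s32(a.lane), vld1q_s32(b.lane)));
#else
  /* Portable fallback: add as unsigned for well-defined wraparound. */
  for (int i = 0; i < 4; i++) {
    r.lane[i] = (int32_t) ((uint32_t) a.lane[i] + (uint32_t) b.lane[i]);
  }
#endif
  return r;
}
```

All three branches compute the same result, so calling code does not need to know which one was selected at compile time.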

SIMDe is not just about implementing Intel/AMD intrinsics; it also has implementations for 99% of the ARM NEON intrinsics and in-progress support for others.

SIMDe has already been used to port several packages to additional architectures through either upstream support or distribution packages, particularly on Debian.

What’s new in 0.8.0 / 0.8.2

  • A 99% complete set of implementations of the NEON intrinsics, up from 56.46% in version 0.7.6! (@yyctw @wewe5215)
  • The start of RISCV64-optimized implementations using the RVV 1.0 vector extension! Thank you @eric900115 @howjmay @zengdage.
  • SIMDe PRs are tested using Fedora Rawhide (@junaruga)

As always, we have an extensive test suite to verify our implementations.

For a complete list of changes, check out the 0.8.0 and 0.8.2 release notes.

Below are some additional highlights:

X86

There are a total of 6876 SIMD functions on x86, 2930 (42.61%) of which have been implemented in SIMDe so far. Specifically for AVX-512, of the 5160 functions currently in AVX-512, SIMDe implements 1510 (29.26%).
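The percentages quoted in this post are simply implemented functions divided by total functions. A minimal sketch for checking them follows; `coverage_percent` and `print_coverage` are hypothetical helpers written for this example and are not part of SIMDe.

```c
#include <stdio.h>

/* Hypothetical helper: share of implemented functions, in percent. */
static double coverage_percent(unsigned implemented, unsigned total) {
  return total == 0 ? 0.0 : 100.0 * (double) implemented / (double) total;
}

/* Print one line in the "name: N of M (P%)" style used in this post. */
static void print_coverage(const char *name, unsigned implemented, unsigned total) {
  printf("%s: %u of %u (%.2f%%)\n", name, implemented, total,
         coverage_percent(implemented, total));
}
```

For instance, `coverage_percent(1510, 5160)` evaluates to roughly 29.26, matching the AVX-512 figure above.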

Note: Intel has removed the intrinsics that were unique to Intel Xeon Phi (ER, PF, 4MAPS, and 4VNNIW) from their intrinsic list. SIMDe will retain those few implementations we already had, but this changes how our completeness statistics are calculated.

Newly added function families

  • AES: 5 of 6 (83.33%)

Newly added AVX512 function families

  • castph: 1 of 9 (11.11%) implemented.
  • cvtus_storeu: 1 of 18 (5.56%) implemented.
  • fpclass: 3 of 24 (12.50%) implemented.
  • i32gather: 1 of 8 (12.50%) implemented.
  • i64gather: 8 of 8 (100.00%) implemented.
  • permutex: 3 of 12 (25.00%) implemented.
  • rcp14: 1 of 24 (4.17%) implemented.
  • reduce_max: 7 of 31 (22.58%) implemented.
  • reduce_min: 7 of 31 (22.58%) implemented.
  • shufflehi: 1 of 7 (14.29%) implemented.
  • shufflelo: 1 of 7 (14.29%) implemented.

Additions to existing families

  • AVX512BW: 7 additional, 337 total of 790 (42.66%)
  • AVX512DQ: 5 additional, 112 total of 376 (29.79%)
  • AVX512F: 48 additional, 1087 total of 2812 (38.66%)
  • AVX512_FP16: 15 additional, 17 total of 1105 (1.54%)

NEON

SIMDe currently implements 6608 out of 6670 (99.07%) NEON functions, up from 56.46% in the previous release!

Newly added families

  • abal
  • abal_high
  • abd
  • abdh
  • abdl_high
  • addhn_high
  • aes
  • bfdot
  • bfdot_lane
  • cadd_rot
  • cale
  • calt
  • cmla_lane
  • cmla_rot_lane
  • copy_lane
  • cvt_high
  • cvt_n
  • cvta
  • cvtn
  • cvtp
  • cvtx
  • cvtx_high
  • div
  • dupb_lane
  • duph_lane
  • eor3
  • fmlal
  • fms
  • fms_lane
  • fms_n
  • ld2_dup
  • ld2_lane
  • ld3_dup
  • ld3_lane
  • ld4_dup
  • maxnmv
  • minnmv
  • mla_lane
  • mla_high_lane
  • mls_lane
  • mlsl_high_lane
  • mmla
  • mull_high_lane
  • mull_high_n
  • mulx
  • mulx_lane
  • pmaxnm
  • pminnm
  • qdmlal
  • qdmlal_high
  • qdmlal_high_lane
  • qdmlal_high_n
  • qdmlal_lane
  • qdmlal_n
  • qdmlsl
  • qdmlsl_high
  • qdmlsl_high_lane
  • qdmlsl_high_n
  • qdmlsl_lane
  • qdmlsl_n
  • qdmlslh
  • qdmlslh_lane
  • qdmulhh
  • qdmulhh_lane
  • qdmull_high
  • qdmull_high_lane
  • qdmull_high_n
  • qdmull_lane
  • qdmull_n
  • qdmullh_lane
  • qmovun_high
  • qrdmlah
  • qrdmlah_lane
  • qrdmlahh
  • qrdmlahh_lane
  • qrdmlsh
  • qrdmlsh_lane
  • qrdmlshh
  • qrdmlshh_lane
  • qrdmulhh_lane
  • qrshl
  • qrshlh
  • qrshrn_high_n
  • qrshrnh_n
  • qrshrun_high_n
  • qrshrunh_n
  • qshl_n
  • qshlh_n
  • qshluh_n
  • qshrn_high_n
  • qshrnh_n
  • qshrun_high_n
  • qshrunh_n
  • raddhn
  • raddhn_high
  • rax
  • recp
  • rnd32x
  • rnd32z
  • rnd64x
  • rnd64z
  • rnda
  • rndx
  • rshrn_high_n
  • rsubhn
  • rsubhn_high
  • set_lane
  • sha1
  • sha1h
  • sha256
  • sha512
  • shll_high_n
  • shrn_high_n
  • sli_n
  • sm3
  • sm4
  • sqrt
  • st1_x2
  • st1_x3
  • st1_x4
  • st1q_x2
  • st1q_x3
  • st1q_x4
  • subhn_high
  • sudot_lane
  • usdot
  • usdot_lane

Finally complete families

  • cvtn
  • mla_lane

Getting Involved

If you’re interested in using SIMDe but need some specific functions to be implemented first, please file an issue and we may be able to prioritize those functions.

If you’re interested in helping out, please get in touch. We have a chat room on Matrix/Element if you have questions, or you can dive right in on the issue tracker.