Release Notes

Here we draft the release notes for the next release.

Note: format is [summary] [commit hash or PR#] [author(s)]

Use the release notes helper script to generate the preliminary list. Then group the changes, review the descriptions, and look out for ???? placeholders.

The first line of the commit message is usually a good summary, but please think through each entry and (re)write a summary that helps users quickly determine whether the change would be interesting or useful to them. For example, include the name of the intrinsic/function in the summary so that users don't have to click through each commit themselves.
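As a rough aid for the review step, the entry format above can be checked mechanically. The following is a minimal sketch, not part of the release notes helper script: the function name `parse_entry` and the returned keys are hypothetical, and it assumes that hashes are 7 to 40 lowercase hex digits, PR references look like `#123`, and authors look like `@handle`.

```python
import re

# Assumed token shapes (not taken from the helper script):
# commit hashes are 7-40 lowercase hex digits, PRs look like "#123".
REF = re.compile(r"^(?:[0-9a-f]{7,40}|#\d+)$")
AUTHOR = re.compile(r"^@\S+$")

def parse_entry(line):
    """Split one entry into summary, refs (hashes/PRs), and authors."""
    tokens = line.split()
    refs, authors = [], []
    # Peel reference and author tokens off the end of the line;
    # whatever remains at the front is the summary.
    while tokens and (REF.match(tokens[-1]) or AUTHOR.match(tokens[-1])):
        tok = tokens.pop()
        (authors if tok.startswith("@") else refs).insert(0, tok)
    return {
        "summary": " ".join(tokens),
        "refs": refs,
        "authors": authors,
        # Flag entries that still need a human look.
        "needs_review": "????" in line or not refs or not authors,
    }

if __name__ == "__main__":
    entry = parse_entry(
        "neon/abd: add much better implementations c3ddbbe @nemequ 220db33 @ngzhian"
    )
    print(entry["summary"], entry["refs"], entry["authors"])
```

An entry with a missing hash, a missing author, or a `????` placeholder comes back with `needs_review` set, which makes the leftover items easy to list before publishing.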

SIMDe 0.7.4

Summary

Minimum meson version is now 0.54
Initial support for x86 F16C API
Initial support for MIPS MSA API
Initial support for Arm Scalable Vector Extensions (SVE) API
Initial support for WASM SIMD128 API
Initial support for the E2K (Elbrus) architecture

X86

There are a total of 7470 SIMD functions on x86, 2971 (39.77%) of which have been implemented in SIMDe so far. Of the 5270 functions currently in AVX-512, SIMDe implements 1439 (27.31%).
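The percentages in this section are the implemented count divided by the total, rounded to two decimals. A minimal sketch for re-checking them when the counts change (the helper name `coverage` is hypothetical):

```python
def coverage(implemented, total):
    """Return the implemented share as a percentage string with two decimals."""
    return f"{100 * implemented / total:.2f}%"

if __name__ == "__main__":
    print(coverage(2971, 7470))  # all x86 functions -> 39.77%
    print(coverage(1439, 5270))  # AVX-512 functions -> 27.31%
```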

Newly added function families

AVX512CD: 21 of 42 (50.00%)
AVX512VPOPCNTDQ: 18 of 18 💯
AVX512_4VNNIW: 6 of 6 (100.00%)
AVX512_BF16: 9 of 38 (23.68%)
AVX512_BITALG: 24 of 24 💯
AVX512_FP16: 2 of 1105 (0.18%)
AVX512_VBMI2: 3 of 150 (2.00%)
AVX512_VNNI: 36 of 36 💯
AVX_VNNI: 8 of 16 (50.00%)

Additions to existing families

AVX512F: 579 additional, 856 total of 2660 (31.80%)
AVX512BW: 178 additional, 335 total of 828 (40.46%)
AVX512DQ: 77 additional, 111 total of 399 (27.82%)
AVX512_VBMI: 9 additional, 30 total of 30 💯
KNCNI: 113 additional, 114 total of 595 (19.16%)
VPCLMULQDQ: 1 additional, 2 total of 2 💯

Neon

SIMDe currently implements 3745 out of 6670 (56.15%) NEON functions. If you don't count 16-bit floats and poly types, it's 3745 / 4969 (75.37%).

Details

Implementation of Arm intrinsics

NEON

aarch64 + clang-1[345] fix for "implicit conversion changes signedness" a22c3cc @mr-c
neon: Implement f16 types 21496f6 @Glitch18
neon: port additional code to new style 1c744fd @nemequ
neon: replace some more abs/labs/llabs usage with simde_math_* versions c59853a @nemequ
neon: refactor to use different types on all targets c17957a @nemequ
neon: test for MMX/SSE instead of x86 when choosing implementation 0366dab @nemequ
neon/abd: add much better implementations c3ddbbe @nemequ 220db33 @ngzhian
neon/abs: add SSE2 integer abs implementations 6396dc8 @aqrit
neon/addhn: initial implementation e9ee066 @nemequ
neon/add: Implement f16 functions e69239c @Glitch18
neon/{add,sub}w_high: use vmovl_high instead of vmovl + get_high b897331 @nemequ
neon/bcax: initial implementation 96ce481 0ed3dea @Glitch18
neon/bsl: Implement f16 functions edb75b5 @Glitch18
neon/cage: Initial f16 implementations 20df81d @Glitch18
neon/cagt: Implement f16 functions 452a6d3 @Glitch18
neon/ceq: Implement f16 functions f24ab3d @Glitch18
neon/ceqz: Implement f16 functions dd2ebf2 de301cd @Glitch18
neon/cge: Implement f16 functions a512986 f3ad0d4 647dc12 @Glitch18
neon/cgez: complete implementation of CGEZ family 6d86a20 @Glitch18
neon/cgt: Add implementation of remaining functions 9930c43 @Glitch18
neon/cgt, simd128: improve some unsigned comparisons on x86 ae6702a @nemequ
neon/cgtz: Add implementations of remaining functions 4d749b5 @Glitch18
neon/cle: add some x86 implementations 5906cc9 d81c7e7 @nemequ 7894c7d @Glitch18
neon/clez: Add implementations of scalar functions bc72880 @Glitch18
neon/clt: Add implementations of scalar functions & SSE/AVX512 fallbacks bc636e1 6a19637 @Glitch18
neon/cltz: Add scalar functions and natural vector fallbacks 2960ef0 @Glitch18
neon/cmla, neon/cmla_rot{90,180,270}: check compiler versions e98152f @nemequ
neon/cmla, neon/cmla_rot{90,180,270}: CMLA requires armv8.3+ 280faae @nemequ
neon/cmla, neon/cmla_rot{90,180,270}, neon/fma: initial implementation 2aff4f9 @Glitch18
neon/cnt: add x86 implementations of vcntq_s8 a558d6d @nemequ
neon/cvt: add __builtin_convertvector implementations d06ea5b @nemequ
neon/cvt: add out-of-range and NaN tests 7d0e2ac @nemequ
neon/cvt: add some faster x86 float->int/uint conversions ceaaf13 @nemequ
neon/cvt: Add vcvt_f32_f64 and vcvt_f64_f32 implementations 8398f73 @Glitch18
neon/cvt: cast result of float/double comparison dc215cd @ngzhian
neon/cvt: disable some code on 32-bit x86 which uses _mm_cvttsd_si64 48edfa9 @nemequ
neon/cvt: don't use vec_ctsl on POWER 8f9582a @nemequ
neon/cvt: fix a couple of s390x implementations' NaN handling a8bd33d @nemequ
neon/cvt: fix compilation with -ffast-math d1d070d @nemequ
neon/cvt: Implement f16 functions b6a9882 @Glitch18
neon/cvt, relaxed-simd: add work-around for GCC bug #101614 11aa006 @nemequ
neon/cvt, simd128: fix compiler errors on PPC 965e68e @nemequ
neon/cvt: clang bug 46844 was fixed in clang 12.0 71e03a6 @mr-c
neon/dot_lane: add remaining implementation 3f1c1fa 4a9ca8a @Glitch18
neon/dup_lane: Complete implementation of function family 12fb731 df320d1 @Glitch18 014ee00 9461557 @nemequ
neon/dup_lane: use dup_n 2b4a009 @ngzhian
neon/dup_n: Implement f16 functions 14fdf88 @Glitch18
neon/dup_n: replace remaining functions with dup_n implementations 27a13b0 @nemequ
neon/dupq_lane: native and portable 893db57 @ngzhian
neon/ext: add __builtin_shufflevector implementation de8fe89 @ngzhian
neon/ext: add _mm_alignr_{,e}pi8 implementations 6d28f04 @nemequ
neon/ext: clean up shuffle-based implementation f1de709 @nemequ
neon/fma: add a couple x86 and PPC implementations 7a2860b @nemequ
neon/fma: add more extensive feature checking e541dd1 @nemequ
neon/fma_lane: Implement fmaq_lane functions a77e6ad 555ef3e @Glitch18
neon/fma_n: initial implementation 06d5a62 dab4342 @nemequ
neon/get_high: add __builtin_shufflevector optimizations 4003afa @ngzhian
neon/get_low: use __builtin_shufflevector if available ea3f75e @ngzhian
neon/hadd,hsub: optimization for Wasm ebe09d8 @ngzhian
neon/ld1: add Wasm SIMD implementation a79bc15 @ngzhian
neon/ld1_dup: native and portable (64-bit vectors), f64 debb3c8 @ngzhian 6c71aac @Glitch18
neon/ld1_dup: split from ld1, dup_n fallbacks, WASM implementations 4c586e0 @nemequ
neon/ld1: Implement f16 functions 6e89a9c f26f775 @Glitch18
neon/ld1_lane: Implement remaining functions de2de8d @Glitch18 9051a51 @ngzhian
neon/ld1q: u8_x2, u8_x3, u8_x4 341006c @ngzhian
neon/ld1[q]_*_x2: initial implementation cd14634 @dgazzoni
neon/ld{2,3,4}: disable -Wmaybe-uninitialized on all recent GCC e142a59 @nemequ
neon/ld{2,3,4}: silence false positive diagnostic on GCC 7 3f737a3 @nemequ
neon/ld2: Implement remaining functions e68f728 @Glitch18 3b3014f @ngzhian 078bb00 @nemequ 041b1bd @mr-c
neon/ld4_lane: native and portable implementations a973cab @ngzhian 179fb79 @Glitch18 0d1ab79 @nemequ
neon/ld4: use conformant array parameters 723a8a8 @nemequ
neon/ld4: work around spurious warning on clang < 10 64e9db0 @nemequ
neon/min: add SSE2 vminq_u32 & vqsubq_u32 implementation 2cf165e 117de35 @nemequ
neon/{min,max}nm: add some headers for -ffast-math ebe5c7d @nemequ
neon/{min,max}nm: use simde_math_* prefixed min/max functions c1607d2 @nemequ
neon/mlal_high_n: initial implementation d6f75fa @dgazzoni
neon/mlal_lane: initial implementation 82e36ed 2168ca0 @nemequ
neon/mls: add _mm_fnmadd_* implementations of vmls*_f* 70e0c20 @nemequ
neon/mlsl_high_n: initial implementation ca1a4c3 @dgazzoni
neon/mlsl_lane: initial implementation de78ae9 @nemequ
neon/mls_n: initial implementation 042c6eb @nemequ
neon/movl: improve WASM implementation ccffc23 @nemequ
neon/mul: add improved SSE2 vmulq_s8 implementation c6c6361 @nemequ
neon/mul: implement unsigned multiplication using signed functions 979552a @nemequ
neon/mul_lane: Add mul_laneq functions 86b039c 5d2e4bc @Glitch18
neon/mull_lane: initial implementation 4dd488d @nemequ
neon/neg: Complete implementation of function family 6423a26 @Glitch18
neon/padd: Add scalar function implementations fe21dc1 @Glitch18
neon/pmax: Add scalar function implementations a287eaa @Glitch18
neon/pmin: Add scalar function implementations 38f7499 @Glitch18
neon/qabs: add some faster implementations 6cd925e @nemequ
neon/qadd: add several improved x86 and vector extension versions 4e48e5c @nemequ
neon/qadd: fix warning in ternarylogic call in vaddq_u32 fad2470 @nemequ
neon/qadd: improve SSE implementation 8fbe7cd @nemequ
neon/qdmulh: add scalar & shuffle-based implementations 8cf3afc @nemequ 68e7a0e @Glitch18
neon/qdmulh_lane: native and portable 79dc1ee @ngzhian 1c64794 @Glitch18
neon/qdmulh_n: native and portable implementations 55a9c07 @ngzhian
neon/qdmull: add WASM implementations 7d7a43b @nemequ
neon/qrdmulh_lane: initial implementation dc2ea75 @nemequ 3794620 @ngzhian 9ab1446 @Glitch18
neon/qrdmulh: native aliases for scalar functions should be A64 f7820fc @nemequ
neon/qrdmulh: steal WASM q15mulr_sat implementation for qrdmulhq_s16 ccacf94 @nemequ
neon/qrshrn_n: Add scalar, native and portable function implementations ffa09ca @Glitch18 2595b3e @ngzhian
neon/qrshrun_n: Add scalar, native and portable function implementations 49300fa @Glitch18 d5e805b @ngzhian
neon/qshlu_n: initial implementation 77af9f1 f7b59a5 @Glitch18
neon/qshrn_n: initial implementation d9260dc @nemequ b4eed3e @Glitch18
neon/qshrun_n: native, scalar, and portable implementations c29f9fb @ngzhian eeaad75 @Glitch18
neon/qsub: add some SSE and vector extension implementations 1cb520a @nemequ
neon/recpe: recpe_f32 and recpe_f64, native and portable 629d129 5a27732 @ngzhian eb18b7c @nemequ 9d8e77f @Glitch18
neon/recps: recps/recpsq for native, scalar, and portable e8a8a09 7e420a1 @ngzhian 9c67d34 @Glitch18
neon/reinterpret: f16_u16 and u16_f16 implementations 9aedd5d @Glitch18 7f9794a @ngzhian
neon/rhadd: optimizations for rhaddq_xxx f730009 @aqrit
neon/rndi, sse2: work around several functions missing in GCC 0b6a9c1 @nemequ
neon/rndn: Add scalar function implementation d5d6509 d01618a 90c910b @Glitch18 050f935 @nemequ
neon/rshl: Add scalar function implementations c641cbd @Glitch18
neon/rshr_n: Add scalar function implementations 465c1ec 3a0ef81 @Glitch18
neon/rshrn_n: native and portable implementations a703711 @ngzhian
neon/rsqrte: Implement remaining functions 75c1495 @Glitch18 990b458 @nemequ 8781eb6 @ngzhian
neon/rsqrts: vrsqrts_f32 and vrsqrtsq_f32 native and portable de8c592 @ngzhian ed5e971 @Glitch18
neon/rsra_n: Add scalar function implementations 4944075 @Glitch18
neon/shl: Add scalar implementations 89fdad8 @Glitch18
neon/shll_n: native and portable implementations 98ac861 @ngzhian
neon/shl_n: Add scalar function implementations 267ab66 @Glitch18
neon/shlu_n: faster WASM implementations 5576d8a @nemequ
neon/shr_n: Add scalar function implementations e3e4b8e @Glitch18 e751352 @nemequ
neon/shrn_n: s16 s32 s64 u16 u32 u64 portable, native, WASM 8810cdd 40b4549 @ngzhian
neon/sqadd: initial implementation eab9d99 @Glitch18 1c0dabf @nemequ
neon/sra_n: Add scalar function implementations 272c2cf @Glitch18
neon/sri_n: add 128-bit, native, portable & scalar implementations aa832e1 @nemequ dcbcab5 @Glitch18 f6cf839 @ngzhian
neon/st1: Add f16 functions f58bd3c @Glitch18
neon/st2: Implement remaining functions 43c4b52 @Glitch18
neon/st2_lane: portable and native 8ee1eb4 @ngzhian 4cbed4a @Glitch18
neon/st2,st1: use zip + st1 to implement st2 7929406 @ngzhian
neon/st2: vst2(q) f32 s8 s16 s32 u8 u16 u32 1e38dcb @ngzhian
neon/st3: Add shuffle vector implementations 52da8d4 @Glitch18
neon/st3_lane: portable and native ae308b2 @ngzhian 982d2a9 @Glitch18
neon/st3q_u8: Wasm optimization 687460c @ngzhian
neon/st4_lane: portable and native b231820 @ngzhian 5be1b07 @Glitch18
neon/subhn: initial implementation ca62754 @nemequ
neon/sub: Implements the two remaining scalar functions 74e5b82 @Glitch18
neon/subl_high: initial implementation 36d6d11 @dgazzoni
neon/tbl: add WASM implementation of vtbl1_u8 d05fa59 @nemequ
neon/tst: implement scalar functions 41c2f8a @Glitch18
neon/types: remove duplicate NEON float16_t definitions 7f40f35 @dgazzoni
neon/types: reverse logic for SIMDE_ARM_NEON_FORCE_NATIVE_TYPES 7776a8c @nemequ
neon/types: use vector extensions for public types when available 790e263 @nemequ
neon/vdup: vdupq_lane_f32 native and portable e2ae5dc @ngzhian
neon/vld1q_dup: native and portable implementations 650d531 @ngzhian
neon/vld2_u8: native and portable implementation 85d2ed2 @ngzhian
neon/vld2: vld2_{u16,u32} and vld2q_{u8,u16,u32,f32} b43d434 @ngzhian
neon/vld4: Wasm optimization of vld4q_u8 07387bf @ngzhian
neon/vmovq: define vmovq_n as aliases for vdup_n ff7472b @ngzhian
neon/xar: initial implementation 50cd8af @Glitch18
neon/zip1: add armv7 implementations d4ded0a @nemequ

SVE Intrinsics

Initial import of a portable SVE implementation. f8f8382 9fd7d68 7311dd3 @nemequ
sve/add: initial implementation 70d5b0a 21b39aa 747e076 dd42b49 971aefb @nemequ
sve/and: initial implementation 5c56617 3382f4e @nemequ
sve/cmplt: replace vec_and with & for s390 implementations 7c599ea @nemequ
sve/dup: add *_m variants b90ae4d bad00e9 1da79a2 @nemequ
sve/ptest: simplify svptest_first c7e4699 @nemequ
sve/qadd: initial implementation 8aaa62b @nemequ
sve/sel: initial implementation 113ec2b a1e423e @nemequ
sve/types: add mmask4 functions for 256-bit vectors 33fbaa2 @nemequ
sve/whilelt: add svwhilelt_*_{u32,s64,u64} implementations 36927be 2b29fef @nemequ

WASM intrinsics

Add WebAssembly SIMD128 implementation. db758eb 20664a6 57efb02 20682c1 804b833 65db4cf bdc8698 271d1e4 631cf53 7078ab4 5c8d7b3 0e43903 c734535 34b775d 22609d4 f4ee32a 516eb02 1d4075c f73db2d c66df66 c2fda16 06b3462 d45f735 b7b69fb 8a748d7 6c57794 e60f1e0 c37dfd3 fdfa16a c4aa8b4 96226ff 732f519 2890ad4 706de03 fca719e 5638afa d013847 3d4b2ff 783c752 3378ab3 42f0a0b e8da237 22c0dee d9e3615 9848a4c 8a21137 5b1a330 dbd2e5c 09d8f79 e1bc968 @nemequ 2380aa4 @coderzh
wasm: load lane memcpy instead of cast to address UBSAN issues 7631312 @wrv
wasm: f32x4 and f64x2 nearest roundeven dc75f4c @wrv
relaxed-simd: initial support for the WASM relaxed SIMD proposal 083bd2f 3e5515a bf136e7 48954b6 9715924 @nemequ
clang wasm: add workaround to fix wasm_i64x2_shl bug 256d9df @Changqing-JING
simd128: clang 13 fixed bugs affecting simde_wasm_{v128_load8_lane,i64x2_load32x2} 7bc774f @mr-c
wasm simd128: correct trunc_sat _FAST_CONVERSION_RANGE target type e861f2c @mr-c

x86 intrinsics

Fix native aliases for amd64-only functions f0e9755 @nemequ
Add @aqrit's SSE2 min/max implementations d90e835 @nemequ
x86: fix AVX native → SSE4.2 native f6fc25a @mr-c
x86: ignore warnings about inefficient functions on lcc 416c243 @makise-homura
The fix for GCC bug #95483 wasn't in a release until 11.2 11d95f8 @nemequ
fix wrong array size (caught by GCC 12) c6179cb @Lithrein

SSE*

sse: avoid including windows.h when possible 750f20d @boris-kuz
sse: don't use armv7 impl of _MM_TRANSPOSE4_PS on armv8 b5fb757 @nemequ
sse, mmx: fix clang-11 on POWER a0e9f9f @nemequ
sse: prefer SIMDE_SHUFFLE_VECTOR implementation of _mm_shuffle_ps 377e350 @nemequ
sse: replace _mm_prefetch implementation 26d515f @nemequ
sse, sse2: clean up several shuffle macros cc6dc18 @nemequ
sse, sse2: fix vec_cpsgn order test 1465c48 @nemequ
sse, sse2: sync clang-12 changes for vec_cpsgn 1ba1596 @simba611
sse, sse2: work around GCC bug #100927 80472b7 @nemequ
sse: use portable implementation to work around llvm bug #344589 79738de @nemequ
sse: Suppress min/max macro definitions from windows.h 3465b57 @quyykk
sse: Fixed simde_mm_prefetch warnings 9c3d0dc @Epixu
sse2: add fast-math WASM implementation of _mm_cvtps_epi32 24c503f @nemequ
sse2: add parenthesis around macro arguments b394520 @nemequ
sse2: correct typos in simde_x_mm_broadcastlow_pd f8ce9bb @rosbif
sse2: don't require constants for _mm_srai_epi{16,32} 8bee92a @????
sse2: fix incompatible argument in A32 impl. of _mm_cvtps_epi32 b5fbe39 @jpcima
sse2: fix set but not used variable in _mm_cvtps_epi32 f460666 @nemequ
sse2: ignore broken _mm_loadu_si{16,32} on GCC 4b7394f @nemequ
sse2: prefer shuffle implementation of _mm_shuffle_epi32 to NEON d2ce706 @nemequ
sse2: remove AArch64 implementation of _mm_movemask_epi8 c595f6b @nemequ
sse2: remove statement expr requirement for NEON srli/srai macros da4d24f @nemequ
sse2, sse4.1: pull in improved packs/packus implementations from WASM 7b1df61 @nemequ
sse2: use simde_math_{add,sub}s_* for _mm_{add,sub}s_* functions 09d725d @nemequ
sse2: vcvtnq_s32_f32 is armv8-specific 98075d0 @nemequ
sse2: workaround missing vcvtnq_s32_f32 on GCC e11258e @jpcima
sse2: Fixed parameters to _mm_clflush d46f0e7 @thomasdwu
sse2 gcc: bug 99754 was fixed in GCC 12.1 6453f55 @mr-c
ssse3: Add SSE2 integer abs implementation 2de8624 @aqrit
sse4.1 _mm_insert_ps: incorrect handling of the control 94e7569 @MirJawadMairaj
sse4.1: add some casts to make clang -Weverything happy 5f000af @nemequ
sse4.1: fix AArch64 implementation of simde_x_mm_blendv_epi64 978d1f7 @milot-mirdita
sse4.1: _mm_blendv_epi8: add sse2 and update wasm_simd128 implementations 2dbc124 @aqrit
sse4.1: remove statement expr dependency in blend functions 01fb894 @nemequ
sse4.1: replace NEON implementations with shuffle-based implementations 29a3cb4 @nemequ
sse4.1: use NEON types instead of vector in insert implementations 489e36c @nemequ
sse4.2: re-enable native _mm_cmpgt_epi64 7117c48 @aqrit
sse4.2: work around more warnings on old clang 3f186a0 @nemequ
sse4.1: fix A32V7 version of _mm_test{nz,}c_si128 e7c70a2 @mr-c

AVX

avx: work around missing _mm256_{load,store}u_m128{,i,d} on LCC a3a39e2 @nemequ
avx: try to detect prior inclusion of AVX header and handle it e8b7a2e @nemequ
avx, avx512/cmp: properly handle NaN in _mm{,256,512}_cmp_{ps,pd,ss,sd} 491d3fa @nemequ
avx: use internal symbols in clang fallbacks for cmp_ps/pd functions 35b86b7 @nemequ
avx: work around incorrect maskload/store definitions on clang < 3.8 a9313de @nemequ
avx: add native calls for _mm256_insertf128_{pd,ps,si256} bab30bb @LaurentThomas
avx{,2}: fix maskload illegal mem access 39f723e @k-dominik

AVX2

avx2: add vector/shuffle implementation of _mm256_madd_epi16 2c2dd73 @nemequ
avx2: fix undefs for many native aliases 2ca5480 @anrodrig
avx2: added vector size conditional for unpack 287bda9 @simba611
avx2: separate natural vector length for float, int, and double types 6d1896d @nemequ

AVX512

avx512/{knot,kxor,cmp,cmpeq,compress,cvt,loadu,shuffle,storeu}: Additional AVX512{F,BW,VBMI2,VL} ops 1f8d1d2 @mr-c
avx512: work around several bugs in older versions of clang e64231e @nemequ
avx512: add several new functions ccc0757 @anrodrig b3535c3 @nemequ
avx512: implement mm*_mask(z)compress(storeu)* dab908e @simba611
avx512: implement mm_mask(z)_unpack* funcs 7aa3155 @simba611
avx512: initial implementation f35090a @simba611
avx512/4dpwssd: implement complete function family 5bbf50f @simba611
avx512/4dpwssds: initial implementation 22b8b97 @simba611
avx512/abs: add SSE2 implementation of _mm_abs_epi64 5c2f423 @aqrit
avx512/abs: work around buggy pd functions in GCC 7 - 8.2 605c92a @anrodrig
avx512/bitshuffle: initial implementation c92a13b @simba611
avx512/cmpeq: implement _mm512_mask_cmpeq_epi8_mask 88d2faf @nemequ
avx512/cmpge: finish implementing all functions 9a4d0de 0b5de15 @nemequ
avx512/cmp{g,l}e: AVX-512 implementations of non-mask functions ca1812d @nemequ
avx512/cmple: finish implementations of all cmple functions 06aa828 @nemequ
avx512/cmpneq: initial implementation of 128-bit and 256-bit functions 34194f2 @nemequ
avx512/compress: implement _mm256_mask_compress_pd d1223d4 @simba611
avx512/compress: implement _mm256_mask(z)_compress(storeu)_p* a7386b5 @simba611
avx512/compress: Mitigate poor compressstore performance on AMD Zen 4 54563e4 @mr-c
avx512/conflict: implement missing functions b6887ce c8f2755 @simba611
avx512/cvt: add _mm512_cvtepu32_ps and _mm{_mask,_maskz}_cvtepi64_pd 292e1e2 @nemequ
avx512/cvtt: add _mm{_mask,_maskz}_cvttpd_epi64 d2f518a e842f29 @nemequ
avx512/dbsad: initial implementation d659f42 0c76c5e @simba611
avx512/dpbf16: initial implementation 18b4e74 0ec8d72 @simba611
avx512/dpbusd: initial implementation 913a0a4 ff0d35a @simba611
avx512/dpbusds: complete function family 34f2488 @simba611
avx512/dpwssd: initial implementation 973df0e @simba611
avx512/dpwssds: initial implementation fe93582 @simba611
avx512/fixupimm: initial implementation 441339e @simba611
avx512/fmsub: implement fmsub functions for AVX512VL b7df811 @simba611
avx512/insert: implement inserti{,_mask,maskz}{32x8,64x2} mm512{_mask,_maskz}_insert{f32x8,64x2} 2c8b052 8e306d1 @simba611 0ba2085 @nemequ
avx512/load_pd: initial implementation 8445684 @operasfantom
avx512/load_ps: initial implementation d588049 @operasfantom
avx512/madd: explicitly promote 16-bit elements to 32-bit e5dd146 @nemequ
avx512/madd: fix arguments for native aliases ae545ce @nemequ
avx512/mullo: implement mm512_mullo_epi64 with mask(z) 8545d26 @????
avx512/multishift: initial implementation 6b125ec @simba611
avx512/or, avx512/xor: regenerate tests using 32-bit ints instead of 64 e1de51d @nemequ
avx512/or: implement mm512_mask(z)_or_ps/d functions 6cda738 b7933e6 @simba611
avx512/permutex2var: hard-code types in casts instead of using typeof 8893116 @nemequ
avx512/permutex2var: work around incorrect definition on old clang 647279d @nemequ
avx512/popcnt: initial implementation d5ec32a b17b646 @simba611
avx512/range: initial implementation mm512_range_ps/d functions d59e3f5 37ab069 8bf0305 @simba611 8bc81ca 8ccb363 6b8d8b8 b8e63b4 @nemequ
avx512/range_round, avx512/round: move range_round functions out of round d382488 @simba611
avx512/rol: implement remaining functions 9a52011 @simba611
avx512/rolv: initial implementation a2e7632 b1745c5 1fa7764 @simba611
avx512/round, avx512/roundscale: add shorter vector fallbacks b542b01 @simba611
avx512/roundscale: initial implementation e47e703 6ddf1a2 98e6a60 @simba611
avx512/roundscale_round: implement remaining functions db7a52a @simba611
avx512/scalef: initial implementation 581bf31 482bf32 @simba611 22be4e8 f60c159 @nemequ
avx512/set, avx512/popcnt: use _mm512_set_epi8 only when available aa5746f @nemequ
avx512/setzero: fix native aliases c900d5e @EleonoreMizo
avx512/shldv: initial implementation cddc500 @simba611 9b08cfc @nemequ
avx512/ternarylogic: initial implementation 30eb81e @nemequ 7faedd6 @simba611
avx512/unpack{hi,lo}: implement mask variants of unpacklo b2c176f @simba611
avx512/unpack{hi,lo}: implement mm256_mask(z)_unpack* functions ca8c102 @simba611
avx512/unpacklo: added vector size conditional 3924339 @simba611
avx512/unpacklo: implement mask variants of unpacklo 0c4775e @simba611
avx512/unpacklo: implement mm512_unpacklo_* functions 8582277 @simba611
avx512/xor: implement mm512_mask(z)_xor_pd/s functions 854f913 @simba611
Properly map __mm functions to __simde_mm 96c963f @psaab
simde/scalef: add scalef_ss/sd d9898e5 @simba611
gcc i686 mm*_dpbf16_ps: skip vector ops due to rounding error b721e9d @mr-c

GFNI

gfni: improve ARM NEON implementation a99a3ec @rosbif
gfni: add ARM, PPC and WASM implementations of gf2p8mul intrinsics 61126b3 @rosbif
gfni: add cast to work around -Wimplicit-int-conversion warning d066a1c @nemequ
gfni: remove unintentional dependency on vector extensions bdfa828 @nemequ
gfni: work around clang bug #50932 7d4beba @nemequ
gfni: work around error with vec_bperm on clang-10 on POWER 8620bd0 @nemequ
gfni: replace vec_and and vec_xor with & and ^ on z/arch f5577dc @nemequ
gfni: add many x86, ARM, z/Arch, PPC and WASM implementations 97eb961 @rosbif

XOP

xop: fix NEON implementation of maccs functions to use NEON types 6ecc0e3 @nemequ

F16C

f16c: initial implementation 62c1087 @nemequ
f16c: use __ARM_FEATURE_FP16_VECTOR_ARITHMETIC to detect Arm support eaeac09 @nemequ

FMA

fma: work around broken implementations of some functions on MCST LCC 269db2a @makise-homura
fma: add mls-based NEON implementations of fnmadd functions 55416aa @nemequ
fma: drop weird high-priority implementation in _mm_fmadd_ps 20922ff @nemequ
fma: use fma/fms instead of mla/mls on NEON 2fe84e5 @nemequ
fma: use NEON types in simde_mm_fnmadd_ps NEON implementation 44d38bd @nemequ
fma: fix return value of simde_mm_fnmadd_ps on NEON 87198d9 @nemequ
Fixed FMA detection macro on msvc 286ba3d @dhbloo

SVML

svml: trivial indentation fix 2176652 @nemequ
svml: remove some dead stores from cdfnorminv 11d97ba @nemequ

MIPS MSA intrinsics

Begin working on implementing MIPS MSA. e9c002a @nemequ
msa/add_a: initial implementation 6b37bb3 @nemequ
msa/addvi: initial implementation 8711327 @nemequ
msa/subv: initial implementation 75b3b73 @nemequ
msa/andi: initial implementation 31b7ce7 @nemequ
msa/and: initial implementation 6635520 @nemequ
msa/adds: initial implementation c37559c @nemequ
msa/adds_a: initial implementation bb84c44 @nemequ
msa/madd: initial implementation 1b89ab3 @nemequ
Many work-arounds for GCC with MSA, and support in the docker image. e5dbb93 @nemequ

Arch support

various: correct PPC and z/Arch versions plus typo ac8d722 @rosbif
arch: __ARM_ARCH now (v8.1+) encodes the minor version b0b22d1 @nemequ
arch: set SIMDE_ARCH_ARM for AArch64 on MSVC 1d8befc @nemequ

arm64

arm64 windows: fix simd128.h build error dad8cad @Changqing-JING
mips/msa: fix for Windows ARM64 0f988c9 @Changqing-JING
arm/neon: workaround on ARM64 windows bug b54dfcb @Changqing-JING

z/Arch

Correctly detect and handle z/Arch and its vector extensions 4a3f466 @nemequ
Fix z/Arch without zvector. b8af226 @nemequ
sse, sse2: add several z/Arch implementations 4f628ac @nemequ
sse2, sse4.1: additional z/Arch implementations for ksw2 ee24439 @milot-mirdita
Many additional z/Architecture implementations of x86 functions 5a2b035 @nemequ
sse4.1, neon/bsl: z/Arch implementations of blendv/bsl functions 80a8484 @nemequ
z/Architecture implementations for remaining min/max functions 694d547 @nemequ
neon/cvt: z/Arch implementations 107fab8 @nemequ
sse, sse4.1: z/Arch implementations of some rounding functions 9fb1509 @nemequ
sse, sse2, neon/dup_n: lots of z/Arch splat-based implementations 874d51f @nemequ
gfni: add z/Arch version c12f111 @rosbif
x86,arm/neon: Correct z/Arch versions 50fba9b @rosbif
features: add z/arch to SIMDE_NATURAL_VECTOR_SIZE d41999b @nemequ
arm/neon/qdmulh s390 gcc-12: __builtin_shufflevector is misbehaving 23a2441 @mr-c

Altivec

sse, sse2: generate to/from altivec functions for SSE/SSE2 types. dd3ff53 @nemequ
docker: power9-clang ignore deprecated-altivec-src-compat warnings b70f1a2 @mr-c
sse4.1: PPC AltiVec has no vec_splat_s64 debbf73 @rosbif
arch: fix SIMDE_ARCH_POWER_ALTIVEC_CHECK to include AltiVec check 8534e64 @nemequ
simd128: add AltiVec implementations of any/all_true a3b2630 @nemequ

e2k (Elbrus)

e2k: Introduce E2K (Elbrus) architecture 093b2c5 @makise-homura
e2k, ppc: Make shifts unsigned 24ddeba @makise-homura

Power

gcc power: bugs 1007[012] fixed in GCC 12.1 c23208d @mr-c
gcc power: vec_cpsgn argument reversal fixed in 12.0 296362c @mr-c

Testing with Docker/Podman & CI

gh-actions: add some bionic-era GCC builds ccdd24b @nemequ
gh-actions: add several clang builds e4b4646 @nemequ
drone: read testlog.txt if tests fail eb71d89 @nemequ
docker: add -march=z14 -mzvector to s390x-gcc-10 build. 8f60406 @nemequ
docker: use z13 instead of z14 for s390x architecture a524be2 @nemequ
docker: install meson from pip df63f88 @nemequ
docker: use meson 0.55.0 instead of 0.54.0. 5112bf2 @nemequ
docker: add platform dependent fixes for docker 3dd58b9 @Glitch18
docker: fix script exiting bug 6770ec0 @Glitch18
Remove Travis CI. 17a27e7 @nemequ
gh-actions: temporarily disable emscripten build 71ea291 @nemequ
codeql: analyze the merge commit d3a40e1 @mr-c
gh-actions: automatically detect whether to use SDE bb69b54 @nemequ
download-sde: be more tolerant of changes on Intel's web site 87bb927 @nemequ
meson: require meson version 0.54 349da2b @makise-homura
testing: Require exact matches for abs functions 9085d94 @jpcima
test: replace 1e-##precision with to_slop functions 9adcc21 @nemequ
test: allow passing INT_MAX for precision for exact comparisons e903b7f @nemequ
docker: only rebuild image if older than a week d9b1322 @nemequ
docker: fix build when the image doesn't exist yet ab3b509 @nemequ
drone: configure apt to retry failed downloads 1c442b4 @nemequ
gh-actions: disable clang-3.9 build 7fcb64d @nemequ
docker: skip date check when building image for the first time a1c4728 @Glitch18
docker: allow overriding the BUILD_IMAGE setting ca6f690 @nemequ
gh-actions: use ctest to run CMake tests so we can output on failure 03f6ebe @nemequ
cirrus: add -Db_lundef=false to sanitizer build 5a0fc02 @nemequ
gh-actions: try commit message without quotes on implementation-status 3f81cac @nemequ
gh-actions: add action to update the implementation-status repo 333f077 @nemequ
codecov: ignore test/ directory 65e7903 @nemequ
docker: Add a prompt before rebuilding image c2cff9f @Glitch18
docker: Fix BUILD_IMAGE always being set to 'y' 368a777 @Glitch18
travis: use -march=native and GCC on s390x 5b9b2af @nemequ
gh-actions: use -O2 instead of -O3 on emscripten 636f145 @nemequ
cmake: generate most declare-suites.h files 5d62f0d @nemequ
Add Windows ARM64 CI f12fd00 @tommyvct
gh-actions: only run MSVC Arm checks on msvc-arm branch 3d8a516 @nemequ
docker: use -O2 instead of -O3 on emscripten 3173499 @nemequ
gh-actions: switch emscripten build to Meson bde2cb1 @nemequ
ga: ubuntu-16.04 has been retired, migrate to ubuntu-18.04 6d0c65c @mr-c
ga: pin to macos-10.15 instead of -latest d64de8c @mr-c
docker: fix quoting error 830981b @mr-c
Azure: publish test results 51c24d8 @mr-c
tests: update download-iig.sh to account for Intel changes 2fdc9a5 @nemequ
test: fix download script for SDE b3b4975 @nemequ
Travis CI power9: try using all the cores to speed up b91516f @mr-c
CI: trim flags for icx/icpc 201dcdb @mr-c
CI: debian testing gcc: -Wno-error=stringop-overread af24d0c @mr-c
emscripten: turn off clang's -Wunsafe-buffer-usage for the tests 3caf71d @mr-c
update SDE download link 24338a2 @mr-c
CI: test using Intel® oneAPI DPC++/C++ Compiler instead of ICC df144ff @mr-c
update deps/images for CI 1cf39df @mr-c
GitHub Actions: Ubuntu 22.04 + system meson dd0b662 @mr-c
docker: aarch64-clang ; match drone.io flags bbe4416 @mr-c
docker: skip mips64el from cross-building d3f5fae @mr-c
Docker: tighten libstdc++NN-dev package selection c44539c @mr-c
docker: pass -future flag to sde for i686-all-gcc-9 d8658ea @mr-c
docker: icc, disable deprecation notice 505f24a @mr-c
docker: add Intel ICX testing 4a4eeb6 @mr-c
docker: add more cross building profiles for modern compilers 89e2c5b @mr-c
docker: qemu package doesn't exist & is unneeded 9ec8375 @mr-c
CI: fix loongson build on CircleCI 3db6d7a @mr-c
meson docs: don't use deprecated syntax 1a1a6eb @mr-c
CI: Update codecov to v3 for Node 16 support bd7f8df @wrv
CI: Update macos build to 11 c30a29b @wrv
CI: Comment out Ubuntu 18.04 build as will be unsupported in April 2023 6cefe47 @wrv
CI: Update to actions/checkout@v3 to avoid Node 12 warning 511b5b7 @wrv
SDE: add -future flag to support all x86 features caa3c6d @wrv
CI: add -fp-model precise for icx/icpx 7ec32ff @wrv
CI: update OSSAR action versions a1a63ac @wrv
CI: cancel GitHub Actions if there is a newer commit 8c56459 @mr-c
CI: GitHub Actions: test with gcc-12 f6db95d @mr-c
docker: enable use of ccache 4d42b90 @mr-c
docker: icx ignore no-tautological-constant-compare warning 97315b8 @mr-c
docker: add test with Debian default flags, also for armel 0a44b50 @mr-c
docker: sde tigerlake allows for advanced AVX512 testing 54b5d4e @mr-c
netlify: build amalgamated SVE header 41898ab @nemequ
travis: bring back some Travis builds 0ec9926 @nemequ
gh-actions: remove GCC 4.7 build 3997b8f @nemequ
docker: apt-get update before each other apt command 5560ca0 @nemequ
github-actions: add action to push to the simde-no-tests repository 1b4647f @milot-mirdita
gh-actions: move push-to-no-tests.yml into the right directory. 7fbb9c9 @nemequ
check-flags.sh: add lock around installing SDE 373e1e3 @nemequ
docker: add a bunch of cross files b718597 @nemequ
gh-actions: give up on getting commit ID in message for status repo 05ecb5d @nemequ
netlify: deploy wasm/simd128.h aa29a8b @nemequ
docker/Dockerfile: Use netselect-apt to speed up image build e98cf70 @Glitch18
gh-actions: add missing jobs property ddd453a @nemequ
download-iig: tweak script to fix download location 082a875 @nemequ
gh-actions, docker: add -fno-lax-vector-conversions to clang flags ccdfca9 @nemequ
sde: don't print URL in download-sde script. 55fc0e2 @nemequ
gh-actions: add -ffast-math builds for GCC and clang de616e7 @nemequ
Default to -DSIMDE_CONSTRAINED_COMPILATION when building tests 3d14f8e @nemequ
docker emscripten: remove experimental wasm flag for v8 496d88d @wrv
resume testing on aarch64 4d1639a @mr-c
docker: use qemu-*-static's elf interpreter prefix to simplify 1921112 @mr-c
github actions: cross-build & test powerpc64le, s390x (later) f0f3d09 @mr-c
docker: power: meson cpu_family is just powerpc64 0643db3 @mr-c
github actions: sleef: no ccache due to -march=native c709922 @mr-c
github actions: use ccache to speed up builds 73dddb7 @mr-c
github actions: clang 1[45]; gcc 12 on riscv64 with qemu e5c02d4 @mr-c
GitHub Actions: Resume running the MSVC arm tests on all branches 782d816 @mr-c
circle ci: ccache for non-native builds ee79d7d @mr-c
docker i686: cap at prescott, not =native e12eab6 @mr-c
circle ci: i686 was actually compiling for x86_64 :-( 7670e63 @mr-c
circle ci: test i686 with gcc-11 -O2 0a69604 @mr-c
circle-ci: modernize build, especially for i686 & loongson 22b73ba @mr-c
docker: ppc64el, riscv64, s390x cross compiling 3fa1d18 @mr-c
circleci: i686 gcc, only gcc-11 + O2 for now 7e70d02 @mr-c
GA Emscripten: temporarily only run "native" tests 1b6bde7 @mr-c
docker emscripten: fix v8 path, install ccache 8537dd9 @mr-c
ga: actionlint/shellcheck inspired cleanups 8182065 @mr-c
GitHub Actions qemu: resuming build+test on s390x cb6a0da @mr-c
CI: meson newer than 0.56 skips tests d08cb7b @mr-c
GitHub Actions: drop cmake for meson. 9d69cff @mr-c
circleci & cirrus: pipx instead of pip for meson 3211797 @mr-c
semaphore CI: fix test execution by using meson 1b05684 @mr-c
GitHub Actions: more skipping of emul tests to save time b473b8b @mr-c
Appveyor: use ccache 56b2ff2 @mr-c
Azure CI: use newer clang for check 3b663fe @mr-c

Misc

Improve abs function performance on SSE/SSE2 093f6ee @jpcima
Upgrade Hedley to v15 0d070e1 @nemequ
detect-clang: fix version numbers for clang < 4.0 8a2c645 @nemequ
align: add MCST LCC to compilers known to support __alignof__ 38e3840 @nemequ
common: add an MCST LCC check for vector features. e38fe50 @nemequ
complex: fix checks for GCC C complex math support ad8c7e0 @nemequ
Fix SIMDe link in no-tests README 21f7a2a @maxbachmann
common: enable OpenMP by default on LCC ff34d1b @nemequ
README: more thoroughly document OpenMP support 46c65e1 @nemequ
Add some files to .gitignore 8381a57 @nemequ
check-flags.sh: move download location from ~ to /opt/intel a361527 @nemequ
simde-features: fix C&P error 00fd88d @rosbif
{neon,simd128,avx512/abs}: provide vector versions of i64 abs d3976e0 @nemequ
common: improve check for C11 generic selections 11d2a6d @nemequ
common: don't use aligned OpenMP clause on MCST LCC a9a5a0d @nemequ
math: use simde_math_-prefixed abs/labs/llabs 813f4f0 @nemequ
diagnostic: silence -Wreserved-identifier warning from LLVM 0b6f5b2 @nemequ
Fix compilation with clang on POWER 5c43ac0 @nemequ
Work around issues preventing compilation on NVCC 3815c04 @nemequ
Don't set SIMDE_NO_CHECK_IMMEDIATE_CONSTANT in tests. 0c9fe4c @nemequ
common: move conversion functions for u32 <-> f32 into common 37e187c @nemequ
Add SIMDE_FAST_EXCEPTIONS option d01d58e @nemequ
Use SIMDE_HUGE_FUNCTION_ATTRIBUTES on several functions. 552c202 @nemequ
Add -s ENVIRONMENT=shell to emscripten flags 69d7655 @nemequ
Fix an assortment of small bugs 8b5d68c @simba611
Remove all && 0s in preprocessor macros. b6f21a9 @nemequ
Add constrained compilation mode a992f5b @simba611
Fix gcc-10 compilation on s/390x a10f12e @nemequ
simde-diagnostic: Include simde-arch 61cd8aa @Glitch18
Add many fast floating point to integer conversion functions 1fbe712 @nemequ
common: Use AArch64 intrinsics if _M_ARM64EC is defined 2a9e7b7 @tommyvct
Add -Wdeclaration-after-statement to the list of ignored warnings. bba815d @nemequ
Work around compound literal warning with clang 90523a2 @dgazzoni
Various fixes for -fno-lax-vector-conversions 39d902e @nemequ
Fix warnings with -fno-lax-vector-conversions e5ff228 @ngzhian
Improve widening pairwise addition implementations 3b950bb @nemequ
Wrap static assertions in code to disable -Wreserved-identifier d1fc7b5 @nemequ
Add missing static const in simde-math.h. NFC 6bd6562 @sbc100
wasm128, sse2: disable -Wvector-conversion when calling vgetq_lane_s64 679b970 @nemequ
test: skip NAN producing (sub-)tests for -ffast-math eb99f7c @mr-c
README: add CodeCov.io badge; freshen chat link 1c48030 @mr-c
emscripten: don't use __builtin_roundeven{f,} even if defined 51b9941 @mr-c

Template for next time

# Summary

# Details

## Implementation of NEON intrinsics:

## SVML

## x86 intrinsics

### MMX

### SSE*

### AVX

### AVX2

### AVX512

### GFNI 

### XOP

### F16C

## Testing with Docker/Podman & CI

## Misc
