Releases: oneapi-src/oneDNN
v3.7
Performance Optimizations
Intel Architecture Processors
- Improved performance of convolution and matmul primitives on Intel Xeon processors with Intel AMX instruction set support (formerly Sapphire Rapids and Granite Rapids).
- Improved performance of `int8` and `fp32` forward convolution primitive on processors with Intel AVX2 instruction set support.
- Improved performance of `fp8` matmul primitives with `bf16` and `fp16` bias data type on Intel Xeon processors with Intel AMX instruction set support (formerly Sapphire Rapids and Granite Rapids).
- Improved performance of `int8` RNN primitive on processors with Intel AVX2 and Intel AVX-512 instruction set support.
- Improved performance of `int8` depthwise separable convolution primitive with per-channel zero points on processors with Intel AVX2 and Intel AVX-512 instruction set support.
- Improved `fp16` and `bf16` softmax performance with relaxed accumulation mode.
- Improved performance of `int8` matmul primitive with `fp16` output data type.
- Improved performance of the following subgraphs with Graph API:
Intel Graphics Products
- Introduced initial optimizations for Intel GPUs based on Xe3 architecture.
- Improved performance for Intel Arc Graphics for Intel Core Ultra processors (Series 2) (formerly Lunar Lake) and Intel Arc B-series discrete graphics (formerly Battlemage).
- Improved performance of convolution with source zero points by pre-packing compensation.
- Improved performance of backward by data convolution with strides for large filters.
- Improved performance of the following subgraphs with Graph API:
  - Scaled Dot-Product Attention (SDPA) with implicit causal mask.
  - SDPA with `int8` or `int4` compressed key and value.
  - Gated MLP.
AArch64-based Processors
- Improved `bf16` matmul performance with `fp32` destination with Arm Compute Library (ACL).
- Improved `bf16` to `fp32` reorder performance.
- Improved `bf16` reorder performance.
- Improved `bf16` convolution with ACL.
NVIDIA GPUs
- Improved matmul performance using cuBLASLt-based implementation.
Functionality
Common
- Introduced support for `select` algorithm in binary primitive. The functionality is optimized for Intel CPUs.
- Extended quantization support in matmul and reorder with grouped scales and zero-points for weights. This functionality is optimized for Intel CPUs and GPUs (see the sketch after this list).
- Introduced initial support for 4-bit floating-point data types `f4_e2m1` and `f4_e3m0` in matmul and reorder, as well as `e8m0` scales data type in matmul and reorder. This functionality is available on Intel CPUs and GPUs.
- Introduced `GenIndex` and `GreaterEqual` operations in Graph API.
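
The grouped weight quantization mentioned above is driven through primitive attributes. The following is a minimal sketch, not taken from the release notes, of attaching per-group weight scales to an `int8` matmul; the shapes, group size, and data types are illustrative assumptions, and primitive creation may still fail on hardware where this configuration is not implemented.

```cpp
#include <iostream>
#include <oneapi/dnnl/dnnl.hpp>
using namespace dnnl;

int main() {
    engine eng(engine::kind::cpu, 0);

    const memory::dim M = 128, K = 4096, N = 4096;
    const memory::dim G = 128; // group size along the K dimension (illustrative)

    memory::desc src_md({M, K}, memory::data_type::s8, memory::format_tag::ab);
    memory::desc wei_md({K, N}, memory::data_type::s8, memory::format_tag::ab);
    memory::desc dst_md({M, N}, memory::data_type::f32, memory::format_tag::ab);

    primitive_attr attr;
    // Grouped scales for weights: the mask covers both weight dimensions,
    // groups are G elements along K and 1 along N. Grouped zero-points can be
    // requested analogously via attr.set_zero_points(DNNL_ARG_WEIGHTS, ...).
    attr.set_scales(DNNL_ARG_WEIGHTS, /*mask=*/(1 << 0) + (1 << 1),
            /*groups=*/{G, 1}, memory::data_type::f32);

    try {
        matmul::primitive_desc pd(eng, src_md, wei_md, dst_md, attr);
        std::cout << "matmul with grouped weight scales created\n";
    } catch (const error &e) {
        std::cout << "configuration not supported on this target: " << e.what() << "\n";
    }
    return 0;
}
```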
Intel Architecture Processors
- Introduced support for `fp32` matmul with `fp16` and `bf16` weights.
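
A hedged sketch of requesting that combination follows: the matmul is created with `f32` activations and destination and `bf16` weights. Shapes are illustrative, and depending on the build and target an fpmath mode attribute may also be required, so treat this as a starting point rather than the canonical recipe.

```cpp
#include <iostream>
#include <oneapi/dnnl/dnnl.hpp>
using namespace dnnl;

int main() {
    engine eng(engine::kind::cpu, 0);

    memory::desc src_md({64, 1024}, memory::data_type::f32, memory::format_tag::ab);
    memory::desc wei_md({1024, 1024}, memory::data_type::bf16, memory::format_tag::ab);
    memory::desc dst_md({64, 1024}, memory::data_type::f32, memory::format_tag::ab);

    primitive_attr attr;
    // If creation fails with the defaults, relaxing the floating-point math
    // mode is one knob to try (see the fpmath mode documentation):
    // attr.set_fpmath_mode(fpmath_mode::bf16);

    try {
        matmul::primitive_desc pd(eng, src_md, wei_md, dst_md, attr);
        std::cout << "f32 matmul with bf16 weights created\n";
    } catch (const error &e) {
        std::cout << "not supported on this target: " << e.what() << "\n";
    }
    return 0;
}
```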
Intel Graphics Products
- Introduced stochastic rounding support for convolution, matmul and reorder based on Philox counter-based random number generator.
- Introduced support for strided memory formats in convolution.
Generic GPU vendor
- Introduced support for reduction primitive.
- Introduced support for inner product primitive forward propagation.
Usability
Common
- With the SYCL runtime, memory objects on the CPU engine are now reference-counted and no longer need to be explicitly kept alive for the duration of the primitive execution. This aligns memory object lifetime behavior on CPU and GPU engines (see the sketch after this list).
- Added Graph API examples for Gated MLP and `int4` Gated MLP patterns.
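
A minimal sketch of the relaxed lifetime rule, assuming a SYCL-enabled build with a CPU engine: the source memory object is destroyed right after `execute()` is enqueued and before `stream::wait()`. The ReLU primitive and shapes are arbitrary choices for illustration.

```cpp
#include <oneapi/dnnl/dnnl.hpp>
using namespace dnnl;

int main() {
    engine eng(engine::kind::cpu, 0);
    stream strm(eng);

    memory::desc md({1, 64}, memory::data_type::f32, memory::format_tag::ab);
    eltwise_forward::primitive_desc pd(eng, prop_kind::forward_inference,
            algorithm::eltwise_relu, md, md, /*alpha=*/0.f, /*beta=*/0.f);
    eltwise_forward relu(pd);

    memory dst(md, eng);
    {
        memory src(md, eng);
        float *p = src.map_data<float>();
        for (int i = 0; i < 64; ++i) p[i] = i - 32.f;
        src.unmap_data(p);

        relu.execute(strm, {{DNNL_ARG_SRC, src}, {DNNL_ARG_DST, dst}});
    } // src goes out of scope here; the library keeps its buffer alive.
    strm.wait();
    return 0;
}
```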
Intel Architecture Processors
- Improved verbose diagnostics to better identify issues during dispatching, primitive and kernel creation for Intel CPU and Intel GPU implementations.
- Enabled frame pointers support on Intel64 platforms to improve integration with profilers.
Intel Processor Graphics
- Improved verbose diagnostics for Intel GPU driver compatibility issues.
- Improved support of large size tensors in convolution, matmul and reduction primitives on Intel GPUs.
- Reduced scratchpad usage for NCHW convolution on Intel GPUs.
AArch64-based Processors
- Added support for the Arm Compute Library (ACL) thread_local scheduler via ThreadpoolScheduler.
- Improved memory efficiency in ACL matmuls by fixing a bug where scratchpad memory was not being used.
- Made the ACL matmul primitive thread-safe which allows concurrent execution.
Validation
- Extended benchdnn with support and validation for `fp8` matmul patterns.
- Extended benchdnn with support for tensor tags in RNN primitive validation.
- Extended benchdnn with support for rewriting data types in the test JSON files in the graph driver.
- Extended benchdnn with support and validation for the number of partitions returned from the test JSON files.
Deprecated Functionality
- Experimental Graph Compiler is deprecated and will be removed in future releases.
Breaking Changes
- Updated minimal supported CMake version to 3.13 (was 2.8.12).
- Updated minimal supported GCC version to 8.0 (was 4.8).
- Updated minimal supported Clang version to 11.0 (was 3.0).
- Updated minimal supported ACL version to 24.11.1 (was 24.09).
- Removed support for SYCL standards preceding SYCL 2020.
- Enforced `fp32` accumulation mode in `fp16` matmul and inner product primitives on Intel Graphics products without Intel XMX cores. Previous behavior can be enabled with relaxed accumulation mode.
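
For reference, a hedged sketch of opting back into the previous behavior through the accumulation mode attribute; shapes are illustrative and creation fails where `fp16` math is unavailable on the target.

```cpp
#include <iostream>
#include <oneapi/dnnl/dnnl.hpp>
using namespace dnnl;

int main() {
    engine eng(engine::kind::cpu, 0); // or a GPU engine without Intel XMX cores

    memory::desc src_md({32, 256}, memory::data_type::f16, memory::format_tag::ab);
    memory::desc wei_md({256, 64}, memory::data_type::f16, memory::format_tag::ab);
    memory::desc dst_md({32, 64}, memory::data_type::f16, memory::format_tag::ab);

    primitive_attr attr;
    // Allow lower-precision accumulation instead of the enforced fp32 default.
    attr.set_accumulation_mode(accumulation_mode::relaxed);

    try {
        matmul::primitive_desc pd(eng, src_md, wei_md, dst_md, attr);
        std::cout << "fp16 matmul with relaxed accumulation created\n";
    } catch (const error &e) {
        std::cout << "not supported on this target: " << e.what() << "\n";
    }
    return 0;
}
```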
Thanks to our Contributors
This release contains contributions from the project core team as well as Aditya Tewari @aditew01, Alexandra Sidorova @a-sidorova, Atharva Dubey @AD2605, Deb Taylor @deb-intel, Dmitriy Ovchinnikov @inteldimitrius, Fadi Arafeh @fadara01, Hengyu Meng @airMeng, @hmaciak, John Karasev @karasjoh000, John Osorio @kala855, Keola Wierschem @kwiersch, Marek Michalowski @michalowski-arm, Michael Froelich @MichaelFroelich, Michał Górny @mgorny, Nicolò Scipione @s-Nick, Nikhil Sharma @nikhilfujitsu, Permanence AI Coder @Permanence-AI-Coder, @raistefintel, Ravi Pushkar @rpushkarr, Renato Barros Arantes @renato-arantes, Romain Biessy @Rbiessy, Ryo Suzuki @Ryo-not-rio, @Shreyas-fuj, Tadej Ciglarič @t4c1, Varad Ahirwadkar @varad-ahirwadkar, Viktoriia Gvozdeva @vgvozdeva, @vishwascm, @yair-obodovsky, Ye Tao @taoye9. We would also like to thank everyone who asked questions and reported issues.
v3.7-rc
Performance Optimizations
Intel Architecture Processors
- Improved fp16/bf16 softmax performance with relaxed accumulation mode.
- Improved performance for int8 RNN primitive on processors with Intel AVX2 and Intel AVX512 instruction set support.
- Improved performance of convolution and matmul primitives on processors with Intel AMX support.
- Improved performance of fp8 matmul primitives with bf16 and fp16 bias datatype on processors with Intel AMX instruction set support.
- Improved performance of int8 matmul primitive with fp16 output datatype.
- Improved performance of int8 depthwise separable convolution primitive with per-channel zero points on processors with Intel AVX2 and Intel AVX512 instruction set support.
Intel Graphics Products
- Introduced initial optimizations for GPUs based on Xe3 architecture.
- Improved performance for Intel Arc Graphics for Intel Core Ultra processors (Series 2) (formerly Lunar Lake) and Intel Arc B-series discrete graphics (formerly Battlemage).
- Improved performance of the following subgraphs with Graph API:
  - Scaled Dot-Product Attention (SDPA) with implicit causal mask.
  - Scaled Dot-Product Attention (SDPA) with int8/int4 compressed key and value.
Functionality
- Introduced support for `select` algorithm in binary primitive. The functionality is optimized for Intel CPUs.
- Enabled support for matmul primitive with grouped quantization on weights along the N dimension.
- Graph API: new `Select`, `GenIndex`, and `GreaterEqual` operations.
- Introduced support for fp16/bf16 compressed weights in fp32 matmul on Intel CPUs.
- Introduced support for grouped scales and zero points in reorder primitive.
- Enabled support for 4d weight scale in matmul primitive.
- Graph API: added support for Quantized and non-quantized Gated MLP pattern.
- Introduced preliminary support for 4-bit floating-point data types `f4_e2m1` and `f4_e3m0` in matmul and reorder, as well as `e8m0` scales data type in matmul and reorder.
Usability
- With the SYCL runtime, memory objects on the CPU engine are now reference-counted and no longer need to be explicitly kept alive by the user for the duration of the primitive execution. This aligns memory object lifetime behavior on CPU and GPU engines.
- Improved verbose diagnostics to better identify issues during dispatching, primitive, and kernel creation for CPU and GPU (OpenCL-based) primitive implementations.
- Improved verbose diagnostics to simplify debugging of nGEN fallbacks.
- Enabled frame pointers support on Intel64 platforms to improve integration with profilers.
- Added examples for Gated MLP and int4 Gated MLP.
Validation
- Extended benchdnn with support and validation for fp8 matmul patterns.
- Extended benchdnn with support for tensor tags in RNN primitive validation.
- Extended benchdnn with support for rewriting data types in the test JSON files in graph driver.
- Extended benchdnn with support and validation for the number of partitions returned from the test JSON files.
Breaking Changes
- Updated minimal supported CMake version to 3.13 (was 2.8.12).
- Updated minimal supported GCC version to 8.0 (was 4.8).
- Updated minimal supported Clang version to 11.0 (was 3.0).
- Removed support for SYCL older than 2020.
Thanks to these Contributors
This release contains contributions from the project core team as well as Aditya Tewari @aditew01, Alexandra Sidorova @a-sidorova, Atharva Dubey @AD2605, Deb Taylor @deb-intel, Dmitriy Ovchinnikov @inteldimitrius, Fadi Arafeh @fadara01, Hengyu Meng @airMeng, @hmaciak, John Osorio @kala855, Marek Michalowski @michalowski-arm, Michael Froelich @MichaelFroelich, Michał Górny @mgorny, Nikhil Sharma @nikhilfujitsu, Permanence AI Coder @Permanence-AI-Coder, @raistefintel, Ravi Pushkar @rpushkarr, Renato Barros Arantes @renato-arantes, Romain Biessy @Rbiessy, Ryo Suzuki @Ryo-not-rio, @Shreyas-fuj, Varad Ahirwadkar @varad-ahirwadkar, @vishwascm, and Ye Tao @taoye9. We would also like to thank everyone who asked questions and reported issues.
v3.6.2
This is a patch release containing the following changes to v3.6.1:
- Fixed segmentation fault issue in convolution primitive on processors with Intel AVX2 instruction set support (2eb3dd1)
- Added a workaround for build issue with GCC 8.2 and GNU binutils 2.27 (19ef223, 262fb02, e3782e8)
- Fixed a thread safety issue in matmul primitive for builds relying on Arm Compute Library (ACL) and bumped minimal supported ACL version to 24.11.1 (4d962e7)
- Suppressed spurious warnings for GCC (7d3164d, c805a50, e526172, dc780cb)
- Fixed segfaults in BRGEMM-based matmul, convolution, and deconvolution implementations on AArch64-based processors (a873a1c, 9a1dc92)
- Fixed performance regression in `bf16` convolution with ACL on AArch64-based processors (4793296)
- Fixed an issue with convolution primitive creation with `PREFER_YMM` CPU ISA hint on AArch64-based processors (e34d992)
- Improved `bf16` matmul performance with fp32 destination with ACL on AArch64-based processors (548d5d6)
- Improved `bf16` to `fp32` reorder performance on AArch64-based processors (917dd13)
- Fixed issue in matmul primitive with 4D tensors on AArch64-based processors (d13c966)
- Suppressed spurious GCC warnings in deconvolution primitive on AArch64-based processors (f90f60e)
- Fixed warnings in BRGEMM implementation on AArch64-based processors (866b196)
- Fixed correctness issue in reorder primitive with zero points for 4D shapes on AArch64-based Processors (836ea10)
- Improved `bf16` reorder performance on AArch64-based processors (12bafbe)
- Fixed performance regression for backward convolution primitive descriptor creation time on Intel processors (2b3389f)
- Improved performance of `fp16` matmul with `int4` weights on Intel GPUs based on Xe2 architecture (4c8fb2c, 3dd4f43, 280bd28)
- Fixed performance regression for `int8` convolution with large spatial sizes on processors with Intel AMX support (05d68df)
- Restricted check for microkernel fusion support to cases when fusion functionality is actually used on Intel GPUs (48f6bd9)
v3.6.1
This is a patch release containing the following changes to v3.6:
- Fixed convolution correctness issue in some scenarios involving persistent cache on Intel GPUs (e595e59)
- Fixed potential page faults in reduction primitive implementation for Intel GPUs (7740c75, a4fcef9, 32d8660)
- Implemented a workaround for GCC 13 bug that resulted in matmul hangs on some Intel Arc graphics SKUs (a30d526)
- Updated execution units (EU) number detection logic for Intel GPUs based on Xe2 architecture to accommodate for behavior changes in Linux driver (04e7eac, 97b04bd)
- Fixed build issue for static library with ONEDNN_VERBOSE=OFF (7f476cb)
- Fixed correctness issue in SYCL deconvolution implementation with post-ops (8f600a3)
- Fixed memory formats checks in SYCL softmax implementation (6ae73e4)
- Fixed correctness issue in SYCL resampling implementation with post-ops (9845057)
- Aligned accessor types in SYCL kernels with SYCL specification (0d9b3bd)
- Improved scales argument checks in generic SYCL kernels (9f73bf1, 7d85c75)
- Fixed correctness issue in int8 convolution with sum post-op on NVIDIA GPUs (7486ed8)
- Relaxed accuracy test threshold for bf16 softmax on NVIDIA GPUs (e9d0fdb)
- Added support for bf16 and fp16 bias for fp8 matmul on Intel CPUs (188ae7f)
- Fixed a bug that prevented dispatching Intel AVX-512 with Intel DL Boost implementation in int8 RNN primitive (bf58e72)
- Fixed a runtime failure with `CL_OUT_OF_RESOURCES` error in fp16 convolution on Intel Arc graphics (39a5f67, 7e1663f)
v3.6
Performance Optimizations
Intel Architecture Processors
- Improved performance for 4th generation Intel Xeon Scalable processors (formerly Sapphire Rapids).
- Improved performance for Intel Xeon 6 processors (formerly Granite Rapids).
- Improved performance of group normalization primitive.
- Improved `bf16` matmul performance with `int4` compressed weights on processors with Intel AMX instruction set support.
- Improved performance of `fp8` matmul, pooling, and eltwise primitives on processors with Intel AMX instruction set support.
- Improved `fp32` RNN primitive performance on processors with Intel AVX2 instruction set support.
- Improved performance of the following subgraphs with Graph API:
  - `convolution` and `binary` operation fusions with better layout selection in Graph API.
  - `fp8` `convolution` and `unary` or `binary` on processors with Intel AMX instruction set support.
  - Scaled Dot Product Attention (SDPA) without scale, Multi-Query Attention (MQA), and Grouped Query Attention (GQA) patterns.
  - `LayerNorm`, `GroupNorm`, and `SoftMax` with `int8` quantized output and zero-points.
Intel Graphics Products
- Improved performance for the Intel Data Center GPU Max Series (formerly Ponte Vecchio).
- Introduced broad production quality optimizations for Intel Arc Graphics for Intel Core Ultra processors (Series 2) (formerly Lunar Lake).
- Introduced broad production quality optimizations for future discrete GPU based on Xe2 architecture (code name Battlemage).
- Introduced support for Intel Arc Graphics for future Intel Core Ultra processor (code name Arrow Lake-H).
- Improved performance of `fp8_e5m2` primitives on Intel Data Center GPU Max Series (formerly Ponte Vecchio).
- Improved matmul and inner product primitives performance for shapes relevant to large language models (LLMs) on GPUs with Intel XMX support.
- Improved `int8` convolution performance with weight zero-points.
- Reduced primitive creation time for softmax, layer normalization, and concat primitives via kernel reuse.
- Improved performance of the following subgraphs with Graph API:
  - SDPA without scale, MQA, and GQA patterns. `f16` variants of these patterns significantly benefit from Intel(R) Xe Matrix Extensions (Intel(R) XMX) support.
  - `fp8`, `convolution`, and `unary` or `binary` on the Intel Data Center GPU Max Series.
  - `LayerNorm`, `GroupNorm`, and `SoftMax` with `int8` quantized output and zero-points.
AArch64-based Processors
- Improved `fp32` convolution backpropagation performance on processors with SVE support.
- Improved reorder performance for blocked format on processors with SVE support.
- Improved `bf16` softmax performance on processors with SVE support.
- Improved batch normalization performance on processors with SVE support.
- Improved matmul performance on processors with SVE support.
- Improved `fp16` convolution with Arm Compute Library (ACL).
- Improved matmul performance with ACL.
- Switched matmul and convolution implementation with ACL to stateless API significantly improving primitive creation time and increasing caching efficiency and performance for these operators.
Functionality
- Introduced generic GPU support. This implementation relies on portable SYCL kernels and can be used as a starting point to enable new devices in oneDNN.
- Extended functionality supported on NVIDIA GPUs and AMD GPUs with SYCL-based implementations.
- Enabled support for `int8` activations with grouped scales and `int8` or `int4` compressed weights in matmul primitive. This functionality is implemented on Intel GPUs.
- Introduced support for stochastic rounding for `fp8` data type functionality.
- [experimental] Extended microkernel API:
  - Introduced `int8` quantization support.
  - Extended transform microkernel with transposition support and support for arbitrary strides.
  - Introduced verbose diagnostics support.
- [experimental] Extended sparse API:
  - Introduced support for sparse memory with coordinate (COO) storage format.
  - Extended matmul primitive to work with sparse memory in COO format. This functionality is implemented on CPUs and Intel GPUs.
- Introduced `int8` support in eltwise primitive with `clip` algorithm. This functionality is implemented on CPUs (see the sketch after this list).
- Graph API:
  - Introduced `GroupNorm` operation and fusions in Graph API.
  - Introduced support for standalone `StaticReshape` and `StaticTranspose` operations.
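
For illustration only, a small sketch of the `int8` clip case named above, where the eltwise `alpha` and `beta` parameters act as the lower and upper clipping bounds; the shapes and bounds are arbitrary.

```cpp
#include <oneapi/dnnl/dnnl.hpp>
using namespace dnnl;

int main() {
    engine eng(engine::kind::cpu, 0);
    stream strm(eng);

    memory::desc md({1, 32}, memory::data_type::s8, memory::format_tag::ab);
    // Clip every int8 element to the [-8, 8] range (alpha = lower, beta = upper bound).
    eltwise_forward::primitive_desc pd(eng, prop_kind::forward_inference,
            algorithm::eltwise_clip, md, md, /*alpha=*/-8.f, /*beta=*/8.f);
    eltwise_forward clip(pd);

    memory src(md, eng), dst(md, eng); // data initialization omitted for brevity
    clip.execute(strm, {{DNNL_ARG_SRC, src}, {DNNL_ARG_DST, dst}});
    strm.wait();
    return 0;
}
```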
Usability
- Added examples for SDPA, MQA, and GQA patterns implementation with Graph API.
- Added an example for deconvolution primitive.
- Added examples for Vanilla RNN and LBR GRU RNN cells.
- Introduced support for Intel oneAPI DPC++/C++ Compiler 2025.0.
- Introduced interoperability with SYCL Graph record/replay mode.
- Removed dependency on OpenCL runtime for NVIDIA and AMD GPUs.
- [experimental] Introduced logging mechanism based on spdlog library.
- Introduced support for `ONEDNN_ENABLE_WORKLOAD` build knob for Graph API.
- Improved performance of `get_partitions()` function in Graph API.
Validation
- Introduced protection from out-of-memory scenarios in benchdnn Graph API driver.
Deprecated Functionality
- Experimental Graph Compiler is deprecated and will be removed in future releases.
Breaking Changes
- Experimental microkernel API in this release is not compatible with the version available in oneDNN v3.5.
- Updated minimal supported ACL version to 24.08.1 (was 24.04).
Thanks to these Contributors
This release contains contributions from the project core team as well as Abdel @quickwritereader, Adam Jackson @nwnk, Aleksandr Voron @alvoron, Alexey Makarevich @amakarev, Annop Wongwathanarat @annop-w, Daniel Kuts @apach301, @deepeshfujitsu, Fadi Arafeh @fadara01, Fritz Heckel @fwph, Gorokhov Dmitriy @dmitry-gorokhov, Deeksha Kasture @kasturedeeksha, Kentaro Kawakami @kawakami-k, Marek Michalowski @michalowski-arm, @matthias-bonne, @Menooker, Michael Froelich @MichaelFroelich,
Nicolas Miller @npmiller, Nikhil Sharma @nikhilfujitsu, @nishith-fujitsu, Permanence AI Coder @Permanence-AI-Coder, Radu Salavat @Radu2k, Renato Barros Arantes @renato-arantes, Robert Cohn @rscohn2, Robert Hardwick @robert-hardwick, Ryo Suzuki @Ryo-not-rio, Shreyas-fuj @Shreyas-fuj, Shu Chen @shu1chen, Siddhartha Menon @Sqvid, Song Jiaming @Litchilitchy, Vladimir Paramuzov @vladimir-paramuzov, Yifei Zhang @yifeizh2. We would also like to thank everyone who asked questions and reported issues.
v3.6-rc
Performance Optimizations
Intel Architecture Processors
- Improved performance for 4th generation Intel Xeon Scalable processors (formerly Sapphire Rapids).
- Improved performance for Intel Xeon 6 processors (formerly Granite Rapids).
- Improved performance of group normalization primitive.
- Improved bf16 matmul performance with int4 compressed weights on processors with Intel AMX instruction set support.
- Improved performance of `fp8` matmul, pooling, and eltwise primitives on processors with Intel AMX instruction set support.
- Improved `fp32` RNN primitive performance on processors with Intel AVX2 instruction set support.
- Improved performance of the following subgraphs with Graph API:
  - `convolution` and `binary` operation fusions with better layout selection in Graph API.
  - `fp8` `convolution` and `unary` or `binary` on processors with Intel AMX instruction set.
  - Scaled Dot Product Attention (SDPA) without scale, Multi-Query Attention (MQA), and Grouped Query Attention (GQA) patterns.
  - `LayerNorm`, `GroupNorm`, and `SoftMax` with `int8` quantized output and zero-points.
Intel Graphics Products
- Improved performance for the Intel Data Center GPU Max Series (formerly Ponte Vecchio).
- Introduced broad production quality optimizations for Intel Arc Graphics for Intel Core Ultra Processors (Series 2) (formerly Lunar Lake).
- Introduced broad production quality optimizations for future discrete GPU based on Xe2 architecture (code name Battlemage).
- Introduced support for Intel Arc Graphics for future Intel Core Ultra Processor (code name Arrow Lake-H).
- Improved performance of `fp8_e5m2` primitives on Intel Data Center GPU Max Series (formerly Ponte Vecchio).
- Improved matmul and inner product primitives performance for shapes relevant to large language models (LLMs) on GPUs with Intel XMX support.
- Improved `int8` convolution performance with weight zero points.
- Reduced primitive creation time for softmax, layer normalization, and concat primitives via kernel reuse.
- Improved performance of the following subgraphs with Graph API:
  - SDPA without scale, MQA, and GQA patterns. `f16` variants of these patterns significantly benefit from Intel(R) Xe Matrix Extensions (Intel(R) XMX) support.
  - `fp8` `convolution` and `unary` or `binary` on Intel Data Center GPU Max Series.
  - `LayerNorm`, `GroupNorm`, and `SoftMax` with `int8` quantized output and zero-points.
AArch64-based Processors
- Improved `fp32` convolution backpropagation performance on processors with SVE support.
- Improved reorder performance for blocked format on processors with SVE support.
- Improved `bf16` softmax performance on processors with SVE support.
- Improved batch normalization performance on processors with SVE support.
- Improved matmul performance on processors with SVE support.
- Improved `fp16` convolution with Arm Compute Library (ACL).
- Improved matmul performance with ACL.
- Switched matmul and convolution implementation with ACL to stateless API significantly improving primitive creation time and increasing caching efficiency and performance for these operators.
Functionality
- Introduced generic GPU support. This implementation relies on portable SYCL kernels and can be used as a starting point to enable new devices in oneDNN.
- Extended functionality supported on NVIDIA GPUs and AMD GPUs with SYCL based implementations.
- Enabled support for `int8` activations with grouped scales and `int8` or `int4` compressed weights in matmul primitive. This functionality is implemented on Intel GPUs.
- Introduced support for stochastic rounding for `fp8` data type functionality.
- [experimental] Extended microkernel API:
  - Introduced `int8` quantization support.
  - Extended transform microkernel with transposition support and support for arbitrary strides.
  - Introduced verbose diagnostics support.
- [experimental] Extended sparse API:
  - Introduced support for sparse memory with coordinate (COO) storage format.
  - Extended matmul primitive to work with sparse memory in COO format. This functionality is implemented on CPUs and Intel GPUs.
- Introduced `int8` support in eltwise primitive with `clip` algorithm. This functionality is implemented on CPUs.
- Graph API:
  - Introduced `GroupNorm` operation and fusions in Graph API.
  - Introduced support for standalone `StaticReshape` and `StaticTranspose` operations.
Usability
- Added examples for SDPA, MQA, and GQA patterns implementation with Graph API.
- Added an example for deconvolution primitive.
- Added examples for Vanilla RNN and LBR GRU RNN cells.
- Introduced support for Intel DPC++/C++ Compiler 2025.0.
- Introduced interoperability with SYCL Graph record/replay mode.
- Removed dependency on OpenCL runtime for NVIDIA and AMD GPUs.
- [experimental] Introduced logging mechanism based on spdlog library.
- Introduced support for `ONEDNN_ENABLE_WORKLOAD` build knob for Graph API.
- Improved performance of `get_partitions()` function in Graph API.
Validation
- Introduced protection from out of memory scenarios in benchdnn Graph API driver.
Breaking Changes
- Experimental microkernel API in this release is not compatible with the version available in oneDNN v3.5.
- Updated minimal supported ACL version to 24.08.1 (was 24.04).
Thanks to these Contributors
This release contains contributions from the project core team as well as Abdel @quickwritereader, Adam Jackson @nwnk, Aleksandr Voron @alvoron, Alexey Makarevich @amakarev, Annop Wongwathanarat @annop-w, Daniel Kuts @apach301, @deepeshfujitsu, Fadi Arafeh @fadara01, Fritz Heckel @fwph, Gorokhov Dmitriy @dmitry-gorokhov, Deeksha Kasture @kasturedeeksha, Kentaro Kawakami @kawakami-k, Marek Michalowski @michalowski-arm, @matthias-bonne, @Menooker, Michael Froelich @MichaelFroelich, Nicolas Miller @npmiller, Nikhil Sharma @nikhilfujitsu, @nishith-fujitsu, Permanence AI Coder @Permanence-AI-Coder, Radu Salavat @Radu2k, Renato Barros Arantes @renato-arantes, Robert Cohn @rscohn2, Robert Hardwick @robert-hardwick, Ryo Suzuki @Ryo-not-rio, Shreyas-fuj @Shreyas-fuj, Shu Chen @shu1chen, Siddhartha Menon @Sqvid, Song Jiaming @Litchilitchy, Vladimir Paramuzov @vladimir-paramuzov, Yifei Zhang @yifeizh2. We would also like to thank everyone who asked questions and reported issues.
v3.5.3
This is a patch release containing the following changes to v3.5.2:
- Fixed correctness issue in convolution weight gradient for small shapes on Intel GPUs (49eee6a, 281dd3b)
- Extended MLP patterns supported by experimental Graph Compiler to cover cases relevant to ChatGLM model (ff680fc)
- Fixed performance regression in bf16 depthwise convolution on Intel CPUs (d6c216a)
v3.5.2
This is a patch release containing the following changes to v3.5.1:
- Fixed performance regression for some Graph API subgraphs with LayerNorm operation (82f629c)
- Fixed runtime error for Graph API subgraphs including 6D LayerNorm operation (f704f09)
- Fixed an issue with host compiler version detection in SYCL configurations (730b976)
- Fixed an issue with missing `DNNL_TARGET_ARCH` define for builds not relying on CMake (87848b9)
- Fixed a test issue for matmul with low-precision scales and/or zero-points (91c35d8)
- Fixed segfault issue in bfloat16 shuffle on AArch64 processors (9116681)
- Fixed runtime issue in quantized layer normalization pattern with Graph API (0013e8c)
v3.4.4
v3.5.1
This is a patch release containing the following changes to v3.5:
- Fixed potential page fault in matmul on Intel Datacenter Max Series GPUs (a9c525d)
- Fixed potential stack overflow issue in convolution implementation for Intel GPUs (0fb7e6e)
- Added test cases for matmul with compressed weights (015ccb1)
- Extended Graph API `LayerNorm` operation with zero points support (dc2701a)
- Fixed primitive creation error for depthwise convolution backpropagation on Intel GPUs (4a045e4, b529d22)