Delta Lake 3.0.0
We are excited to announce the final release of Delta Lake 3.0.0. This release includes several exciting new features and artifacts.
Highlights
Here are the most important aspects of 3.0.0:
Spark 3.5 Support
Unlike the initial preview release, Delta Spark is now built on top of Apache Spark™ 3.5. See the Delta Spark section below for more details.
Delta Universal Format (UniForm)
- Documentation: https://docs.delta.io/3.0.0/delta-uniform.html
- Maven artifacts: delta-iceberg_2.12, delta-iceberg_2.13
Delta Universal Format (UniForm) will allow you to read Delta tables with Hudi and Iceberg clients. Iceberg support is available with this release. UniForm takes advantage of the fact that all table storage formats, such as Delta, Iceberg, and Hudi, actually consist of Parquet data files and a metadata layer. In this release, UniForm automatically generates Iceberg metadata and commits to Hive metastore, allowing Iceberg clients to read Delta tables as if they were Iceberg tables. Create a UniForm-enabled table using the following command:
CREATE TABLE T (c1 INT) USING DELTA TBLPROPERTIES (
'delta.universalFormat.enabledFormats' = 'iceberg');
Every write to this table will automatically keep Iceberg metadata updated. See the documentation here for more details, and the key implementations here and here.
Delta Kernel
- API documentation: https://docs.delta.io/3.0.0/api/java/kernel/index.html
- Maven artifacts: delta-kernel-api, delta-kernel-defaults
The Delta Kernel project is a set of Java libraries (Rust will be coming soon!) for building Delta connectors that can read (and, soon, write to) Delta tables without the need to understand the Delta protocol details).
You can use this library to do the following:
- Read data from Delta tables in a single thread in a single process.
- Read data from Delta tables using multiple threads in a single process.
- Build a complex connector for a distributed processing engine and read very large Delta tables.
- [soon!] Write to Delta tables from multiple threads / processes / distributed engines.
Reading a Delta table with Kernel APIs is as follows.
TableClient myTableClient = DefaultTableClient.create() ; // define a client
Table myTable = Table.forPath(myTableClient, "/delta/table/path"); // define what table to scan
Snapshot mySnapshot = myTable.getLatestSnapshot(myTableClient); // define which version of table to scan
Predicate scanFilter = ... // define the predicate
Scan myScan = mySnapshot.getScanBuilder(myTableClient) // specify the scan details
.withFilters(scanFilter)
.build();
Scan.readData(...) // returns the table data
Full example code can be found here.
For more information, refer to:
- User guide on step by step process of using Kernel in a standalone Java program or in a distributed processing connector.
- Slides explaining the rationale behind Kernel and the API design.
- Example Java programs that illustrate how to read Delta tables using the Kernel APIs.
- Table and default TableClient API Java documentation
This release of Delta contains the Kernel Table API and default TableClient API definitions and implementation which allow:
- Reading Delta tables with optional Deletion Vectors enabled or column mapping (name mode only) enabled.
- Partition pruning optimization to reduce the number of data files to read.
Welcome Delta Connectors to the Delta repository!
All previous connectors from https://github.com/delta-io/connectors have been moved to this repository (https://github.com/delta-io/delta) as we aim to unify our Delta connector ecosystem structure. This includes Delta-Standalone, Delta-Flink, Delta-Hive, PowerBI, and SQL-Delta-Import. The repository https://github.com/delta-io/connectors is now deprecated.
Delta Spark
Delta Spark 3.0.0 is built on top of Apache Spark™ 3.5. Similar to Apache Spark, we have released Maven artifacts for both Scala 2.12 and Scala 2.13. Note that the Delta Spark maven artifact has been renamed from delta-core to delta-spark.
- Documentation: https://docs.delta.io/3.0.0/index.html
- API documentation: https://docs.delta.io/3.0.0/delta-apidoc.html#delta-spark
- Maven artifacts: delta-spark_2.12, delta-spark_2.13, delta-contribs_2.12, delta_contribs_2.13, delta-storage, delta-storage-s3-dynamodb, delta-iceberg_2.12, delta-iceberg_2.13
- Python artifacts: https://pypi.org/project/delta-spark/3.0.0/
The key features of this release are:
- Support for Apache Spark 3.5
- Delta Universal Format - Write as Delta, read as Iceberg! See the highlighted section above.
- Up to 10x performance improvement of UPDATE using Deletion Vectors - Delta UPDATE operations now support writing Deletion Vectors. When enabled, the performance of UPDATEs will receive a significant boost.
- More than 2x performance improvement of DELETE using Deletion Vectors - This fix improves the file path canonicalization logic by avoiding calling expensive
Path.toUri.toString
calls for each row in a table, resulting in a several hundred percent speed boost on DELETE operations (only when Deletion Vectors have been enabled on the table). - Up to 2x faster MERGE operation - MERGE now better leverages data skipping, the ability to use the insert-only code path in more cases, and an overall improved execution to achieve up to 2x better performance in various scenarios.
- Support streaming reads from column mapping enabled tables when
DROP COLUMN
andRENAME COLUMN
have been used. This includes streaming support for Change Data Feed. See the documentation here for more details. - Support specifying the columns for which Delta will collect file-skipping statistics via the table property
delta.dataSkippingStatsColumns
. Previously, Delta would only collect file-skipping statistics for the first N columns in the table schema (default to 32). Now, users can easily customize this. - Support zero-copy convert to Delta from Iceberg tables on Apache Spark 3.5 using
CONVERT TO DELTA
. This feature was excluded from the Delta Lake 2.4 release since Iceberg did not yet support Apache Spark 3.4 (or 3.5). This command generates a Delta table in the same location and does not rewrite any parquet files. - Checkpoint V2 - Introduced a new Checkpoint V2 format in Delta Protocol Specification and implemented read/write support in Delta Spark. The new checkpoint v2 format provides more reliability over the existing v1 checkpoint format.
- Log Compactions - Introduced new log compaction files in Delta Protocol Specification which could be useful in reducing the frequency of Delta checkpoints. Added read support for log compaction files in Delta Spark.
- Safe casts enabled by default for UPDATE and MERGE operations - Delta UPDATE and MERGE operations now result in an error when values cannot be safely cast to the type in the target table schema. All implicit casts in Delta now follow
spark.sql.storeAssignmentPolicy
instead ofspark.sql.ansi.enabled
. - General Apache Spark catalog support for auxiliary commands – Several popular auxiliary commands now support general table resolution in Apache Spark. This simplifies the code and also makes it possible to use these commands with custom table catalogs based on Delta Lake tables. The following commands are now supported in this way: VACUUM, RESTORE TABLE, DESCRIBE DETAIL, DESCRIBE HISTORY, SHALLOW CLONE, OPTIMIZE.
Other notable changes include
- Fix for a bug in MERGE statements that contain a scalar subquery with non-deterministic results. Such a subquery can return different results during source materialization, while finding matches, and while writing modified rows. This can cause rows to be either dropped or duplicated.
- Fix for potential resource leak when DV file not found during parquet read
- Support protocol version downgrade
- Fix to initial preview release to support converting null partition values in UniForm
- Fix to WRITE command to not commit empty transactions, just like what DELETE, UPDATE, and MERGE commands do already
- Support 3-part table name identifier. Now, commands like
OPTIMIZE <catalog>.<db>.<tbl>
will work. - Performance improvement to CDF read queries scanning in batch to reduce the number of cloud requests and to reduce Spark scheduler pressure
- Fix for edge case in CDF read query optimization due to incorrect statistic value
- Fix for edge case in streaming reads where having the same file with different DVs in the same batch would yield incorrect results as the wrong file and DV pair would be read
- Prevent table corruption by disallowing
overwriteSchema
when partitionOverwriteMode is set to dynamic - Fix a bug where DELETE with DVs would not work on Column Mapping-enabled tables
- Support automatic schema evolution in structs that are inside maps
- Minor fix to Delta table path URI concatenation
- Support writing parquet data files to the
data
subdirectory via the SQL configurationspark.databricks.delta.write.dataFilesToSubdir
. This is used to add UniForm support on BigQuery.
Delta Flink
Delta-Flink 3.0.0 is built on top of Apache Flink™ 1.16.1.
- Documentation: https://github.com/delta-io/delta/tree/branch-3.0/connectors/flink
- API Documentation: https://docs.delta.io/3.0.0/api/java/flink/index.html
- Maven artifact: delta-flink
The key features of this release are
- Support for Flink SQL and Catalog. You can now use the Flink/Delta connector for Flink SQL jobs. You can CREATE Delta tables, SELECT data from them (uses the Delta Source), and INSERT new data into them (uses the Delta Sink). Note: for correct operations on the Delta tables, you must first configure the Delta Catalog using CREATE CATALOG before running a SQL command on Delta tables. For more information, please see the documentation here.
- Significant performance improvement to Global Committer initialization - The last-successfully-committed delta version by a given Flink application is now loaded lazily significantly reducing the CPU utilization in the most common scenarios.
Other notable changes include
- Fix a bug where Flink STRING types were incorrectly truncated to type VARCHAR with length 1
Delta Standalone
- Documentation: https://docs.delta.io/3.0.0/delta-standalone.html
- API Documentation: https://docs.delta.io/3.0.0/api/java/standalone/index.html
- Maven artifacts: delta-standalone_2.12, delta-standalone_2.13
The key features in this release are:
- Support for disabling Delta checkpointing during commits - For very large tables with millions of files, performing Delta checkpoints can become an expensive overhead during writes. Users can now disable this checkpointing by setting the hadoop configuration property
io.delta.standalone.checkpointing.enabled
tofalse
. This is only safe and suggested to do if another job will periodically perform the checkpointing. - Performance improvement to snapshot initialization - When a delta table is loaded at a particular version, the snapshot must contain, at a minimum, the latest protocol and metadata. This PR improves the snapshot load performance for repeated table changes.
- Support adding absolute paths to the Delta log - This now enables users to manually perform
SHALLOW CLONE
s and create Delta tables with external files. - Fix in schema evolution to prevent adding non-nullable columns to existing Delta tables
Credits
Adam Binford, Ahir Reddy, Ala Luszczak, Alex, Allen Reese, Allison Portis, Ami Oka, Andreas Chatzistergiou, Animesh Kashyap, Anonymous, Antoine Amend, Bart Samwel, Bo Gao, Boyang Jerry Peng, Burak Yavuz, CabbageCollector, Carmen Kwan, ChengJi-db, Christopher Watford, Christos Stavrakakis, Costas Zarifis, Denny Lee, Desmond Cheong, Dhruv Arya, Eric Maynard, Eric Ogren, Felipe Pessoto, Feng Zhu, Fredrik Klauss, Gengliang Wang, Gerhard Brueckl, Gopi Krishna Madabhushi, Grzegorz Kołakowski, Hang Jia, Hao Jiang, Herivelton Andreassa, Herman van Hovell, Jacek Laskowski, Jackie Zhang, Jiaan Geng, Jiaheng Tang, Jiawei Bao, Jing Wang, Johan Lasperas, Jonas Irgens Kylling, Jungtaek Lim, Junyong Lee, K.I. (Dennis) Jung, Kam Cheung Ting, Krzysztof Chmielewski, Lars Kroll, Lin Ma, Lin Zhou, Luca Menichetti, Lukas Rupprecht, Martin Grund, Min Yang, Ming DAI, Mohamed Zait, Neil Ramaswamy, Ole Sasse, Olivier NOUGUIER, Pablo Flores, Paddy Xu, Patrick Pichler, Paweł Kubit, Prakhar Jain, Pulkit Singhal, RunyaoChen, Ryan Johnson, Sabir Akhadov, Satya Valluri, Scott Sandre, Shixiong Zhu, Siying Dong, Son, Tathagata Das, Terry Kim, Tom van Bussel, Venki Korukanti, Wenchen Fan, Xinyi, Yann Byron, Yaohua Zhao, Yijia Cui, Yuhong Chen, Yuming Wang, Yuya Ebihara, Zhen Li, aokolnychyi, gurunath, jintao shen, maryannxue, noelo, panbingkun, windpiger, wwang-talend, sherlockbeard