Releases: delta-io/delta
Delta Lake 4.0.0
We are excited to announce the final release of Delta Lake 4.0.0! This release includes several exciting new features.
Highlights
- [Spark] Preview support for catalog-managed tables, a new table feature that transforms Delta Lake into a catalog-oriented lakehouse table format. This feature is still in the RFC stage, and as such, the protocol is still under development and is subject to change.
- [Spark] Delta Connect is an extension for Spark Connect which enables the usage of Delta over Spark Connect, allowing Delta to be used with the decoupled client-server architecture of Spark Connect.
- [Spark] Support for the Variant data type to enable semi-structured storage and data processing, for flexibility and performance.
- [Spark] Support a new DROP FEATURE implementation that allows dropping table features instantly without truncating history.
- [Kernel] Support for reading and writing version checksum.
- [Kernel] Support reading log compaction files for better performance during snapshot construction, and support writing log compaction files as a post commit hook.
- [Kernel] Support for the Clustered Table feature which enables defining and updating the clustering columns on a table.
- [Kernel] Support for writing to row tracking enabled tables.
- [Kernel] Support for writing file statistics to the Delta log when they are provided by the engine. This enables data skipping using query filters at read time.
Details by each component.
Sunset of Delta Standalone and dependent connectors
Currently, Delta Standalone and its dependent connectors, including Delta Flink and Delta Hive, are no longer under active development. Starting in Delta 4.0 we will not be releasing these projects as part of the 4.x Delta releases. These connectors are in maintenance mode and, going forward, will only receive critical security fixes and high-severity bug patches in the 3.x series. We are committed to a full transition from Delta Standalone to Delta Kernel and a future Kernel-based Flink connector.
Delta Spark
Delta Spark 4.0 is built on Apache Spark™ 4.0. Similar to Apache Spark, we have released Maven artifacts for Scala 2.13.
- Documentation: https://docs.delta.io/4.0.0/index.html
- API documentation: https://docs.delta.io/4.0.0/delta-apidoc.html#delta-spark
- Maven artifacts: delta-spark_2.13, delta-contribs_2.13, delta-storage, delta-storage-s3-dynamodb, delta-connect-client_2.13, delta-connect-common_2.13, delta-connect-server_2.13
- Python artifacts: https://pypi.org/project/delta-spark/4.0.0/
The key features of this release are:
- Delta Connect adds Spark Connect support to Scala and Python APIs of Delta Lake for Apache Spark. Spark Connect is a new project released in Apache Spark 4.0 that adds a decoupled client-server infrastructure which allows remote connectivity to Spark from anywhere. Delta Connect makes the DeltaTable interfaces compatible with the new Spark Connect protocol. For more information on how to use Delta Connect, see the Delta Connect documentation. Delta Connect is currently in preview.
- Preview support for catalog-managed tables: Delta Spark now supports reading from and writing to tables that have the `catalogOwned-preview` feature enabled. This feature allows a catalog to broker all commits to the table it manages, giving the catalog the control and visibility it needs to prevent invalid operations (e.g. commits that violate foreign key constraints), enforce security and access controls, and open the door for future performance optimizations. Currently write support includes `INSERT`, `MERGE INTO`, `UPDATE`, and `DELETE` operations.
- Note: this feature is still in the RFC stage, and as such, the protocol is still under development and is subject to change. The `catalogOwned-preview` feature should not be enabled for production tables and tables created with this preview feature enabled may not be compatible with future Delta Spark releases.
- Support for Variant data type: The Variant data type is a new Apache Spark data type. The Variant data type enables flexible and efficient processing of semi-structured data, without a user-specified schema. Variant data does not require a fixed schema on write. Instead, Variant data is queried using the schema-on-read approach. The Variant data type allows flexible ingestion by not requiring a write schema, and enables faster processing with the Spark Variant binary encoding format. This feature was originally released in preview as part of Delta 4.0.0 Preview; as of 4.0.0 it is no longer in preview. Please see the documentation and the example for more details.
- Preview support for shredded variants: Shredded variants are a storage optimization which allows for efficient sub-field extraction at the cost of higher write overhead, showing up to 20x read performance improvement. Shredded Variant data is stored according to the Parquet Variant Shredding specification. See the variantShredding RFC for more details.
- Note that this feature is in preview and that tables created with this preview feature enabled may not be compatible with future Delta Spark releases.
- Type Widening now supports a broader set of type changes and is no longer in preview. This feature allows you to change the data type of a column in your Delta table without rewriting the underlying data files. See the type widening documentation for a list of all supported type changes and additional information. Delta 3.3 or above is required to read tables with type widening enabled.
- Support dropping table features without truncating history: The current drop feature implementation requires the execution of the command twice with a 24 hour waiting time in between. In addition, it also results in the truncation of the history of the Delta table to the last 24 hours. The new `DROP FEATURE` implementation allows dropping features instantly without truncating history. Dropping a feature introduces a new writer feature to the table, the `checkpointProtection` feature.
- Dropping a feature with the new behavior can be achieved as follows: `ALTER TABLE table_name DROP FEATURE feature_name`
- We can still drop a feature with the old behavior as follows: `ALTER TABLE table_name DROP FEATURE feature_name TRUNCATE HISTORY`
- The `checkpointProtection` feature can be dropped with history truncation.
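The two DROP FEATURE forms described above can be sketched as a small Python helper that only builds the SQL strings. The table name `events` is hypothetical, and `typeWidening` is used simply because it is a feature named elsewhere in these notes; nothing here is part of the Delta API.

```python
def drop_feature_sql(table_name, feature_name, truncate_history=False):
    """Build an ALTER TABLE ... DROP FEATURE statement as a string."""
    statement = f"ALTER TABLE {table_name} DROP FEATURE {feature_name}"
    if truncate_history:
        # Old behavior described above: history is truncated.
        statement += " TRUNCATE HISTORY"
    return statement


# "events" is a hypothetical table name used only for illustration.
instant_drop = drop_feature_sql("events", "typeWidening")
legacy_drop = drop_feature_sql("events", "typeWidening", truncate_history=True)
# The checkpointProtection feature is dropped with history truncation.
drop_protection = drop_feature_sql(
    "events", "checkpointProtection", truncate_history=True
)

for statement in (instant_drop, legacy_drop, drop_protection):
    print(statement)
```

With a live session, each string would typically be passed to `spark.sql(...)`.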
Other notable changes include:
- Support dropping table features using the DeltaTable Scala/Python APIs with `deltaTable.dropFeatureSupport`.
- Support dropping the `deletionVector` table feature.
- Support DataFrameReader options to unblock non-additive schema changes when streaming.
- Invariant checks for DML commands to detect potential bugs in Delta or Spark earlier during execution and prevent committing the transaction in these cases.
- Support the `timestampdiff` and `timestampadd` expressions for generated columns.
- Support sorting within partitions when Z-ordering. This can be e...
Delta Lake 3.3.2
We are excited to announce the release of Delta Lake 3.3.2! This release contains several important bug fixes and improvements to the 3.3.1 release and it is recommended that users upgrade to 3.3.2.
Component specific bug fixes are detailed below.
Delta Spark
Delta Spark 3.3.2 is built on Apache Spark™ 3.5.3. Similarly to Apache Spark, we have released Maven artifacts for both Scala 2.12 and Scala 2.13.
- Maven artifacts: delta-spark_2.12, delta-spark_2.13, delta-contribs_2.12, delta-contribs_2.13, delta-storage, delta-storage-s3-dynamodb
- Python artifacts: https://pypi.org/project/delta-spark/3.3.2/
The key fixes in this release are:
- Fix to clean up stale checksum files during Metadata cleanup to improve table maintenance
Delta Kernel
- Maven artifacts: delta-kernel-api, delta-kernel-defaults
The key fixes in this release are:
- Kernel improvement to eliminate dependencies on package-private Parquet classes for better compatibility with JVM environments with multiple class loaders.
Other projects
Delta Flink (Delta-Standalone based)
- Maven artifacts: delta-flink
The key fixes in this release are:
- Flink fix to correct mapping between Delta's BinaryType and Flink's data types for improved type compatibility.
Credits
Dhruv Arya, Prakhar Jain, Venkateshwar Korukanti, Scott Sandre
Delta Lake 3.3.1
We are excited to announce the release of Delta Lake 3.3.1! This release contains a few bug fixes to the 3.3.0 release and it is recommended that users upgrade to 3.3.1.
Component specific bug fixes are detailed below.
Delta Spark
Delta Spark 3.3.1 is built on Apache Spark™ 3.5.3. Similarly to Apache Spark, we have released Maven artifacts for both Scala 2.12 and Scala 2.13.
- Documentation: https://docs.delta.io/3.3.1/index.html
- API documentation: https://docs.delta.io/3.3.1/delta-apidoc.html#delta-spark
- Maven artifacts: delta-spark_2.12, delta-spark_2.13, delta-contribs_2.12, delta-contribs_2.13, delta-storage, delta-storage-s3-dynamodb
- Python artifacts: https://pypi.org/project/delta-spark/3.3.1/
The key fixes in this release are:
- Fix to allow a user-specified schema on read if it is consistent with the table schema
- Documentation update for Row Tracking to include Row Tracking Backfill introduced in Delta 3.3
Delta Kernel
- API documentation: https://docs.delta.io/3.3.1/api/java/kernel/index.html
- Maven artifacts: delta-kernel-api, delta-kernel-defaults
The key fixes in this release are:
- Kernel fix to handle non-uniform value types in map[string, string] in delta commit files
Other projects
No fixes or changes were made in the components below in this release but the corresponding artifacts are listed.
Delta Universal Format (UniForm)
- Documentation: https://docs.delta.io/3.3.1/delta-uniform.html
- Maven artifacts: delta-iceberg_2.12, delta-iceberg_2.13, delta-hudi_2.12, delta-hudi_2.13
Delta Sharing Spark
- Documentation: https://docs.delta.io/3.3.1/delta-sharing.html
- Artifacts: delta-sharing-spark_2.12, delta-sharing-spark_2.13
Credits
Wenchen Fan, Thang Long Vu
Delta Lake 3.3.0
We are excited to announce the release of Delta Lake 3.3.0! This release includes several exciting new features.
Highlights
- [Delta Spark] Support for Identity Column to assign unique values for each record inserted into a table.
- [Delta Spark] Support VACUUM LITE to deliver faster VACUUM for periodically run VACUUM commands.
- [Delta Spark] Support for Row Tracking Backfill to alter an existing table to enable Row Tracking. Row Tracking allows engines such as Spark to track row-level lineage in Delta Lake tables.
- [Delta Spark] Support for enhanced table state validation with version checksums and improved Snapshot initialization performance based on this checksum.
- [Delta UniForm] Support for enabling UniForm Iceberg on existing tables without rewriting the data files using ALTER TABLE.
- [Delta Kernel] Support for reading Delta tables that have Type Widening enabled.
Details by each component.
Delta Spark
Delta Spark 3.3.0 is built on Apache Spark™ 3.5.3. Similarly to Apache Spark, we have released Maven artifacts for both Scala 2.12 and Scala 2.13.
- Documentation: https://docs.delta.io/3.3.0/index.html
- API documentation: https://docs.delta.io/3.3.0/delta-apidoc.html#delta-spark
- Maven artifacts: delta-spark_2.12, delta-spark_2.13, delta-contribs_2.12, delta-contribs_2.13, delta-storage, delta-storage-s3-dynamodb
- Python artifacts: https://pypi.org/project/delta-spark/3.3.0/
The key features of this release are:
- Support for Identity Column: Delta Lake identity columns are a type of generated column that automatically assigns unique values to each record inserted into a table. Users do not need to explicitly provide values for these columns during data insertion. They offer a straightforward and efficient mechanism to generate unique keys for table rows, combining ease of use with high performance. See the documentation for more information.
- Support VACUUM LITE to deliver faster VACUUM for periodically run VACUUM commands. When running VACUUM in LITE mode, instead of finding all files in the table directory, VACUUM LITE uses the Delta transaction log to identify and remove files no longer referenced by any table versions within the retention duration.
- Support for Row Tracking Backfill: The Row Tracking feature can now be used on existing Delta Lake tables to track row-level lineage in Delta Spark; previously it was only possible for new tables. Users can now use the ALTER TABLE table_name SET TBLPROPERTIES (delta.enableRowTracking = true) syntax to alter an existing table to enable Row Tracking. When enabled, users can identify rows across multiple versions of the table and can access this tracking information using the two metadata fields `_metadata.row_id` and `_metadata.row_commit_version`. Refer to the documentation on Row Tracking for more information and examples.
- Delta Lake now generates version checksums for each table commit, providing stronger consistency guarantees and improved debugging capabilities. It tracks detailed table metrics including file counts, table size, data distribution histograms, etc. This enables automatic detection of potential state inconsistencies and helps maintain table integrity in distributed environments. The state validation is performed on every checkpoint. The Checksum is also used to bypass the initial Spark query that retrieves the Protocol and Metadata actions, resulting in a decreased snapshot initialization latency.
- Liquid clustering updates:
- Support OPTIMIZE FULL to fully recluster a Liquid table. This command optimizes all records in a table that uses liquid clustering, including data that might have previously been clustered.
- Support enabling liquid clustering on an existing unpartitioned Delta table using ALTER TABLE <table_name> CLUSTER BY (<clustering_columns>). Previously, liquid clustering could only be enabled upon table creation.
- Support creating clustered table from an external location
- The In-Commit Timestamp table feature is no longer in preview: When enabled, this feature persists monotonically increasing timestamps within Delta commits, ensuring they are not affected by file operations. With this, time travel queries yield consistent results, even if the table directory is relocated. This feature was available as a preview feature in Delta 3.2 and is now generally available in Delta 3.3. See the documentation for more information.
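As a rough illustration of the Row Tracking Backfill and liquid clustering statements quoted in the list above, the following Python sketch only assembles the SQL strings. The table name `events` and the column names are hypothetical placeholders, and the helper functions are not part of any Delta API.

```python
def enable_row_tracking_sql(table_name):
    """ALTER TABLE statement that enables Row Tracking on an existing table."""
    return (
        f"ALTER TABLE {table_name} "
        "SET TBLPROPERTIES (delta.enableRowTracking = true)"
    )


def row_lineage_query(table_name, columns):
    """SELECT that adds the two Row Tracking metadata fields to the output."""
    selected = list(columns) + ["_metadata.row_id", "_metadata.row_commit_version"]
    return f"SELECT {', '.join(selected)} FROM {table_name}"


def enable_liquid_clustering_sql(table_name, clustering_columns):
    """ALTER TABLE statement that enables liquid clustering on an existing table."""
    return f"ALTER TABLE {table_name} CLUSTER BY ({', '.join(clustering_columns)})"


# "events", "id", "payload" and "event_date" are hypothetical names.
backfill = enable_row_tracking_sql("events")
lineage = row_lineage_query("events", ["id", "payload"])
recluster = enable_liquid_clustering_sql("events", ["event_date", "id"])

for statement in (backfill, lineage, recluster):
    print(statement)
```

With a live session, each string would typically be passed to `spark.sql(...)`.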
Other notable changes include:
- Protocol upgrade/downgrade improvements
- Support dropping table features for columnMapping, vacuumProtocolCheck, and checkConstraints.
- Improve table protocol transitions to simplify the critical user journey (CUJ) when altering the table protocol.
- Support protocol version downgrades when the existing table features exist in the lower protocol version.
- Update protocol upgrades behavior such that when enabling a legacy feature via a table property (e.g. setting `delta.enableChangeDataFeed=true`) the protocol is upgraded to (1,7) and only the legacy feature is enabled. Previously the minimum protocol version would be selected and all preceding legacy features enabled.
- Support enabling a table feature on a table using the Python DeltaTable API with `deltaTable.addFeatureSupport(...)`.
- Type-widening improvements
- Support automatic type widening in Delta Sink when type widening is enabled on the table and schema evolution is enabled on the sink.
- Support type widening on nested fields when other nested fields in the same struct are referenced by check constraints or generated column expressions.
- Fix type-widening operation validation for map, array or struct columns used in generated column expressions or check constraints.
- Fix to directly read the file schema from the parquet footers when identifying the files to be rewritten when dropping the type widening table feature.
- Fix using type widening on a table containing a char/varchar column.
- Liquid clustering improvements
- Fix liquid clustering to automatically fall back to Z-order clustering when clustering on a single column. Previously, any attempts to optimize the table would fail.
- Support RESTORE on clustered tables. Previously, RESTORE operations would not restore clustering metadata.
- Support SHOW TBLPROPERTIES for clustered tables.
- Support for partition-like data skipping filters (preview): When enabled by setting `spark.databricks.delta.skipping.partitionLikeFilters.enabled`, applies arbitrary data skipping filters referencing Liquid clustering columns to files with the same min and max values on clustering columns. This may decrease the files scanned for selective queries on large Liquid tables.
- Performance improvements
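The protocol upgrade change listed above (enabling a legacy feature now yields protocol (1,7) with only that feature) can be modeled with a short Python sketch. It assumes a table that starts at the lowest protocol and assumes the legacy feature-to-writer-version mapping from the Delta protocol specification, restricted here to features that need only reader version 1. The function names and returned dictionaries are illustrative and do not mirror Delta's internal representation.

```python
# Assumed mapping: minimum writer version that introduced each legacy feature
# (subset limited to features that require only reader version 1).
LEGACY_WRITER_VERSION = {
    "appendOnly": 2,
    "invariants": 2,
    "checkConstraints": 3,
    "changeDataFeed": 4,
    "generatedColumns": 4,
}


def upgrade_previous_behavior(feature):
    """Pick the minimum writer version; all preceding legacy features come along."""
    writer_version = LEGACY_WRITER_VERSION[feature]
    implied = sorted(
        name for name, version in LEGACY_WRITER_VERSION.items()
        if version <= writer_version
    )
    return {"protocol": (1, writer_version), "enabled_features": implied}


def upgrade_new_behavior(feature):
    """Upgrade to (1,7) and enable only the requested legacy feature."""
    return {"protocol": (1, 7), "enabled_features": [feature]}


before = upgrade_previous_behavior("changeDataFeed")
after = upgrade_new_behavior("changeDataFeed")
print(before)
print(after)
```

Under these assumptions, the previous behavior implicitly enables several unrelated features, while the new behavior enables exactly one.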
Delta Lake 3.2.1
We are excited to announce the release of Delta Lake 3.2.1! This release contains important bug fixes to 3.2.0 and it is recommended that users upgrade to 3.2.1.
Details by each component.
Delta Spark
Delta Spark 3.2.1 is built on Apache Spark™ 3.5.3. Similar to Apache Spark, we have released Maven artifacts for both Scala 2.12 and Scala 2.13.
- Documentation: https://docs.delta.io/3.2.1/index.html
- API documentation: https://docs.delta.io/3.2.1/delta-apidoc.html#delta-spark
- Artifacts: delta-spark_2.12, delta-spark_2.13, delta-contribs_2.12, delta-contribs_2.13, delta-storage, delta-storage-s3-dynamodb
The key changes of this release are:
- Support for Apache Spark™ 3.5.3.
- Fix MERGE operation not being recorded in QueryExecutionListener when submitted through Scala/Python API.
- Support RESTORE on a Delta table with clustering enabled
- Fix replacing a clustered table with a non-clustered table.
- Fix an issue when running clustering on a table with a single column selected as the clustering column.
Delta Universal Format (UniForm)
- Documentation: https://docs.delta.io/3.2.1/delta-uniform.html
- Artifacts: delta-iceberg_2.12, delta-iceberg_2.13, delta-hudi_2.12, delta-hudi_2.13
The key changes of this release are:
- Added support for enabling UniForm Iceberg on existing Delta tables using ALTER TABLE instead of REORG, which rewrites data files.
- Fixed a bug in the UniForm Iceberg conversion transaction so that it no longer converts commits that contain only AddFiles without data change.
Delta Sharing Spark
- Documentation: https://docs.delta.io/3.2.1/delta-sharing.html
- Artifacts: delta-sharing-spark_2.12, delta-sharing-spark_2.13
The key changes of this release are:
- Upgrade delta-sharing-client to version 1.1.1 which removes the pre-signed URL address from the error message on access errors.
- Fix an issue with DeltaSharingLogFileStatus
Delta Kernel
- API documentation: https://docs.delta.io/3.2.1/delta-kernel.html
- Artifacts: delta-kernel-api, delta-kernel-defaults
The key changes of this release are:
- Fix comparison issues with string values having characters with surrogate pairs. This fixes a corner case with wrong results when comparing characters (e.g. emojis) that have surrogate pairs in UTF-16 representation.
- Fix ClassNotFoundException issue when loading LogStores in Kernel default Engine module. This issue happens in some environments where the thread local class loader is not set.
- Fix error when querying tables with spaces in the path name. Now you can query tables with paths having any valid path characters.
- Fix an issue with writing decimals as binary to Parquet files for certain scale and precision values.
- Throw proper exception when unsupported VOID data type is encountered in Delta tables when reading.
- Handle long type values in field metadata of columns in schema. Earlier Kernel was throwing a parsing exception, now Kernel handles long types.
- Fix an issue where Kernel retries multiple times when _last_checkpoint file is not found. Now Kernel tries just once when file not found exception is thrown.
- Support reading Parquet files with legacy map type physical formats. Earlier Kernel used to throw errors, now Kernel can read data from file containing legacy map physical formats.
- Support reading Parquet files with legacy 3-level repeated type physical formats.
- Write timestamp data to Parquet file as INT64 physical format instead of INT96 physical format. INT96 is a legacy physical format that is deprecated.
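The surrogate-pair comparison fix in the list above concerns strings whose ordering differs depending on whether they are compared by UTF-16 code units or by Unicode code points. The following Python sketch shows one such pair; it only illustrates why the two orderings can disagree and does not reproduce Kernel's code. The helper name `utf16_code_units` is made up for this example.

```python
def utf16_code_units(text):
    """Return the UTF-16 code units of a string as a list of integers."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


halfwidth = "\uff61"     # U+FF61, a single UTF-16 code unit
emoji = "\U0001F600"     # U+1F600, stored as the surrogate pair D83D DE00

# Python compares str values by code point: U+FF61 sorts before U+1F600.
by_code_point = halfwidth < emoji

# Comparing UTF-16 code units flips the order, because the high surrogate
# 0xD83D is numerically smaller than 0xFF61.
by_utf16_units = utf16_code_units(halfwidth) < utf16_code_units(emoji)

# UTF-8 byte order agrees with code point order for this pair.
by_utf8_bytes = halfwidth.encode("utf-8") < emoji.encode("utf-8")

print(by_code_point, by_utf16_units, by_utf8_bytes)
```

A comparator that walks UTF-16 code units naively would therefore order this pair differently from one that compares code points or UTF-8 bytes.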
For more information, refer to:
- User guide on step-by-step process of using Kernel in a standalone Java program or in a distributed processing connector.
- Slides explaining the rationale behind Kernel and the API design.
- Example Java programs that illustrate how to read Delta tables using the Kernel APIs.
- Table and default Engine API Java documentation
Delta Standalone (deprecated in favor of Delta Kernel)
- API documentation: https://docs.delta.io/3.2.1/delta-standalone.html
- Artifacts: delta-standalone_2.12, delta-standalone_2.13
This release does not update Standalone. Standalone is being deprecated in favor of Delta Kernel, which supports advanced features in Delta tables.
Delta Storage
- Artifacts: delta-storage, delta-storage-s3-dynamodb
The key changes of this release are:
- Fix an issue with VACUUM when using the S3DynamoDBLogStore where the LogStore made unnecessary listFrom calls to DynamoDB, causing a ProvisionedThroughputExceededException
Credits
Abhishek Radhakrishnan, Allison Portis, Charlene Lyu, Fred Storage Liu, Jiaheng Tang, Johan Lasperas, Lin Zhou, Marko Ilić, Scott Sandre, Tathagata Das, Tom van Bussel, Venki Korukanti, Wenchen Fan, Zihao Xu
Delta Lake 4.0.0 Preview
We are excited to announce the preview release of Delta Lake 4.0.0 on the preview release of Apache Spark 4.0.0! This release gives a preview of the following exciting new features.
- Support for Spark Connect (aka Delta Connect): an extension for Spark Connect which enables the usage of Delta over Spark Connect, allowing Delta to be used with the decoupled client-server architecture of Spark Connect.
- Support for Type Widening to allow users to change the type of columns without having to rewrite data.
- Support for the Variant data type to enable semi-structured storage and data processing, for flexibility and performance.
- Support for Coordinated Commits table feature which makes the commit protocol very flexible and allows reliable multi-cloud and multi-engine writes.
Read below for more details. In addition, a few existing artifacts are unavailable in this release; they are listed at the end.
Delta Spark
Delta Spark 4.0 preview is built on Apache Spark™ 4.0.0-preview1. Similar to Apache Spark, we have released Maven artifacts for Scala 2.13.
- Documentation: https://docs.delta.io/4.0.0-preview/index.html
- Maven artifacts: delta-spark_2.13, delta-contribs_2.13, delta-storage, delta-storage-s3-dynamodb, delta-connect-client_2.13, delta-connect-common_2.13, delta-connect-server_2.13
- Python artifacts: https://pypi.org/project/delta-spark/4.0.0rc1/
The key features of this release are:
- Support for Spark Connect (aka Delta Connect): Spark Connect is a new initiative in Apache Spark that adds a decoupled client-server infrastructure which allows Spark applications to connect remotely to a Spark server and run SQL / Dataframe operations. Delta Connect allows Delta operations to be made in applications running in such client-server mode. For more information on how to use Delta Connect see the Delta Connect documentation.
- Support for Coordinated Commits: Coordinated Commits is a new writer table feature which allows users to designate a “Commit Coordinator” for their Delta table. A commit coordinator is an entity with a unique identifier which maintains information about commits. Once a commit coordinator has been set for a table, all writes to the table must be coordinated through it. This single point of ownership of commits for the table makes cross-environment (e.g. cross cloud) writes safe. Examples of Commit Coordinators are catalogs (Hive Metastore, Unity Catalog, etc.), DynamoDB, or any system which can implement the commit coordinator API. This release also adds a DynamoDB Commit Coordinator which can use a DynamoDB table to coordinate commits for a table. Delta tables with commit coordinators are still readable through the object storage paths, making reads backward compatible. See the Delta Coordinated Commits documentation for more details.
- Support for Type Widening: Delta Spark can now change the type of a column to a wider type using the `ALTER TABLE t CHANGE COLUMN col TYPE type` command or with schema evolution during `MERGE` and `INSERT` operations. See the type widening documentation for a list of all supported type changes and additional information. The table will be readable by Delta 4.0 readers without requiring the data to be rewritten. For compatibility with older versions, a rewrite of the data can be triggered using the `ALTER TABLE t DROP FEATURE 'typeWidening'` command.
- Support for Variant data type: The Variant data type is a new Apache Spark data type. The Variant data type enables flexible, and efficient processing of semi-structured data, without a user-specified schema. Variant data does not require a fixed schema on write. Instead, Variant data is queried using the schema-on-read approach. The Variant data type allows flexible ingestion by not requiring a write schema, and enables faster processing with the Spark Variant binary encoding format. Please see the documentation and the example for more details.
Other notable changes include:
- Support protocol version downgrades when the existing table features exist in the lower protocol version.
- Support dropping table features for columnMapping and vacuumProtocolCheck.
- Support `CREATE TABLE LIKE` with user provided properties. Previously any properties that were provided in the SQL command were ignored and only the properties from the source table were used.
- Fix liquid clustering to automatically fall back to Z-order clustering when clustering on a single column. Previously, any attempts to optimize the table would fail.
- Pushdown query filters when reading CDF so the filters can be used for partition pruning and row group skipping.
- Improve the performance of finding the last complete checkpoint with more efficient file listing.
- Fix a bug where providing a query filter that compares two `Literal` expressions would cause an infinite loop when constructing data skipping filters.
- Fix In-Commit Timestamps to use `clock.currentTimeMillis()` instead of `System.nanoTime()` for large commits since some systems return a very small number when `System.nanoTime()` is called.
- Fix streaming CDF queries to not read log entries beyond `endOffset` for reduced processing time.
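The In-Commit Timestamps fix in the list above switches from a monotonic nanosecond counter to a wall-clock reading. A Python analogue, offered only as an illustration and not as Delta's implementation, shows the difference: `time.time()` is measured from the Unix epoch, whereas `time.monotonic_ns()` has an undefined reference point and is only meaningful as a difference between two calls.

```python
import time


def wall_clock_millis():
    """Milliseconds since the Unix epoch; suitable as a commit timestamp."""
    return int(time.time() * 1000)


def monotonic_nanos():
    """Nanoseconds from an arbitrary origin; only differences are meaningful."""
    return time.monotonic_ns()


commit_timestamp_ms = wall_clock_millis()

# A monotonic reading is still the right tool for measuring elapsed time.
started = monotonic_nanos()
elapsed_ns = monotonic_nanos() - started

print(commit_timestamp_ms, elapsed_ns)
```

Because the monotonic origin is arbitrary, its raw value can be very small on some systems, which is consistent with the symptom described in the fix.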
More features to come in the final release of Delta 4.0!
Delta Kernel Java
- Maven artifacts: delta-kernel-api, delta-kernel-defaults
The Delta Kernel project is a set of Java and Rust libraries for building Delta connectors that can read and write to Delta tables without the need to understand the Delta protocol details.
This release of Delta Kernel Java contains the following changes:
- Write timestamps using the `INT64` physical format in Parquet in the `DefaultParquetHandler`. Previously they were written as `INT96` which is an outdated and deprecated format for timestamps.
- Lazily evaluate comparator expressions in the `DefaultExpressionHandler`. Previously expressions would be eagerly evaluated for every row in the underlying vectors.
- Support SQL expression `LIKE` in the `DefaultExpressionHandler`.
- Support legacy Parquet schemas for map type and array type in the `DefaultParquetHandler`.
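The difference between eager and lazy evaluation of comparator expressions mentioned in the list above can be sketched in Python. The two classes below are hypothetical stand-ins written for this illustration; they are not Kernel's `DefaultExpressionHandler` or its vector types.

```python
class EagerLessThan:
    """Evaluates left < right for every row as soon as it is constructed."""

    def __init__(self, left, right):
        self.results = [a < b for a, b in zip(left, right)]
        self.evaluated = len(self.results)

    def get(self, row):
        return self.results[row]


class LazyLessThan:
    """Evaluates left < right only for the rows that are actually requested."""

    def __init__(self, left, right):
        self.left = left
        self.right = right
        self.evaluated = 0

    def get(self, row):
        self.evaluated += 1
        return self.left[row] < self.right[row]


left_column = [1, 5, 3, 8]
right_column = [2, 4, 3, 9]

eager = EagerLessThan(left_column, right_column)
lazy = LazyLessThan(left_column, right_column)

# Only one row is needed, for example because other rows were filtered out.
print(eager.get(3), eager.evaluated)
print(lazy.get(3), lazy.evaluated)
```

Both variants return the same answer for a given row; the lazy one simply skips work for rows that are never read.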
In addition to the above Delta Kernel Java changes, Delta Kernel Rust released its first version 0.1, which is available at https://crates.io/crates/delta_kernel.
Limitations
The following features from Delta 3.2 are not supported in this preview release. We are working with the community to address the following gaps by the final release of Delta 4.0:
- In Delta Spark, UniForm with Iceberg and Hudi is not yet available because they do not yet support Spark 4.0.
- Delta Flink, Delta Standalone, and Delta Hive are not available yet.
Credits
Abhishek Radhakrishnan, Allison Portis, Ami Oka, Andreas Chatzistergiou, Anish, Carmen Kwan, Chirag Singh, Christos Stavrakakis, Dhruv Arya, Felipe Pessoto, ...
Delta Lake 3.2.0
We are excited to announce the release of Delta Lake 3.2.0! This release includes several exciting new features.
Highlights
- Support for Liquid clustering to reduce write amplification using incremental clustering.
- Preview support for Type Widening to allow users to change the type of columns without having to rewrite data.
- Preview support for Apache Hudi in Delta UniForm tables.
Delta Spark
Delta Spark 3.2.0 is built on Apache Spark™ 3.5. Similar to Apache Spark, we have released Maven artifacts for both Scala 2.12 and Scala 2.13.
- Documentation: https://docs.delta.io/3.2.0/index.html
- API documentation: https://docs.delta.io/3.2.0/delta-apidoc.html#delta-spark
- Maven artifacts: delta-spark_2.12, delta-spark_2.13, delta-contribs_2.12, delta-contribs_2.13, delta-storage, delta-storage-s3-dynamodb, delta-iceberg_2.12, delta-iceberg_2.13
- Python artifacts: https://pypi.org/project/delta-spark/3.2.0/
The key features of this release are:
- Support for Liquid clustering: This allows for incremental clustering based on ZCubes and reduces the write amplification by not touching files already well clustered (i.e., files in stable ZCubes). Users can now use the ALTER TABLE CLUSTER BY syntax to change clustering columns and use the DESCRIBE DETAIL command to check the clustering columns. In addition, Delta Spark now supports DeltaTable `clusterBy` API in both Python and Scala to allow creating clustered tables using DeltaTable API. See the documentation and examples for more information.
- Preview support for Type Widening: Delta Spark can now change the type of a column from `byte` to `short` to `integer` using the ALTER TABLE t CHANGE COLUMN col TYPE type command or with schema evolution during MERGE and INSERT operations. The table remains readable by Delta 3.2 readers without requiring the data to be rewritten. For compatibility with older versions, a rewrite of the data can be triggered using the `ALTER TABLE t DROP FEATURE 'typeWidening-preview'` command.
- Note that this feature is in preview and that tables created with this preview feature enabled may not be compatible with future Delta Spark releases.
- Support for Vacuum Inventory: Delta Spark now extends the VACUUM SQL command to allow users to specify an inventory table in a VACUUM command. When an inventory table is provided, VACUUM will consider the files listed there instead of doing the full listing of the table directory, which can be time consuming for very large tables. See the docs here.
- Support for Vacuum Writer Protocol Check: Delta Spark can now support `vacuumProtocolCheck` ReaderWriter feature which ensures consistent application of reader and writer protocol checks during `VACUUM` operations, addressing potential protocol discrepancies and mitigating the risk of data corruption due to skipped writer checks.
- Preview support for In-Commit Timestamps: When enabled, this preview feature persists monotonically increasing timestamps within Delta commits, ensuring they are not affected by file operations. When enabled, time travel queries will yield consistent results, even if the table directory is relocated.
- Note that this feature is in preview and that tables created with this preview feature enabled may not be compatible with future Delta Spark releases.
- Deletion Vectors Read Performance Improvements: Two improvements were introduced to DVs in Delta 3.2.
- Removing broadcasting of DV information to executors: This work improves stability by reducing drivers’ memory consumption, preventing potential Driver OOM for very large Delta tables like 1TB+. This work also improves performance by saving us fixed broadcasting overhead in reading small Delta Tables.
- Supporting predicate pushdown and splitting in scans with DVs: Improving performance of DV reads with filter queries thanks to predicate pushdown and splitting. This feature yields a 2x performance improvement on average.
- Support for Row Tracking: Delta Spark can now write to tables that maintain information that allows identifying rows across multiple versions of a Delta table. Delta Spark can now also access this tracking information using the two metadata fields _metadata.row_id and _metadata.row_commit_version.
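As an illustration of the In-Commit Timestamps item above, the following toy Python model shows one way a writer could keep commit timestamps monotonically increasing even when the wall clock moves backwards. The function name next_in_commit_timestamp and the max(clock, previous + 1) rule are assumptions made for this sketch; they are not taken from these release notes or from the Delta Spark code base.

```python
from typing import List, Optional


def next_in_commit_timestamp(previous_ts_ms: Optional[int], wall_clock_ms: int) -> int:
    """Return a timestamp that is strictly greater than the previous commit's.

    Hypothetical rule for illustration only: use the wall clock when it is
    ahead of the previous commit, otherwise bump the previous value by 1 ms.
    """
    if previous_ts_ms is None:
        # First commit of the table: nothing to compare against.
        return wall_clock_ms
    return max(wall_clock_ms, previous_ts_ms + 1)


# Simulated wall-clock readings; the clock jumps backwards at the third commit.
clock_readings: List[int] = [1000, 1005, 1003, 1003, 1010]

timestamps: List[int] = []
previous: Optional[int] = None
for reading in clock_readings:
    previous = next_in_commit_timestamp(previous, reading)
    timestamps.append(previous)

print(timestamps)
```

Because each timestamp is stored with the commit in this model, it does not change when files are copied or the table directory is moved, which is the property time travel by timestamp relies on.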
Other notable changes include:
- Delta Sharing: reduce the minimum RPC interval in delta sharing streaming from 30 seconds to 10 seconds
- Improve the performance of write operations by skipping collecting commit stats
- New SQL configurations to specify Delta Log cache size (spark.databricks.delta.delta.log.cacheSize) and retention duration (spark.databricks.delta.delta.log.cacheRetentionMinutes)
- Fix bug in plan validation due to inconsistent field metadata in MERGE
- Improved metrics during VACUUM for better visibility
- Hive Metastore schema sync: The truncation threshold for schemas with long fields is now user configurable
Delta Universal Format (UniForm)
- Documentation: https://docs.delta.io/3.2.0/delta-uniform.html
- Maven artifacts: delta-iceberg_2.12, delta-iceberg_2.13, delta-hudi_2.12, delta-hudi_2.13
Hudi is now supported by Delta Universal format in addition to Iceberg. Writing to a Delta UniForm table can generate Hudi metadata, alongside Delta. This feature is contributed by XTable.
Create a UniForm-enabled table that automatically generates Hudi metadata using the following command:
CREATE TABLE T (c1 INT) USING DELTA TBLPROPERTIES ('delta.universalFormat.enabledFormats' = 'hudi');
See the documentation here for more details.
Other notable changes include:
- Throw a better error if Iceberg conversion fails during initial sync
- Fix a bug in Delta Universal Format to support correct table overwrites
Delta Kernel
- API documentation: https://docs.delta.io/3.2.0/api/java/kernel/index.html
- Maven artifacts: delta-kernel-api, delta-kernel-defaults
The Delta Kernel project is a set of Java libraries (Rust will be coming soon!) for building Delta connectors that can read (and, soon, write to) Delta tables without the need to understand the Delta protocol details. In this release, we improved the read support to make it production-ready by adding numerous performance improvements, additional functionality, and improved protocol support.
- Support for time travel. Now you can read a table snapshot at a version id or snapshot at a timestamp.
- Improved Delta protocol support.
- Support for reading tab...
Delta Lake 3.1.0
We are excited to announce the release of Delta Lake 3.1.0. This release includes several exciting new features.
Few Highlights
- Delta-Spark: Support for merge with deletion vectors to reduce the write overhead for merge operations. This feature improves the performance of merge severalfold.
- Delta-Spark: Support for optimizing min/max aggregation queries using the table metadata, which improves the performance of simple aggregation queries (e.g., SELECT min(x) FROM deltaTable) by up to 100x.
- Delta-Spark: Support for querying tables shared through Delta Sharing protocol.
- Kernel: Support for data skipping for given query predicates to reduce the number of files read during the table scan.
- UniForm: Enhanced Iceberg support for Delta tables, including MAP and LIST types, and ease-of-use improvements for enabling UniForm on a Delta table.
- Delta-Flink: Reduced Flink write job startup latency using Kernel.
Details by each component.
Delta Spark
Delta Spark 3.1.0 is built on Apache Spark™ 3.5. Similar to Apache Spark, we have released Maven artifacts for both Scala 2.12 and Scala 2.13.
- Documentation: https://docs.delta.io/3.1.0/index.html
- API documentation: https://docs.delta.io/3.1.0/delta-apidoc.html#delta-spark
- Maven artifacts: delta-spark_2.12, delta-spark_2.13, delta-contribs_2.12, delta-contribs_2.13, delta-storage, delta-storage-s3-dynamodb, delta-iceberg_2.12, delta-iceberg_2.13
- Python artifacts: https://pypi.org/project/delta-spark/3.1.0/
The key features of this release are:
- Support for merge with deletion vectors to reduce the write overhead for merge operations. This feature improves the performance of merge severalfold. Refer to the documentation on deletion vectors for more information.
- Support for optimizing min/max aggregation queries using the table metadata, which improves the performance of simple aggregation queries (e.g., SELECT min(x) FROM deltaTable) by up to 100x.
- (Preview) Liquid clustering for better table layout. Delta now allows clustering the data in a Delta table for better data skipping. Currently this is an experimental feature. See documentation and example for how to try out this feature.
- Support for DEFAULT value columns. Delta supports defining default expressions for columns on Delta tables. Delta will generate default values for columns when users do not explicitly provide values for them when writing to such tables, or when the user explicitly specifies the DEFAULT SQL keyword for any such column. See documentation on how to enable this feature and try it out.
- Support for Hive Metastore schema sync. Adds a mechanism for syncing the table schema to HMS. External tools can now directly consume the schema from HMS instead of accessing it from the Delta table directory. See the documentation on how to enable this feature.
- Auto compaction to address the small files problem during table writes. Auto compaction, which runs at the end of the write query, combines small files within partitions into larger files to reduce the metadata size and improve query performance. See the documentation for details on how to enable this feature.
- Optimized write is an optimization that repartitions and rebalances data before writing them out to a Delta table. Optimized writes improve file size and reduce the small file problem as data is written and benefit subsequent reads on the table. See the documentation for details on how to enable this feature.
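The min/max aggregation optimization above can be pictured with a toy model: if every data file carries minimum and maximum statistics for a column, a query such as SELECT min(x) FROM deltaTable can be answered from those statistics alone, without opening any data file. The names FileMinMax, min_from_metadata and max_from_metadata are hypothetical and exist only in this sketch; a real engine also has to handle cases this model ignores, such as files with missing statistics.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class FileMinMax:
    """Per-file statistics for a single column x (toy model)."""
    path: str
    num_records: int
    min_x: int
    max_x: int


def min_from_metadata(files: List[FileMinMax]) -> Optional[int]:
    """Answer min(x) by combining per-file minimums; no data is read."""
    non_empty = [f for f in files if f.num_records > 0]
    if not non_empty:
        return None  # an empty table has no minimum
    return min(f.min_x for f in non_empty)


def max_from_metadata(files: List[FileMinMax]) -> Optional[int]:
    """Answer max(x) by combining per-file maximums; no data is read."""
    non_empty = [f for f in files if f.num_records > 0]
    if not non_empty:
        return None
    return max(f.max_x for f in non_empty)


table_files = [
    FileMinMax("part-0.parquet", num_records=100, min_x=10, max_x=50),
    FileMinMax("part-1.parquet", num_records=250, min_x=3, max_x=20),
    FileMinMax("part-2.parquet", num_records=75, min_x=42, max_x=99),
]

table_min = min_from_metadata(table_files)
table_max = max_from_metadata(table_files)
print(table_min, table_max)
```

The cost of this computation depends on the number of files rather than the number of rows, which is why metadata-only answers can be much faster than a scan.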
Other notable changes include:
- Performance improvement by removing redundant jobs when performing DML operations with deletion vectors.
- The UPDATE command now writes deletion vectors by default when the table has deletion vectors enabled.
- Support for writing partition columns to data files.
- Support for phaseout of v2 checkpoint table feature.
- Fix an issue with case-sensitive column names in Merge.
- Make the VACUUM command Delta protocol aware so that it only vacuums tables whose protocol it supports.
Delta Sharing Spark
- Documentation: https://docs.delta.io/3.1.0/delta-sharing.html
- Maven artifacts: delta-sharing-spark_2.12, delta-sharing-spark_2.13
This release of Delta adds a new module called delta-sharing-spark, which enables reading Delta tables shared using the Delta Sharing protocol in Apache Spark™. It was migrated from the https://github.com/delta-io/delta-sharing/tree/main/spark repository to the https://github.com/delta-io/delta/tree/master/sharing repository. The last release of delta-sharing-spark from the previous location is 1.0.4. The next release of delta-sharing-spark ships with the current release of Delta, 3.1.0.
Supported read types are: read snapshot of the table, incrementally read the table using streaming or read the changes (Change Data Feed) between two versions of the table.
“Delta Format Sharing” is newly introduced in delta-sharing-spark 3.1. It supports reading shared Delta tables that use advanced Delta features such as deletion vectors and column mapping.
Below is an example of reading a Delta table shared using the Delta Sharing protocol in a Spark environment. For more examples refer to the documentation.
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("...")
  .master("...")
  .config(
    "spark.sql.extensions",
    "io.delta.sql.DeltaSparkSessionExtension"
  ).config(
    "spark.sql.catalog.spark_catalog",
    "org.apache.spark.sql.delta.catalog.DeltaCatalog"
  ).getOrCreate()

val tablePath = "<profile-file-path>#<share-name>.<schema-name>.<table-name>"

// Batch query
spark.read
  .format("deltaSharing")
  .option("responseFormat", "delta")
  .load(tablePath)
  .show(10)
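The table path passed to load in the example above follows the pattern <profile-file-path>#<share-name>.<schema-name>.<table-name>. As a minimal sketch, written in Python for brevity, a small helper can assemble that string from its parts. The helper name sharing_table_path and the example values are hypothetical and are not part of delta-sharing-spark.

```python
def sharing_table_path(profile_file: str, share: str, schema: str, table: str) -> str:
    """Build '<profile-file-path>#<share-name>.<schema-name>.<table-name>'.

    Hypothetical convenience helper; it only formats the string and does not
    check that the profile file or the shared table actually exist.
    """
    parts = {
        "profile_file": profile_file,
        "share": share,
        "schema": schema,
        "table": table,
    }
    for name, value in parts.items():
        if not value:
            raise ValueError(f"{name} must be a non-empty string")
    return f"{profile_file}#{share}.{schema}.{table}"


# Example values are placeholders chosen for this sketch.
example_path = sharing_table_path(
    "/tmp/config.share", "my_share", "default", "events"
)
print(example_path)

# The resulting string would be used the same way as tablePath above, e.g. in PySpark:
#   spark.read.format("deltaSharing").option("responseFormat", "delta").load(example_path)
```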
Delta Universal Format (UniForm)
- Documentation: https://docs.delta.io/3.1.0/delta-uniform.html
- Maven artifacts: delta-iceberg_2.12, delta-iceberg_2.13
Delta Universal Format (UniForm) allows you to read Delta tables from Iceberg and Hudi (coming soon) clients. Delta 3.1.0 provides the following improvements:
- Enhanced Iceberg support through IcebergCompatV2. IcebergCompatV2 adds support for LIST and MAP data types and improves compatibility with popular Iceberg reader clients.
- Easier retrieval of the Iceberg metadata file location via familiar SQL syntax DESCRIBE EXTENDED TABLE.
- A new SQL command, REORG TABLE table APPLY (UPGRADE UNIFORM(ICEBERG_COMPAT_VERSION=2)), to enable UniForm on existing Delta tables. See the documentation for details.
- Delta file statistics conversion to Iceberg including max/min/rowCount/nullCount which enables efficient data skipping when the tables are read as Iceberg in queries containing predicates.
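To show why per-file statistics such as max/min/rowCount/nullCount enable data skipping, here is a toy Python model of a reader deciding which files could contain rows matching the predicate x > threshold. The names ColumnStats, may_match_greater_than and files_to_read are hypothetical; this is a sketch of the general technique, not the logic used by Delta or by any Iceberg reader.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ColumnStats:
    """Statistics for column x in one data file (toy model)."""
    path: str
    min_value: Optional[int]  # None when every value is null
    max_value: Optional[int]
    null_count: int
    row_count: int


def may_match_greater_than(stats: ColumnStats, threshold: int) -> bool:
    """Return False only when no row in the file can satisfy x > threshold."""
    if stats.null_count == stats.row_count:
        # All values are null, and null never satisfies a comparison.
        return False
    return stats.max_value is not None and stats.max_value > threshold


def files_to_read(files: List[ColumnStats], threshold: int) -> List[str]:
    """Keep only the files whose statistics do not rule out a match."""
    return [f.path for f in files if may_match_greater_than(f, threshold)]


files = [
    ColumnStats("a.parquet", min_value=0, max_value=10, null_count=0, row_count=100),
    ColumnStats("b.parquet", min_value=5, max_value=50, null_count=3, row_count=100),
    ColumnStats("c.parquet", min_value=None, max_value=None, null_count=20, row_count=20),
    ColumnStats("d.parquet", min_value=60, max_value=90, null_count=0, row_count=100),
]

selected = files_to_read(files, threshold=40)
print(selected)
```

Skipping is conservative in this model: a file that is kept may still contain no matching rows, but a file that is skipped is guaranteed to contain none.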
Delta Kernel
- API documentation: https://docs.delta.io/3.1.0/api/java/kernel/index.html
- Maven artifacts: delta-kernel-api, delta-kernel-defaults
The Delta Kernel project is a set of Java libraries (Rust will be coming soon!) for building Delta connectors that can read (and, soon, write to) Delta tables without the need to understand the [Delta protocol detai...
Delta Lake 3.0.0
We are excited to announce the final release of Delta Lake 3.0.0. This release includes several exciting new features and artifacts.
Highlights
Here are the most important aspects of 3.0.0:
Spark 3.5 Support
Unlike the initial preview release, Delta Spark is now built on top of Apache Spark™ 3.5. See the Delta Spark section below for more details.
Delta Universal Format (UniForm)
- Documentation: https://docs.delta.io/3.0.0/delta-uniform.html
- Maven artifacts: delta-iceberg_2.12, delta-iceberg_2.13
Delta Universal Format (UniForm) will allow you to read Delta tables with Hudi and Iceberg clients. Iceberg support is available with this release. UniForm takes advantage of the fact that all table storage formats, such as Delta, Iceberg, and Hudi, actually consist of Parquet data files and a metadata layer. In this release, UniForm automatically generates Iceberg metadata and commits to Hive metastore, allowing Iceberg clients to read Delta tables as if they were Iceberg tables. Create a UniForm-enabled table using the following command:
CREATE TABLE T (c1 INT) USING DELTA TBLPROPERTIES (
'delta.universalFormat.enabledFormats' = 'iceberg');
Every write to this table will automatically keep Iceberg metadata updated. See the documentation here for more details, and the key implementations here and here.
Delta Kernel
- API documentation: https://docs.delta.io/3.0.0/api/java/kernel/index.html
- Maven artifacts: delta-kernel-api, delta-kernel-defaults
The Delta Kernel project is a set of Java libraries (Rust will be coming soon!) for building Delta connectors that can read (and, soon, write to) Delta tables without the need to understand the Delta protocol details.
You can use this library to do the following:
- Read data from Delta tables in a single thread in a single process.
- Read data from Delta tables using multiple threads in a single process.
- Build a complex connector for a distributed processing engine and read very large Delta tables.
- [soon!] Write to Delta tables from multiple threads / processes / distributed engines.
Reading a Delta table with Kernel APIs is as follows.
TableClient myTableClient = DefaultTableClient.create(); // define a client
Table myTable = Table.forPath(myTableClient, "/delta/table/path"); // define what table to scan
Snapshot mySnapshot = myTable.getLatestSnapshot(myTableClient); // define which version of table to scan
Predicate scanFilter = ... // define the predicate
Scan myScan = mySnapshot.getScanBuilder(myTableClient) // specify the scan details
    .withFilters(scanFilter)
    .build();
Scan.readData(...) // returns the table data
Full example code can be found here.
For more information, refer to:
- User guide on step by step process of using Kernel in a standalone Java program or in a distributed processing connector.
- Slides explaining the rationale behind Kernel and the API design.
- Example Java programs that illustrate how to read Delta tables using the Kernel APIs.
- Table and default TableClient API Java documentation
This release of Delta contains the Kernel Table API and default TableClient API definitions and implementation which allow:
- Reading Delta tables with optional Deletion Vectors enabled or column mapping (name mode only) enabled.
- Partition pruning optimization to reduce the number of data files to read.
Welcome Delta Connectors to the Delta repository!
All previous connectors from https://github.com/delta-io/connectors have been moved to this repository (https://github.com/delta-io/delta) as we aim to unify our Delta connector ecosystem structure. This includes Delta-Standalone, Delta-Flink, Delta-Hive, PowerBI, and SQL-Delta-Import. The repository https://github.com/delta-io/connectors is now deprecated.
Delta Spark
Delta Spark 3.0.0 is built on top of Apache Spark™ 3.5. Similar to Apache Spark, we have released Maven artifacts for both Scala 2.12 and Scala 2.13. Note that the Delta Spark maven artifact has been renamed from delta-core to delta-spark.
- Documentation: https://docs.delta.io/3.0.0/index.html
- API documentation: https://docs.delta.io/3.0.0/delta-apidoc.html#delta-spark
- Maven artifacts: delta-spark_2.12, delta-spark_2.13, delta-contribs_2.12, delta-contribs_2.13, delta-storage, delta-storage-s3-dynamodb, delta-iceberg_2.12, delta-iceberg_2.13
- Python artifacts: https://pypi.org/project/delta-spark/3.0.0/
The key features of this release are:
- Support for Apache Spark 3.5
- Delta Universal Format - Write as Delta, read as Iceberg! See the highlighted section above.
- Up to 10x performance improvement of UPDATE using Deletion Vectors - Delta UPDATE operations now support writing Deletion Vectors. When enabled, the performance of UPDATEs will receive a significant boost.
- More than 2x performance improvement of DELETE using Deletion Vectors - This fix improves the file path canonicalization logic by avoiding expensive Path.toUri.toString calls for each row in a table, resulting in a several hundred percent speed boost on DELETE operations (only when Deletion Vectors have been enabled on the table).
- Up to 2x faster MERGE operation - MERGE now better leverages data skipping, the ability to use the insert-only code path in more cases, and an overall improved execution to achieve up to 2x better performance in various scenarios.
- Support streaming reads from column mapping enabled tables when DROP COLUMN and RENAME COLUMN have been used. This includes streaming support for Change Data Feed. See the documentation here for more details.
- Support specifying the columns for which Delta will collect file-skipping statistics via the table property delta.dataSkippingStatsColumns. Previously, Delta would only collect file-skipping statistics for the first N columns in the table schema (default to 32). Now, users can easily customize this.
- Support zero-copy convert to Delta from Iceberg tables on Apache Spark 3.5 using CONVERT TO DELTA. This feature was excluded from the Delta Lake 2.4 release since Iceberg did not yet support Apache Spark 3.4 (or 3.5). This command generates a Delta table in the same location and does not rewrite any parquet files.
- Checkpoint V2 - Introduced a new Checkpoint V2 format in Delta Protocol Specification and implemented read/write support in Delta Spark. The new checkpoint v2 format provides more reliability over the existing v1 checkpoint format.
- Log Compactions - Introduced new log compaction files in Delta Protocol Specification which could be useful in reducing the frequency of Delta checkpoints. Added read support for log compaction files in Delta Spark.
- Safe casts enabled by default for UPDATE and MERGE operations - Delta UPDATE and MERGE operations now result in an error when values cannot be safely ca...
Delta Lake 2.4.0
We are excited to announce the release of Delta Lake 2.4.0 on Apache Spark 3.4. Similar to Apache Spark™, we have released Maven artifacts for both Scala 2.12 and Scala 2.13.
- Documentation: https://docs.delta.io/2.4.0/
- Maven artifacts: delta-core_2.12, delta-core_2.13, delta-contribs_2.12, delta-contribs_2.13, delta-storage, delta-storage-s3-dynamodb
- Python artifacts: https://pypi.org/project/delta-spark/2.4.0/
The key features in this release are as follows
- Support for Apache Spark 3.4.
- Support writing Deletion Vectors for the DELETE command. Previously, when deleting rows from a Delta table, any file with at least one matching row would be rewritten. With Deletion Vectors these expensive rewrites can be avoided. See What are deletion vectors? for more details.
- Support for all write operations on tables with Deletion Vectors enabled.
- Support PURGE to remove Deletion Vectors from the current version of a Delta table by rewriting any data files with deletion vectors. See the documentation for more details.
- Support reading Change Data Feed for tables with Deletion Vectors enabled.
- Support REPLACE WHERE expressions in SQL to selectively overwrite data. Previously “replaceWhere” options were only supported in the DataFrameWriter APIs.
- Support WHEN NOT MATCHED BY SOURCE clauses in SQL for the Merge command.
- Support omitting generated columns from the column list for SQL INSERT INTO queries. Delta will automatically generate the values for any unspecified generated columns.
- Support the TimestampNTZ data type added in Spark 3.3. Using TimestampNTZ requires a Delta protocol upgrade; see the documentation for more information.
- Other notable changes
- Increased resiliency for S3 multi-cluster reads and writes.
- Allow changing the column type of a char or varchar column to a compatible type in the ALTER TABLE command. The new behavior is the same as in Apache Spark and allows upcasting from char or varchar to varchar or string.
- Block using overwriteSchema with dynamic partition overwrite. This can corrupt the table as not all the data may be removed, and the schema of the newly written partitions may not match the schema of the unchanged partitions.
- Return an empty DataFrame for Change Data Feed reads when there are no commits within the timestamp range provided. Previously an error would be thrown.
- Fix a bug in Change Data Feed reads for records created during the ambiguous hour when daylight savings occurs.
- Fix a bug where querying an external Delta table at the root of an S3 bucket would throw an error.
- Remove leaked internal Spark metadata from the Delta log to make any affected tables readable again.
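The Deletion Vectors items in the list above can be illustrated with a toy Python model. Instead of rewriting a data file when rows are deleted, the writer records the positions of the deleted rows next to the file, readers filter those positions out, and a PURGE-like step later rewrites the file without them. The class DataFile and the functions delete_where, read_rows and purge are hypothetical names for this sketch, and a Python set stands in for the real on-disk encoding.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set


@dataclass
class DataFile:
    """An immutable list of rows plus a toy deletion vector."""
    rows: List[int]
    deleted_positions: Set[int] = field(default_factory=set)


def delete_where(data_file: DataFile, predicate: Callable[[int], bool]) -> int:
    """Mark matching rows as deleted without touching data_file.rows.

    Returns the number of rows newly marked as deleted.
    """
    newly_deleted = {
        position
        for position, row in enumerate(data_file.rows)
        if position not in data_file.deleted_positions and predicate(row)
    }
    data_file.deleted_positions |= newly_deleted
    return len(newly_deleted)


def read_rows(data_file: DataFile) -> List[int]:
    """Return the live rows, skipping positions listed in the deletion vector."""
    return [
        row
        for position, row in enumerate(data_file.rows)
        if position not in data_file.deleted_positions
    ]


def purge(data_file: DataFile) -> DataFile:
    """Rewrite the file so it holds only live rows and no deletion vector."""
    return DataFile(rows=read_rows(data_file))


original = DataFile(rows=[1, 2, 3, 4, 5, 6])
deleted_count = delete_where(original, lambda row: row % 2 == 0)
live_rows = read_rows(original)
purged = purge(original)
print(deleted_count, live_rows, len(original.rows), purged.rows)
```

In this model the delete itself only records a few positions; the full rewrite is deferred until purge is called.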
Note: the Delta Lake 2.4.0 release does not include the Iceberg to Delta converter because iceberg-spark-runtime does not support Spark 3.4 yet. The Iceberg to Delta converter is still supported when using Delta 2.3 with Spark 3.3.
Credits
Alkis Evlogimenos, Allison Portis, Andreas Chatzistergiou, Anton Okolnychyi, Bart Samwel, Bo Gao, Carl Fu, Chaoqin Li, Christos Stavrakakis, David Lewis, Desmond Cheong, Dhruv Shah, Eric Maynard, Fred Liu, Fredrik Klauss, Haejoon Lee, Hussein Nagree, Jackie Zhang, Jintian Liang, Johan Lasperas, Lars Kroll, Lukas Rupprecht, Matthew Powers, Ming DAI, Ming Dai, Naga Raju Bhanoori, Paddy Xu, Prakhar Jain, Rahul Shivu Mahadev, Rui Wang, Ryan Johnson, Sabir Akhadov, Satya Valluri, Scott Sandre, Shixiong Zhu, Tom van Bussel, Venki Korukanti, Vitalii Li, Wenchen Fan, Xi Liang, Yaohua Zhao, Yuming Wang