feat: ducklake destination
#3015
Conversation
rudolfix
left a comment
this looks really good! here's a summary of my suggestions:
- simplify the ducklake credentials class (i.e. remove `__init__`, implement `_conn_str()`)
- load extensions in `borrow_conn`
- we'll need to tweak how connections are opened in the ibis handover (but that's easy)
| return self.database == ":pipeline:" |

| def on_resolved(self) -> None: |
|     # TODO Why don't we support `:memory:` string? |
we support it. you can pass a duckdb instance instead of credentials and the destination factory will use it:
https://dlthub.com/docs/dlt-ecosystem/destinations/duckdb#destination-configuration (those docs will benefit from better section titles)
a `:memory:` database is wiped when the connection is closed. during loading the connection will be opened and closed several times, e.g. to migrate schemas, and at the end all the data would be lost because we close all connections when the loader exits
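For illustration, the pattern from the linked docs looks roughly like this; passing a pre-opened, persistent duckdb connection keeps the data around after the loader closes its own connections (pipeline name and file path are made up):

```python
import duckdb
import dlt

# a persistent database file instead of :memory:, so nothing is wiped on close
db = duckdb.connect("jaffle.duckdb")
pipeline = dlt.pipeline("jaffle_shop", destination=dlt.destinations.duckdb(db))
```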
| # NOTE duckdb extensions are only loaded when using the dlt cursor. They are not |
| # loaded when using the native connection (e.g., when passing it to Ibis) |
there's a mechanism to load extensions at start. it could be made easier for implementers, but right now you can update extensions in `on_resolved` of `DuckLakeCredentials(DuckDbBaseCredentials)` (which you implement below).
some docs: https://dlthub.com/docs/dlt-ecosystem/destinations/duckdb#additional-configuration
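A minimal sketch of that option, assuming `DuckDbBaseCredentials` keeps a list of extensions that `borrow_conn` installs and loads when opening connections (the `extensions` field name is an assumption, not a confirmed API):

```python
class DuckLakeCredentials(DuckDbBaseCredentials):
    def on_resolved(self) -> None:
        # hypothetical field consumed by borrow_conn at connection time
        self.extensions = ["ducklake"]
```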
another option you have is to subclass sql_client. see the base class.
class DuckDbSqlClient(SqlClientBase[duckdb.DuckDBPyConnection], DBTransaction):
    dbapi: ClassVar[DBApi] = duckdb

    def __init__(
        self,
        dataset_name: str,
        staging_dataset_name: str,
        credentials: DuckDbBaseCredentials,
        capabilities: DestinationCapabilitiesContext,
    ) -> None:
        super().__init__(None, dataset_name, staging_dataset_name, capabilities)
        self._conn: duckdb.DuckDBPyConnection = None
        self.credentials = credentials
        # set additional connection options so derived class can change it
        # TODO: move that to methods that can be overridden, include local_config
        self._pragmas = ["enable_checkpoint_on_shutdown"]
        self._global_config: Dict[str, Any] = {
            "TimeZone": "UTC",
            "checkpoint_threshold": "1gb",
        }

    @raise_open_connection_error
    def open_connection(self) -> duckdb.DuckDBPyConnection:
        self._conn = self.credentials.borrow_conn(
            pragmas=self._pragmas,
            global_config=self._global_config,
            local_config={
                "search_path": self.fully_qualified_dataset_name(),
            },
        )
        return self._conn

and inject extensions on init or when the connection is being opened.
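For illustration, a minimal sketch of that subclassing option, loading the extension when the connection opens (the extension handling is illustrative, not the PR's actual code):

```python
class DuckLakeSqlClient(DuckDbSqlClient):
    @raise_open_connection_error
    def open_connection(self) -> duckdb.DuckDBPyConnection:
        conn = super().open_connection()
        # make the ducklake extension available on the native connection as well,
        # e.g. for the ibis handover mentioned above
        conn.execute("INSTALL ducklake;")
        conn.execute("LOAD ducklake;")
        return conn
```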
| self.memory_db = None |

| def _install_extension(duckdb_sql_client: DuckDbSqlClient, extension_name: LiteralString) -> None: |
mhmmm I think the code that adds extensions in borrow_conn will suffice. if not, we can move those utils there?
| class DuckLakeCredentials(DuckDbCredentials): |
|     def __init__( |
|         self, |
|         # TODO how does duckdb resolve the name of the database to the name of the dataset / pipeline |
here's something that I may not fully grasp. but DuckLakeCredentials will create a `:memory:` instance
- to which you attach the `catalog` below
- to which you attach the `storage`
- that gets configured with extensions and settings in `DuckLakeCredentials` (self)
- and this instance of `DuckLakeCredentials` is used to `borrow_conn`

so what should `dataset_name` assume here? the catalog database if it is duckdb? pls see below
For the default case, here's what I'm currently aiming for:

pipeline = dlt.pipeline("jaffle_shop", destination="ducklake")
pipeline.run(...)

- a duckdb instance is created in `:memory:`; we call it the `ducklake_client`
- the `ducklake_client` installs the `ducklake` extension for duckdb (needs to be done once per system)
- the `ducklake_client` uses the `ATTACH` command to load a `catalog` and `storage`
- the `catalog` is a duckdb instance on disk (with extension `.ducklake` instead of `.duckdb` by convention)
- the default `storage` is completely handled by DuckDB / DuckLake

The outcome is:

|- pipeline.py
|- jaffle_shop.ducklake           # catalog file (if duckdb or sqlite)
|- jaffle_shop.ducklake.files/    # storage
   |- main/                       # schema level
      |- customers/               # table level
         |- data.parquet          # data
      |- orders/

Design
- The `DuckLakeCredentials` inherits from `DuckDbCredentials` and the "main" credentials are used to define the `ducklake_client`
- We always use an in-memory DuckDB connection for the `ducklake_client`
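To make the steps above concrete, here's a rough sketch of what the `ducklake_client` would execute, written with plain duckdb (statements and file names follow the example and are illustrative):

```python
import duckdb

conn = duckdb.connect(":memory:")  # the ducklake_client
conn.execute("INSTALL ducklake;")  # once per system
conn.execute("LOAD ducklake;")
# attach catalog + storage; with no DATA_PATH given, DuckLake derives
# jaffle_shop.ducklake.files/ next to the catalog file by convention
conn.execute("ATTACH 'ducklake:jaffle_shop.ducklake' AS jaffle_shop;")
conn.execute("USE jaffle_shop;")
```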
|         # TODO how does duckdb resolve the name of the database to the name of the dataset / pipeline |
|         ducklake_name: str = "ducklake", |
|         *, |
|         catalog_database: Optional[Union[ConnectionStringCredentials, DuckDbCredentials]] = None, |
postgres, mysql, duckdb, motherduck are all ConnectionStringCredentials so maybe that's enough to put here
you can use drivername to distinguish them
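A sketch of that suggestion; `drivername` is the field on dlt's `ConnectionStringCredentials`, while the helper and the category names are hypothetical:

```python
from dlt.common.configuration.specs import ConnectionStringCredentials

def catalog_backend(creds: ConnectionStringCredentials) -> str:
    # hypothetical helper: route the catalog to a backend by driver name
    if creds.drivername in ("duckdb", "motherduck"):
        return "duckdb"
    if creds.drivername in ("postgres", "postgresql"):
        return "postgres"
    if creds.drivername == "mysql":
        return "mysql"
    raise ValueError(f"unsupported catalog driver: {creds.drivername}")
```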
| return caps |

| # TODO support connecting to a snapshot |
that would be amazing but we can do that later. snapshots mean reproducible local environments that you can get with zero-copy
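If it helps later, connecting to a snapshot would presumably lean on DuckLake's ATTACH options; a sketch, assuming a `SNAPSHOT_VERSION` option is available:

```python
import duckdb

conn = duckdb.connect(":memory:")
conn.execute("LOAD ducklake;")
# pin the attached catalog to a specific snapshot (version number is illustrative)
conn.execute("ATTACH 'ducklake:jaffle_shop.ducklake' AS jaffle_snap (SNAPSHOT_VERSION 3);")
```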
| attach_statement = f"ATTACH IF NOT EXISTS 'ducklake:{ducklake_name}.ducklake'" |
| if storage: |
|     # TODO handle storage credentials by creating secrets |
|     attach_statement += f" (DATA_PATH {storage.bucket_url})" |
you should pass storage to create_secret before you attach (after you open the connection)
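Sketching the suggested order with DuckDB's CREATE SECRET syntax; the storage credential attributes are assumptions modeled on dlt's filesystem credentials:

```python
def attach_ducklake(credentials, storage, attach_statement) -> None:
    conn = credentials.borrow_conn()
    # 1. create a secret from the storage credentials so ATTACH can reach the bucket
    conn.execute(f"""
        CREATE OR REPLACE SECRET ducklake_storage (
            TYPE s3,
            KEY_ID '{storage.credentials.aws_access_key_id}',
            SECRET '{storage.credentials.aws_secret_access_key}'
        );
    """)
    # 2. only then attach the catalog with DATA_PATH pointing at that bucket
    conn.execute(attach_statement)
```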
| ) |

| def test_native_duckdb_workflow(tmp_path): |
makes sense to do a few "smoke tests". the next step would be to enable ducklake to run exactly the same tests as duckdb, using e.g. a local duckdb as catalog and the local filesystem as storage.
let's do another iteration of this ticket and then I'll look at this. I was able to do the same with the iceberg destination so I'm pretty sure it will work
| # TODO add connection to a specific snapshot |
| # TODO does it make sense for ducklake to have a staging destination? |
good point, see here: #1692
| return DuckLakeClient |

| def _raw_capabilities(self) -> DestinationCapabilitiesContext: |
note: ducklake will support upsert (MERGE INTO) so we can enable this strategy to see if it works
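As an illustration, enabling it would likely be a one-liner in `_raw_capabilities`; `supported_merge_strategies` is the existing capabilities field, and which other strategies to keep is an assumption:

```python
def _raw_capabilities(self) -> DestinationCapabilitiesContext:
    caps = super()._raw_capabilities()  # sketch: start from the duckdb capabilities
    # advertise upsert once ducklake's MERGE INTO support lands
    caps.supported_merge_strategies = ["delete-insert", "upsert"]
    return caps
```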
…ile names, allows to copy local file context in WithLocalFiles
…ng when connections opened in duckdb, improves error handling if commit tx fails
… (2) point all local files to local_dir (3) allow various urls to configure ducklake name (4) uses parquet as default file format
…open_connection which provides full context)
there were pretty complicated issues with parallel loading, which also depended on the catalog type. I underestimated the effort a little when writing the original ticket... luckily it seems that most of the work is done. below is a description of the changes I've made; also see the commit log.
what is left:
let's touch base on how and when to continue here
…d name in other cases
| url = url._replace(query=None) |
| # we only have control over netloc/path |
| return url.render_as_string(hide_password=True) |
| except Exception: |
Should the exception be more specific here? Could swallowing all exceptions hide potential issues?
@rudolfix you should fix this directly, because I don't know what types of exceptions you expect
burnash
left a comment
@rudolfix please see the comments.
| # NOTE: database must be detached otherwise it is left in inconsistent state |
| # TODO: perhaps move attach/detach to connection pool |
| self._conn.execute(self.attach_statement) |
| self._conn.execute(f"USE {self.credentials.catalog_name};") |
Should this use escape_identifier() instead of passing a bare catalog_name?
You're right that it's better to escape the identifier. We should also improve assertions to check that catalog_name is a valid DuckDB SQL identifier here, i.e., that there's no funny character in catalog_name.
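A sketch of that fix; `escape_identifier` is the callable dlt exposes on the capabilities context, and the surrounding attributes follow the quoted snippet:

```python
# quote the catalog name instead of interpolating it bare
catalog = self.capabilities.escape_identifier(self.credentials.catalog_name)
self._conn.execute(self.attach_statement)
self._conn.execute(f"USE {catalog};")
```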
Adding here for legacy:
The parameter
Related Issues
Core changes
- `DuckDbBaseCredentials`: it is decoupled from `ConnectionStringCredentials` and used to configure connection options, pragmas, extensions and to host the connection pool. Reason: ducklake does not need `ConnectionStringCredentials` at the top level, it only needs `catalog_name`. Overall this looks much nicer
- the connection pool lives in `DuckDbBaseCredentials`. Reason: this is the only singleton persisted during the whole load step. Correct implementation: allow for the connection pool as a top-level entity. For now: good enough
- there is a mechanism in `dlt` that points all relative paths to data (i.e. duckdb database, local filesystem data). I cleaned it up and made it easy to propagate redirects to embedded configurations (i.e. now both catalog and storage are redirected). Still I'm not happy with how it is implemented... I'll give it another run in the future