
Conversation

@zilto
Collaborator

@zilto zilto commented Aug 19, 2025

Related Issues

Core changes

  1. created a new connection pool class and separated borrow/return conn from duckdb credentials (see the sketch after this list).
  2. the connection pool can create new connections for every thread. ducklake does not work with the default parallelism (the other thread always locks)
  3. changes in the derivation structure of DuckDbBaseCredentials: it is decoupled from ConnectionStringCredentials and is used to configure connection options, pragmas, extensions and to host the connection pool. reason: ducklake does not need ConnectionStringCredentials at the top level, it only needs catalog_name. Overall this looks much nicer
  4. the connection pool is still part of DuckDbBaseCredentials. Reason: this is the only singleton persisted during the whole load step. The correct implementation would allow the connection pool as a top-level entity. For now: good enough
  5. a few bugs and undefined behaviors came out in the configuration resolver. reason: ducklake configuration is complicated and modular, i.e. for the first time a 3rd and 4th config embedding level goes to production (ducklake configuration -> ducklake credentials -> filesystem configuration -> filesystem credentials). fixes:
  • the first unresolved embedded configuration was stopping the resolution immediately. now all configurations are resolved to the end. example: catalog credentials were missing but filesystem credentials were configured. with the bug the filesystem credentials were never resolved and defaults were used for both.
  • as a consequence, lookup traces are now collected recursively. previously only lookup traces from the failing embedded configuration were collected; now we collect all of them
  • the error message is IMO way nicer and shows the embedded traces. it also shows the successfully resolved fields, so the user sees that something actually worked and something did not
  6. there's a mechanism in dlt that points all relative paths to data (i.e. duckdb database, local filesystem data). I cleaned it up and made it easy to propagate redirects to embedded configurations (i.e. now both catalog and storage are redirected). Still, I'm not happy with how it is implemented... I'll give it another run in the future
  7. there are docs for ducklake. Reading them will help with the review
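
To illustrate points 1 and 2, here is a minimal, hypothetical sketch of a per-thread pool (not dlt's actual class; names and shutdown handling are simplified):

import threading

import duckdb


class PerThreadConnectionPool:
    """Hypothetical sketch: borrow/return lives outside the credentials object."""

    def __init__(self, database: str) -> None:
        self.database = database
        self._local = threading.local()

    def borrow_conn(self) -> duckdb.DuckDBPyConnection:
        # each thread lazily opens and then reuses its own connection,
        # so ducklake threads do not lock each other out
        if getattr(self._local, "conn", None) is None:
            self._local.conn = duckdb.connect(self.database)
        return self._local.conn

    def return_conn(self, conn: duckdb.DuckDBPyConnection) -> None:
        # connections stay open per thread for reuse; closing everything
        # happens when the pool is shut down (omitted in this sketch)
        pass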

@netlify

netlify bot commented Aug 19, 2025

Deploy Preview for dlt-hub-docs ready!

Name Link
🔨 Latest commit 531a1de
🔍 Latest deploy log https://app.netlify.com/projects/dlt-hub-docs/deploys/68d2d5b502c65c0008ca0af3
😎 Deploy Preview https://deploy-preview-3015--dlt-hub-docs.netlify.app

Collaborator

@rudolfix rudolfix left a comment

this looks really good! here's a summary of my suggestions:

  • simplify the ducklake credentials class (i.e. remove __init__, implement _conn_str()); a rough sketch follows this list
  • load extensions in borrow_conn
  • we'll need to tweak how connections are opened in the ibis handover (but that's easy)
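
A rough, hypothetical sketch of the first suggestion (import paths and the _conn_str() hook are assumptions, not the actual implementation):

from dlt.common.configuration import configspec
from dlt.destinations.impl.duckdb.configuration import DuckDbBaseCredentials


@configspec
class DuckLakeCredentials(DuckDbBaseCredentials):
    catalog_name: str = "ducklake"

    def _conn_str(self) -> str:
        # the ducklake client itself is always an in-memory duckdb instance;
        # the catalog and storage are attached separately via ATTACH
        return ":memory:"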

return self.database == ":pipeline:"

def on_resolved(self) -> None:
# TODO Why don't we support `:memory:` string?
Collaborator

we support it. you can pass a duckdb instance instead of credentials and the destination factory will use it:
https://dlthub.com/docs/dlt-ecosystem/destinations/duckdb#destination-configuration (those docs would benefit from better section titles)

a :memory: database is wiped out when the connection is closed. during loading the connection is opened and closed several times, e.g. to migrate schemas, and at the end all the data would be lost because we close all connections when the loader exits
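
For example (hedged sketch; pipeline and table names are made up), keeping the :memory: database alive by owning the connection and handing it to the duckdb destination factory, as the linked docs describe:

import dlt
import duckdb

# we own the connection, so the in-memory database survives dlt's
# repeated open/close cycles during the load
conn = duckdb.connect(":memory:")

pipeline = dlt.pipeline("memory_pipeline", destination=dlt.destinations.duckdb(conn))
pipeline.run([{"id": 1}], table_name="items")

# the data stays queryable on our connection after the load
# (the dataset name defaults to "<pipeline_name>_dataset")
print(conn.sql("SELECT * FROM memory_pipeline_dataset.items"))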



# NOTE duckdb extensions are only loaded when using the dlt cursor. They are not
# loaded when using the native connection (e.g., when passing it to Ibis)
Collaborator

there's a mechanism to load extensions at start. it could be made easier for implementers, but right now you can update extensions in on_resolved of DuckLakeCredentials(DuckDbBaseCredentials) (which you implement below).

some docs: https://dlthub.com/docs/dlt-ecosystem/destinations/duckdb#additional-configuration

another option you have is to subclass the sql_client. see the base class:

class DuckDbSqlClient(SqlClientBase[duckdb.DuckDBPyConnection], DBTransaction):
    dbapi: ClassVar[DBApi] = duckdb

    def __init__(
        self,
        dataset_name: str,
        staging_dataset_name: str,
        credentials: DuckDbBaseCredentials,
        capabilities: DestinationCapabilitiesContext,
    ) -> None:
        super().__init__(None, dataset_name, staging_dataset_name, capabilities)
        self._conn: duckdb.DuckDBPyConnection = None
        self.credentials = credentials
        # set additional connection options so derived class can change it
        # TODO: move that to methods that can be overridden, include local_config
        self._pragmas = ["enable_checkpoint_on_shutdown"]
        self._global_config: Dict[str, Any] = {
            "TimeZone": "UTC",
            "checkpoint_threshold": "1gb",
        }

    @raise_open_connection_error
    def open_connection(self) -> duckdb.DuckDBPyConnection:
        self._conn = self.credentials.borrow_conn(
            pragmas=self._pragmas,
            global_config=self._global_config,
            local_config={
                "search_path": self.fully_qualified_dataset_name(),
            },
        )
        return self._conn

and inject extensions on init or when the connection is being opened
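
A hedged sketch of the first option, updating the extensions when the credentials resolve (the extensions field name and import paths are assumptions about DuckDbBaseCredentials, not verified against the actual class):

from dlt.common.configuration import configspec
from dlt.destinations.impl.duckdb.configuration import DuckDbBaseCredentials


@configspec
class DuckLakeCredentials(DuckDbBaseCredentials):
    def on_resolved(self) -> None:
        super().on_resolved()
        # assumed field: a list of duckdb extensions that borrow_conn() loads
        extensions = list(self.extensions or [])
        if "ducklake" not in extensions:
            extensions.append("ducklake")
        self.extensions = extensions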

self.memory_db = None


def _install_extension(duckdb_sql_client: DuckDbSqlClient, extension_name: LiteralString) -> None:
Collaborator

mhmmm I think the code that adds extensions in borrow_conn will suffice. if not we can move those utils there?

class DuckLakeCredentials(DuckDbCredentials):
    def __init__(
        self,
        # TODO how does duckdb resolve the name of the database to the name of the dataset / pipeline
Collaborator

here's something that I may not fully grasp, but DuckLakeCredentials will create a :memory: instance

  • to which you attach the catalog below
  • to which you attach the storage
  • that gets configured with extensions and settings in DuckLakeCredentials (self)
  • and this DuckLakeCredentials instance is used to borrow_conn

so what should dataset_name assume here? the catalog database if it is duckdb? pls see below

Collaborator Author

For the default case, here's what I'm currently aiming for:

pipeline = dlt.pipeline("jaffle_shop", destination="ducklake")
pipeline.run(...)
  • a duckdb instance is created in :memory:; we call it the ducklake_client
  • the ducklake_client installs the ducklake extension for duckdb (needs to be done once per system)
  • the ducklake_client uses the ATTACH command to load a catalog and storage
  • the catalog is a duckdb instance on disk (with extension .ducklake instead of .duckdb by convention)
  • the default storage is completely handled by DuckDB / DuckLake

The outcome is:

|- pipeline.py
|- jaffle_shop.ducklake  # catalog file (if duckdb or sqlite)
|- jaffle_shop.ducklake.files/  # storage
   |- main/  # schema level
      |- customers/  # table level
          |- data.parquet  # data
      |- orders/

Design

  • The DuckLakeCredentials inherits from DuckDbCredentials and the "main" credentials are used to define the ducklake_client
  • We always use an in-memory DuckDB connection for the ducklake_client
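
In raw duckdb, the flow above roughly corresponds to this sketch (the actual statements emitted by the destination may differ):

import duckdb

# the ducklake_client: an in-memory duckdb instance
con = duckdb.connect(":memory:")

# install once per system, load per connection
con.install_extension("ducklake")
con.load_extension("ducklake")

# attach catalog (jaffle_shop.ducklake) and storage (jaffle_shop.ducklake.files/)
con.execute(
    "ATTACH 'ducklake:jaffle_shop.ducklake' AS jaffle_shop "
    "(DATA_PATH 'jaffle_shop.ducklake.files')"
)
con.execute("USE jaffle_shop")

# tables created now end up as parquet files under the data path
con.execute("CREATE TABLE customers AS SELECT 1 AS id, 'alice' AS name")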

        # TODO how does duckdb resolve the name of the database to the name of the dataset / pipeline
        ducklake_name: str = "ducklake",
        *,
        catalog_database: Optional[Union[ConnectionStringCredentials, DuckDbCredentials]] = None,
Collaborator

postgres, mysql, duckdb, motherduck are all ConnectionStringCredentials so maybe that's enough to put here

Collaborator

you can use drivername to distinguish them
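
For illustration, a hypothetical helper that branches on drivername (the attach strings are simplified and are not the PR's actual code):

from dlt.common.configuration.specs import ConnectionStringCredentials


def catalog_attach_target(catalog: ConnectionStringCredentials) -> str:
    # pick the ducklake catalog flavour from the credentials' drivername
    if catalog.drivername in ("duckdb", "motherduck"):
        return f"ducklake:{catalog.database}"
    if catalog.drivername in ("postgres", "postgresql"):
        return f"ducklake:postgres:dbname={catalog.database} host={catalog.host}"
    if catalog.drivername == "sqlite":
        return f"ducklake:sqlite:{catalog.database}"
    raise ValueError(f"unsupported catalog drivername: {catalog.drivername}")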

return caps


# TODO support connecting to a snapshot
Collaborator

that would be amazing but we can do that later. snapshots mean reproducible local environments that you can get with 0 copy

attach_statement = f"ATTACH IF NOT EXISTS 'ducklake:{ducklake_name}.ducklake'"
if storage:
# TODO handle storage credentials by creating secrets
attach_statement += f" (DATA_PATH {storage.bucket_url})"
Collaborator

you should pass storage to create_secret before you attach (after you open the connection)
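
A hedged sketch of that ordering with an S3 bucket (secret values are placeholders; in the destination they would come from the resolved filesystem credentials):

import duckdb

con = duckdb.connect(":memory:")
con.install_extension("ducklake")
con.load_extension("ducklake")

# 1. create the storage secret first so the attached catalog can write to the bucket
con.execute(
    """
    CREATE OR REPLACE SECRET ducklake_storage (
        TYPE S3,
        KEY_ID 'AKIA...',
        SECRET '...',
        REGION 'eu-central-1'
    )
    """
)

# 2. only then attach the catalog with the remote data path
con.execute(
    "ATTACH 'ducklake:catalog.ducklake' AS my_ducklake "
    "(DATA_PATH 's3://my-bucket/ducklake/')"
)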

)


def test_native_duckdb_workflow(tmp_path):
Collaborator

makes sense to do a few "smoke tests". the next step would be to enable ducklake to run exactly the same tests as duckdb, using e.g. a local duckdb as catalog and the local filesystem as storage.

let's do another iteration of this ticket and then I'll look at this. I was able to do the same with the iceberg destination so I'm pretty sure it will work



# TODO add connection to a specific snapshot
# TODO does it make sense for ducklake to have a staging destination?
Collaborator

good point see here: #1692


return DuckLakeClient

def _raw_capabilities(self) -> DestinationCapabilitiesContext:
Collaborator

note: ducklake will support upsert (MERGE INTO) so we can enable this strategy to see if it works
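
For reference, a plain DuckDB sketch of the statement family this maps to (assumes a DuckDB version with MERGE INTO support; table and column names are made up):

import duckdb

con = duckdb.connect(":memory:")
con.execute("CREATE TABLE customers (id INTEGER, name VARCHAR)")
con.execute("INSERT INTO customers VALUES (1, 'old name')")
con.execute("CREATE TABLE staging_customers (id INTEGER, name VARCHAR)")
con.execute("INSERT INTO staging_customers VALUES (1, 'new name'), (2, 'inserted')")

# upsert staged rows into the target table
con.execute(
    """
    MERGE INTO customers AS t
    USING staging_customers AS s
    ON t.id = s.id
    WHEN MATCHED THEN UPDATE SET name = s.name
    WHEN NOT MATCHED THEN INSERT (id, name) VALUES (s.id, s.name)
    """
)
print(con.sql("SELECT * FROM customers ORDER BY id"))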

@zilto zilto force-pushed the feat/ducklake-destination branch from cac1f1d to d342f77 Compare August 26, 2025 12:38
@rudolfix
Collaborator

there were pretty complicated issues with parallel loading, which also depended on the catalog type. I underestimated the effort a little when writing the original ticket... luckily it seems that most of the work is done. below is a description of the changes I've made. also see the commit log:

  1. created a new connection pool class and separated borrow/return conn from duckdb credentials. it got too convoluted
  2. the connection pool can create new connections for every thread. ducklake does not work with the default parallelism (the other thread always locks)
  3. enabled standard destination tests (well, they were already enabled), just 2-3 tests are not passing, to be investigated
  4. if duckdb or sqlite is configured as the catalog, load jobs are sequential
  5. I added ibis handling. seems to work.
  6. support for sqlite, duckdb, postgres (with a simple test). emits storage secrets (no tests), mysql support (no tests)

what is left:

  1. tests: best if we run the ducklake tests again as separate remote tests with a remote postgres catalog + do a few smoke tests on supported buckets
  2. motherduck catalog is not supported
  3. the open table interface is not supported (i.e. to compact the tables)
  4. I left most of the commented code and module docstrings. may need cleanup
  5. we are missing docs

let's touch base on how and when to continue here

@rudolfix rudolfix force-pushed the feat/ducklake-destination branch from a61e94b to 3cac519 Compare August 31, 2025 12:49
@rudolfix rudolfix assigned rudolfix and unassigned zilto Sep 10, 2025
@rudolfix rudolfix force-pushed the feat/ducklake-destination branch from 99b704a to 9e1c2eb Compare September 21, 2025 11:22
@rudolfix rudolfix added the ci full Use to trigger CI on a PR for full load tests label Sep 21, 2025
@rudolfix rudolfix marked this pull request as ready for review September 21, 2025 18:05
@rudolfix rudolfix requested a review from burnash September 22, 2025 18:43
url = url._replace(query=None)
# we only have control over netloc/path
return url.render_as_string(hide_password=True)
except Exception:
Collaborator

Should the exception be more specific here? Could swallowing all exceptions hide potential issues?

Collaborator Author

@rudolfix You should fix this directly because I don't know what type of exceptions you expect

@rudolfix rudolfix merged commit 8565a2a into devel Sep 24, 2025
132 of 139 checks passed
Collaborator

@burnash burnash left a comment

@rudolfix please see the comments.

# NOTE: database must be detached otherwise it is left in inconsistent state
# TODO: perhaps move attach/detach to connection pool
self._conn.execute(self.attach_statement)
self._conn.execute(f"USE {self.credentials.catalog_name};")
Collaborator

Should this use escape_identifier() instead of passing a bare catalog_name?

Collaborator Author

@zilto zilto Sep 30, 2025

You're right that it's better to escape the identifier. We should also improve assertions to check that catalog_name is a valid DuckDB SQL identifier here, i.e., that there's no funny character in catalog_name
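
A minimal sketch of the escaping part (plain double-quote identifier quoting; dlt's own escape_identifier helper may differ):

def quote_duckdb_identifier(name: str) -> str:
    # standard SQL identifier quoting: wrap in double quotes, double any embedded quotes
    return '"' + name.replace('"', '""') + '"'


# at the call site shown above this would become (illustrative):
# self._conn.execute(f"USE {quote_duckdb_identifier(self.credentials.catalog_name)};")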

@zilto
Collaborator Author

zilto commented Sep 30, 2025

Adding this here for the record: sqlite3 in Python has deep-rooted issues with concurrency.

Currently, the sqlite3 DB-API 2.0 attribute 'threadsafety' is hard-coded to 1, meaning "threads may share the module, but not connections". This is not always true, since it depends on the default SQLite threaded mode, selected at compile-time with the SQLITE_THREADSAFE define.

The parameter SQLITE_THREADSAFE can't be changed for Python < 3.11 (reference)
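
The reported level can be checked directly; on Python < 3.11 it is always 1, on 3.11+ it reflects how the bundled SQLite was compiled:

import sqlite3

# DB-API 2.0 thread safety level:
# 1 = threads may share the module, but not connections
# 3 = threads may share the module, connections and cursors (serialized mode)
print(sqlite3.threadsafety)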

@zilto zilto deleted the feat/ducklake-destination branch September 30, 2025 16:48