feat: ducklake destination
#3015
Conversation
rudolfix
left a comment
this looks really good! here's a summary of my suggestions:
- simplify the ducklake credentials class (i.e. remove `__init__`, implement `_conn_str()`)
- load extensions in `borrow_conn`
- we'll need to tweak how connections are opened in the ibis handover (but that's easy)
| return self.database == ":pipeline:" |

| def on_resolved(self) -> None: |
|     # TODO Why don't we support `:memory:` string? |
we support it. you can pass a duckdb instance instead of credentials and the destination factory will use it:
https://dlthub.com/docs/dlt-ecosystem/destinations/duckdb#destination-configuration (those docs will benefit from better section titles)
a `:memory:` database is wiped when the connection is closed. during loading the connection will be opened and closed several times, e.g. to migrate schemas, and at the end all the data would be lost because we close all connections when the loader exits
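For illustration, the pattern from the linked docs looks roughly like this; passing a pre-opened, persistent duckdb connection keeps the data around after the loader closes its own connections (pipeline name and file path are made up):

```python
import duckdb
import dlt

# a persistent database file instead of :memory:, so nothing is wiped on close
db = duckdb.connect("jaffle.duckdb")
pipeline = dlt.pipeline("jaffle_shop", destination=dlt.destinations.duckdb(db))
```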
| # NOTE duckdb extensions are only loaded when using the dlt cursor. They are not |
| # loaded when using the native connection (e.g., when passing it to Ibis) |
there's a mechanism to load extensions at start. it could be made easier for implementers, but right now you can update extensions in `on_resolved` of `DuckLakeCredentials(DuckDbBaseCredentials)` (which you implement below).
some docs: https://dlthub.com/docs/dlt-ecosystem/destinations/duckdb#additional-configuration
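A minimal sketch of that option, assuming `DuckDbBaseCredentials` keeps a list of extensions that `borrow_conn` installs and loads when opening connections (the `extensions` field name is an assumption, not a confirmed API):

```python
class DuckLakeCredentials(DuckDbBaseCredentials):
    def on_resolved(self) -> None:
        # hypothetical field consumed by borrow_conn at connection time
        self.extensions = ["ducklake"]
```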
another option you have is to subclass sql_client. see the base class.
class DuckDbSqlClient(SqlClientBase[duckdb.DuckDBPyConnection], DBTransaction):
    dbapi: ClassVar[DBApi] = duckdb

    def __init__(
        self,
        dataset_name: str,
        staging_dataset_name: str,
        credentials: DuckDbBaseCredentials,
        capabilities: DestinationCapabilitiesContext,
    ) -> None:
        super().__init__(None, dataset_name, staging_dataset_name, capabilities)
        self._conn: duckdb.DuckDBPyConnection = None
        self.credentials = credentials
        # set additional connection options so derived class can change it
        # TODO: move that to methods that can be overridden, include local_config
        self._pragmas = ["enable_checkpoint_on_shutdown"]
        self._global_config: Dict[str, Any] = {
            "TimeZone": "UTC",
            "checkpoint_threshold": "1gb",
        }

    @raise_open_connection_error
    def open_connection(self) -> duckdb.DuckDBPyConnection:
        self._conn = self.credentials.borrow_conn(
            pragmas=self._pragmas,
            global_config=self._global_config,
            local_config={
                "search_path": self.fully_qualified_dataset_name(),
            },
        )
        return self._conn

and inject extensions on init or when the connection is being opened.
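For illustration, a minimal sketch of that subclassing option, loading the extension when the connection opens (the extension handling is illustrative, not the PR's actual code):

```python
class DuckLakeSqlClient(DuckDbSqlClient):
    @raise_open_connection_error
    def open_connection(self) -> duckdb.DuckDBPyConnection:
        conn = super().open_connection()
        # make the ducklake extension available on the native connection as well,
        # e.g. for the ibis handover mentioned above
        conn.execute("INSTALL ducklake;")
        conn.execute("LOAD ducklake;")
        return conn
```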
| self.memory_db = None |

| def _install_extension(duckdb_sql_client: DuckDbSqlClient, extension_name: LiteralString) -> None: |
mhmmm I think the code that adds extensions in borrow_conn will suffice. if not, we can move those utils there?
| class DuckLakeCredentials(DuckDbCredentials): |
|     def __init__( |
|         self, |
|         # TODO how does duckdb resolve the name of the database to the name of the dataset / pipeline |
here's something that I may not fully grasp. but DuckLakeCredentials will create a `:memory:` instance
- to which you attach the `catalog` below
- to which you attach the `storage`
- that gets configured with extensions and settings in `DuckLakeCredentials` (self)
- and this instance of `DuckLakeCredentials` is used to `borrow_conn`

so what should `dataset_name` assume here? the catalog database if it is duckdb? pls see below
For the default case, here's what I'm currently aiming for:

pipeline = dlt.pipeline("jaffle_shop", destination="ducklake")
pipeline.run(...)

- a duckdb instance is created in `:memory:`; we call it the `ducklake_client`
- the `ducklake_client` installs the `ducklake` extension for duckdb (needs to be done once per system)
- the `ducklake_client` uses the `ATTACH` command to load a `catalog` and `storage`
- the `catalog` is a duckdb instance on disk (with extension `.ducklake` instead of `.duckdb` by convention)
- the default `storage` is completely handled by DuckDB / DuckLake

The outcome is:

|- pipeline.py
|- jaffle_shop.ducklake           # catalog file (if duckdb or sqlite)
|- jaffle_shop.ducklake.files/    # storage
   |- main/                       # schema level
      |- customers/               # table level
         |- data.parquet          # data
      |- orders/

Design
- The `DuckLakeCredentials` inherits from `DuckDbCredentials` and the "main" credentials are used to define the `ducklake_client`
- We always use an in-memory DuckDB connection for the `ducklake_client`
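To make the steps above concrete, here's a rough sketch of what the `ducklake_client` would execute, written with plain duckdb (statements and file names follow the example and are illustrative):

```python
import duckdb

conn = duckdb.connect(":memory:")  # the ducklake_client
conn.execute("INSTALL ducklake;")  # once per system
conn.execute("LOAD ducklake;")
# attach catalog + storage; with no DATA_PATH given, DuckLake derives
# jaffle_shop.ducklake.files/ next to the catalog file by convention
conn.execute("ATTACH 'ducklake:jaffle_shop.ducklake' AS jaffle_shop;")
conn.execute("USE jaffle_shop;")
```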
|         # TODO how does duckdb resolve the name of the database to the name of the dataset / pipeline |
|         ducklake_name: str = "ducklake", |
|         *, |
|         catalog_database: Optional[Union[ConnectionStringCredentials, DuckDbCredentials]] = None, |
postgres, mysql, duckdb, motherduck are all ConnectionStringCredentials so maybe that's enough to put here
you can use drivername to distinguish them
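A sketch of that suggestion; `drivername` is the field on dlt's `ConnectionStringCredentials`, while the helper and the category names are hypothetical:

```python
from dlt.common.configuration.specs import ConnectionStringCredentials

def catalog_backend(creds: ConnectionStringCredentials) -> str:
    # hypothetical helper: route the catalog to a backend by driver name
    if creds.drivername in ("duckdb", "motherduck"):
        return "duckdb"
    if creds.drivername in ("postgres", "postgresql"):
        return "postgres"
    if creds.drivername == "mysql":
        return "mysql"
    raise ValueError(f"unsupported catalog driver: {creds.drivername}")
```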
| return caps |

| # TODO support connecting to a snapshot |
that would be amazing but we can do that later. snapshots mean reproducible local environments that you can get with zero-copy
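If it helps later, connecting to a snapshot would presumably lean on DuckLake's ATTACH options; a sketch, assuming a `SNAPSHOT_VERSION` option is available:

```python
import duckdb

conn = duckdb.connect(":memory:")
conn.execute("LOAD ducklake;")
# pin the attached catalog to a specific snapshot (version number is illustrative)
conn.execute("ATTACH 'ducklake:jaffle_shop.ducklake' AS jaffle_snap (SNAPSHOT_VERSION 3);")
```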
| attach_statement = f"ATTACH IF NOT EXISTS 'ducklake:{ducklake_name}.ducklake'" |
| if storage: |
|     # TODO handle storage credentials by creating secrets |
|     attach_statement += f" (DATA_PATH {storage.bucket_url})" |
you should pass storage to create_secret before you attach (after you open the connection)
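Sketching the suggested order with DuckDB's CREATE SECRET syntax; the storage credential attributes are assumptions modeled on dlt's filesystem credentials:

```python
def attach_ducklake(credentials, storage, attach_statement) -> None:
    conn = credentials.borrow_conn()
    # 1. create a secret from the storage credentials so ATTACH can reach the bucket
    conn.execute(f"""
        CREATE OR REPLACE SECRET ducklake_storage (
            TYPE s3,
            KEY_ID '{storage.credentials.aws_access_key_id}',
            SECRET '{storage.credentials.aws_secret_access_key}'
        );
    """)
    # 2. only then attach the catalog with DATA_PATH pointing at that bucket
    conn.execute(attach_statement)
```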
| ) |

| def test_native_duckdb_workflow(tmp_path): |
makes sense to do a few "smoke tests". the next step would be to enable ducklake to run exactly the same tests as duckdb, using e.g. a local duckdb as catalog and the local filesystem as storage.
let's do another iteration of this ticket and then I'll look at this. I was able to do the same with the iceberg destination so I'm pretty sure it will work
| # TODO add connection to a specific snapshot |
| # TODO does it make sense for ducklake to have a staging destination? |
good point, see here: #1692
| return DuckLakeClient |

| def _raw_capabilities(self) -> DestinationCapabilitiesContext: |
note: ducklake will support upsert (MERGE INTO) so we can enable this strategy to see if it works
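As an illustration, enabling it would likely be a one-liner in `_raw_capabilities`; `supported_merge_strategies` is the existing capabilities field, and which other strategies to keep is an assumption:

```python
def _raw_capabilities(self) -> DestinationCapabilitiesContext:
    caps = super()._raw_capabilities()  # sketch: start from the duckdb capabilities
    # advertise upsert once ducklake's MERGE INTO support lands
    caps.supported_merge_strategies = ["delete-insert", "upsert"]
    return caps
```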
…ile names, allows to copy local file context in WithLocalFiles
…ng when connections opened in duckdb, improves error handling if commit tx fails
… (2) point all local files to local_dir (3) allow various urls to configure ducklake name (4) uses parquet as default file format
…open_connection which provides full context)
there were pretty complicated issues with parallel loading, which also depended on the catalog type. I underestimated the effort a little when writing the original ticket... luckily it seems that most of the work is done. below is a description of the changes I've made; also see the commit log.
what is left:
let's touch base on how and when to continue here
…d name in other cases
| url = url._replace(query=None) |
| # we only have control over netloc/path |
| return url.render_as_string(hide_password=True) |
| except Exception: |
Should the exception be more specific here? Could swallowing all exceptions hide potential issues?
@rudolfix you should fix this directly, because I don't know what types of exceptions you expect
burnash
left a comment
@rudolfix please see the comments.
| # NOTE: database must be detached otherwise it is left in inconsistent state |
| # TODO: perhaps move attach/detach to connection pool |
| self._conn.execute(self.attach_statement) |
| self._conn.execute(f"USE {self.credentials.catalog_name};") |
Should this use escape_identifier() instead of passing a bare catalog_name?
You're right that it's better to escape the identifier. We should also improve assertions to check that catalog_name is a valid DuckDB SQL identifier here, i.e., that there's no funny character in catalog_name.
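A sketch of that fix; `escape_identifier` is the callable dlt exposes on the capabilities context, and the surrounding attributes follow the quoted snippet:

```python
# quote the catalog name instead of interpolating it bare
catalog = self.capabilities.escape_identifier(self.credentials.catalog_name)
self._conn.execute(self.attach_statement)
self._conn.execute(f"USE {catalog};")
```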
Adding here for legacy:
The parameter
Related Issues
Core changes
- `DuckDbBaseCredentials`: it is decoupled from `ConnectionStringCredentials` and used to configure connection options, pragmas, extensions and to host the connection pool. Reason: ducklake does not need `ConnectionStringCredentials` at the top level, it only needs `catalog_name`. Overall this looks much nicer
- the connection pool lives in `DuckDbBaseCredentials`. Reason: this is the only singleton persisted during the whole load step. Correct implementation: allow for the connection pool as a top-level entity. For now: good enough
- there is a mechanism in `dlt` that points all relative paths to data (i.e. duckdb database, local filesystem data). I cleaned it up and made it easy to propagate redirects to embedded configurations (i.e. now both catalog and storage are redirected). Still I'm not happy with how it is implemented... I'll give it another run in the future