1272 Support ClickHouse GCS S3 compatibility mode in filesystem destination #1423
Merged: sh-rp merged 13 commits into devel from 1272-make-filesystem-destination-work-with-gcs-in-s3-compatibility-mode on Jun 3, 2024.

Changes from 12 commits (13 commits total):
- cb80c6b Add HMAC credentials and update Clickhouse configuration
- cabe61c Revert "Add HMAC credentials and update Clickhouse configuration"
- f24eb1d Refactor error handling for storage authentication in Clickhouse
- c30f23c Revert "Refactor error handling for storage authentication in Clickho…
- 217f6f7 Remove GCS ClickHouse buckets in CI until named destinations are supp…
- 2dcd848 Merge remote-tracking branch 'origin/devel' into 1272-make-filesystem…
- c9c1394 Add GCS S3 compatibility test, remove GCP credentials from Clickhouse
- 7a43618 Refactor ClickHouse test code for better readability
- abd87f8 Refactor endpoint handling and update GCS bucket configuration
- 4c3186b Refactor test for clickhouse gcs_s3 compatibility
- 743c1a2 Update ClickHouse docs and tests for S3-compatible staging
- 55e33d7 Merge branch 'devel' into 1272-make-filesystem-destination-work-with-…
- 9a58f8f Update ClickHouse documentation on staging areas
@@ -115,12 +115,14 @@ destination.

The `clickhouse` destination has a few specific deviations from the default sql destinations:

-1. `Clickhouse` has an experimental `object` datatype, but we have found it to be a bit unpredictable, so the dlt clickhouse destination will load the complex datatype to a `text` column. If you need this feature, get in touch with our Slack community, and we will consider adding it.
+1. `Clickhouse` has an experimental `object` datatype, but we have found it to be a bit unpredictable, so the dlt clickhouse destination will load the complex datatype to a `text` column. If you need
+   this feature, get in touch with our Slack community, and we will consider adding it.
2. `Clickhouse` does not support the `time` datatype. Time will be loaded to a `text` column.
3. `Clickhouse` does not support the `binary` datatype. Binary will be loaded to a `text` column. When loading from `jsonl`, this will be a base64 string; when loading from parquet, this will be the `binary` object converted to `text`.
4. `Clickhouse` accepts adding non-null columns to a populated table.
-5. `Clickhouse` can produce rounding errors under certain conditions when using the float / double datatype. Make sure to use decimal if you cannot afford rounding errors. For example, loading the value 12.7001 into a double column with the loader file format `jsonl` will predictably produce a rounding error.
+5. `Clickhouse` can produce rounding errors under certain conditions when using the float / double datatype. Make sure to use decimal if you cannot afford rounding errors. For example, loading the
+   value 12.7001 into a double column with the loader file format `jsonl` will predictably produce a rounding error.
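The rounding caveat in item 5 is ordinary IEEE-754 float behavior rather than anything ClickHouse-specific; a minimal stdlib sketch showing why a decimal column avoids it:

```python
from decimal import Decimal

# 12.7001 has no exact binary (double) representation, so the float literal
# already carries a tiny rounding error before any loader sees it.
as_double = Decimal(12.7001)     # the value a double column actually stores
as_decimal = Decimal("12.7001")  # the exact value a decimal column keeps

assert as_double != as_decimal                       # the double is slightly off
assert abs(as_double - as_decimal) < Decimal("1e-12")  # ...by a sub-picounit epsilon
```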
## Supported column hints
@@ -173,51 +175,42 @@ pipeline = dlt.pipeline(
)
```

-### Using Google Cloud Storage as a Staging Area
+### Using S3-Compatible Storage as a Staging Area

-dlt supports using Google Cloud Storage (GCS) as a staging area when loading data into ClickHouse. This is handled automatically by ClickHouse's [GCS table function](https://clickhouse.com/docs/en/sql-reference/table-functions/gcs), which dlt uses under the hood.
+dlt supports using S3-compatible storage services, including Google Cloud Storage (GCS), as a staging area when loading data into ClickHouse. This is handled automatically by ClickHouse's [GCS table function](https://clickhouse.com/docs/en/sql-reference/table-functions/gcs), which dlt uses under the hood.

-The ClickHouse GCS table function only supports authentication using Hash-based Message Authentication Code (HMAC) keys. To enable this, GCS provides an S3 compatibility mode that emulates the Amazon S3 API. ClickHouse takes advantage of this to allow accessing GCS buckets via its S3 integration.
+The ClickHouse GCS table function only supports authentication using Hash-based Message Authentication Code (HMAC) keys, which are compatible with the Amazon S3 API. To enable this, GCS provides an S3 compatibility mode that emulates the S3 API, allowing ClickHouse to access GCS buckets via its S3 integration.

+For detailed instructions on setting up S3-compatible storage with dlt, including AWS S3, MinIO, and Cloudflare R2, refer to the [dlt documentation on filesystem destinations](https://dlthub.com/docs/dlt-ecosystem/destinations/filesystem#using-s3-compatible-storage).

To set up GCS staging with HMAC authentication in dlt:

1. Create HMAC keys for your GCS service account by following the [Google Cloud guide](https://cloud.google.com/storage/docs/authentication/managing-hmackeys#create).

-2. Configure the HMAC keys as well as the `client_email`, `project_id` and `private_key` for your service account in your dlt project's ClickHouse destination settings in `config.toml`:
+2. Configure the HMAC keys (`aws_access_key_id` and `aws_secret_access_key`) in your dlt project's ClickHouse destination settings in `config.toml`, similar to how you would configure AWS S3 credentials:
```toml
[destination.filesystem]
-bucket_url = "gs://dlt-ci"
+bucket_url = "s3://my_awesome_bucket"

[destination.filesystem.credentials]
-project_id = "a-cool-project"
-client_email = "[email protected]"
-private_key = "-----BEGIN PRIVATE KEY-----\nMIIEvQIBADANBgkaslkdjflasjnkdcopauihj...wEiEx7y+mx\nNffxQBqVVej2n/D93xY99pM=\n-----END PRIVATE KEY-----\n"
+aws_access_key_id = "JFJ$$*f2058024835jFffsadf"
+aws_secret_access_key = "DFJdwslf2hf57)%$02jaflsedjfasoi"
+project_id = "my-awesome-project"
+endpoint_url = "https://storage.googleapis.com"

[destination.clickhouse.credentials]
database = "dlt"
username = "dlt"
password = "Dlt*12345789234567"
host = "localhost"
port = 9440
secure = 1
-gcp_access_key_id = "JFJ$$*f2058024835jFffsadf"
-gcp_secret_access_key = "DFJdwslf2hf57)%$02jaflsedjfasoi"
```
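With this configuration, ClickHouse's S3 integration addresses GCS objects through the configured `endpoint_url` exactly as it would any S3 endpoint. A small illustration of the path-style object URL involved (the helper name and key are hypothetical, not part of dlt's API):

```python
def object_url(endpoint_url: str, bucket_url: str, key: str) -> str:
    """Build the path-style URL an S3-compatible client would request for an object."""
    bucket = bucket_url.removeprefix("s3://")  # drop the scheme, keep the bucket name
    return f"{endpoint_url.rstrip('/')}/{bucket}/{key}"

print(object_url("https://storage.googleapis.com", "s3://my_awesome_bucket", "data/file.parquet"))
# → https://storage.googleapis.com/my_awesome_bucket/data/file.parquet
```

The same HMAC key pair then signs requests against that URL, which is why no GCP service-account fields are needed any more.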
-Note: In addition to the HMAC keys (`gcp_access_key_id` and `gcp_secret_access_key`), you now need to provide the `client_email`, `project_id` and `private_key` for your service account under `[destination.filesystem.credentials]`. This is because the GCS staging support is currently implemented as a temporary workaround and is still unoptimized.
-
-dlt will pass these credentials to ClickHouse, which will handle the authentication and GCS access.
-
-There is active work in progress to simplify and improve the GCS staging setup for the ClickHouse dlt destination. Proper GCS staging support is being tracked in these GitHub issues:
-
-- [Make filesystem destination work with gcs in s3 compatibility mode](https://github.com/dlt-hub/dlt/issues/1272)
-- [GCS staging area support](https://github.com/dlt-hub/dlt/issues/1181)
+:::caution
+When configuring the `bucket_url` for S3-compatible storage services such as Google Cloud Storage (GCS) with ClickHouse in dlt, make sure the URL uses the `s3://` scheme instead of `gs://`. The ClickHouse GCS table function requires HMAC credentials, which are compatible with the S3 API, and the `s3://` scheme lets those credentials integrate properly with dlt's staging mechanisms for ClickHouse.
+:::
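The scheme rewrite described in the caution is purely mechanical (the test added by this PR does it with a bare `str.replace`); a hypothetical standalone helper:

```python
def to_s3_compatible(bucket_url: str) -> str:
    """Rewrite a gs:// bucket URL to the s3:// scheme that ClickHouse's S3 integration expects."""
    if bucket_url.startswith("gs://"):
        return "s3://" + bucket_url[len("gs://"):]
    return bucket_url  # already s3:// (or another S3-compatible scheme)

print(to_s3_compatible("gs://dlt-ci"))  # → s3://dlt-ci
```

Only the scheme changes; the bucket name and any path are preserved as-is.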
### dbt support
tests/load/clickhouse/test_clickhouse_gcs_s3_compatibility.py (28 additions, 0 deletions)

@@ -0,0 +1,28 @@
```python
from typing import Generator, Dict

import pytest

import dlt
from dlt.destinations import filesystem
from tests.load.utils import GCS_BUCKET
from tests.pipeline.utils import assert_load_info


@pytest.mark.essential
def test_clickhouse_gcs_s3_compatibility() -> None:
    @dlt.resource
    def dummy_data() -> Generator[Dict[str, int], None, None]:
        yield {"field1": 1, "field2": 2}

    gcp_bucket = filesystem(
        GCS_BUCKET.replace("gs://", "s3://"), destination_name="filesystem_s3_gcs_comp"
    )

    pipe = dlt.pipeline(
        pipeline_name="gcs_s3_compatibility",
        destination="clickhouse",
        staging=gcp_bucket,
        full_refresh=True,
    )
    pack = pipe.run([dummy_data])
    assert_load_info(pack)
```