-
Notifications
You must be signed in to change notification settings - Fork 414
Description
dlt version
1.14.1 (but the bug is present in the latest devel branch code as well)
Describe the problem
When using bigquery_adapter() with the same column for both partition and cluster parameters, the partition hint is silently overwritten by the cluster hint. This results in tables being created with clustering but no partitioning, even though partitioning was explicitly requested.
This is because column_hints is a dictionary keyed by column name, and the cluster hint assignment overwrites the partition hint assignment.
Expected behavior
When the same column is specified for both partition and cluster:
Option 1: dlt should raise a clear ValueError explaining that a column cannot be both the partition and cluster column, since BigQuery's partition column already provides ordering benefits.
Option 2: dlt should merge the hints, resulting in column_hints[column] = {"name": column, PARTITION_HINT: True, CLUSTER_HINT: True} (this may not be useful since partitioning on a column already provides similar benefits to clustering on it).
Steps to reproduce
import dlt
from dlt.destinations.impl.bigquery.bigquery_adapter import bigquery_adapter, PARTITION_HINT, CLUSTER_HINT
@dlt.resource(columns=[{"name": "timestamp_col", "data_type": "timestamp"}])
def test_data():
yield {"timestamp_col": "2024-01-01T00:00:00Z"}
# Apply adapter with same column for partition and cluster
bigquery_adapter(
test_data,
partition="timestamp_col",
cluster=["timestamp_col", "other_col"]
)
# Check the column hints - PARTITION_HINT is missing!
print(test_data.columns)
# Output: {'timestamp_col': {'name': 'timestamp_col', 'x-bigquery-cluster': True}, ...}
# Expected: {'timestamp_col': {'name': 'timestamp_col', 'x-bigquery-partition': True, 'x-bigquery-cluster': True}, ...}
# Verify partition hint is lost
assert PARTITION_HINT in test_data.columns.get("timestamp_col", {}), "PARTITION_HINT was overwritten!"
# AssertionError: PARTITION_HINT was overwritten!Minimal reproducer script:
from dlt.destinations.impl.bigquery.bigquery_adapter import (
bigquery_adapter,
PARTITION_HINT,
CLUSTER_HINT
)
import dlt
@dlt.resource
def data():
yield {"ts": "2024-01-01"}
bigquery_adapter(data, partition="ts", cluster=["ts"])
# This will fail - proving the bug
assert PARTITION_HINT in data.columns["ts"], f"Bug! Hints: {data.columns['ts']}"Operating system
macOS
Runtime environment
Local
Python version
3.11
dlt data source
sql_database (PostgreSQL via PyArrow backend)
dlt destination
Google BigQuery
Other deployment details
No response
Additional information
Suggested fix
# Option 1: Raise error
if partition and cluster:
partition_col = partition if isinstance(partition, str) else partition.column_name
if partition_col in cluster:
raise ValueError(
f"Column '{partition_col}' cannot be both partition and cluster column. "
"The partition column already provides query optimization benefits similar to clustering."
)
# Option 2: Merge hints (update instead of replace)
for column_name in cluster:
if column_name in column_hints:
column_hints[column_name][CLUSTER_HINT] = True
else:
column_hints[column_name] = {"name": column_name, CLUSTER_HINT: True}Metadata
Metadata
Assignees
Labels
Type
Projects
Status