Skip to content

bigquery_adapter() silently drops partition hint when same column is used for both partition and cluster #3471

@abuchmueller

Description

@abuchmueller

dlt version

1.14.1 (but the bug is present in the latest devel branch code as well)

Describe the problem

When using bigquery_adapter() with the same column for both partition and cluster parameters, the partition hint is silently overwritten by the cluster hint. This results in tables being created with clustering but no partitioning, even though partitioning was explicitly requested.
This is because column_hints is a dictionary keyed by column name, and the cluster hint assignment overwrites the partition hint assignment.

Expected behavior

When the same column is specified for both partition and cluster:

Option 1: dlt should raise a clear ValueError explaining that a column cannot be both the partition and cluster column, since BigQuery's partition column already provides ordering benefits.

Option 2: dlt should merge the hints, resulting in column_hints[column] = {"name": column, PARTITION_HINT: True, CLUSTER_HINT: True} (this may not be useful since partitioning on a column already provides similar benefits to clustering on it).

Steps to reproduce

import dlt
from dlt.destinations.impl.bigquery.bigquery_adapter import bigquery_adapter, PARTITION_HINT, CLUSTER_HINT

@dlt.resource(columns=[{"name": "timestamp_col", "data_type": "timestamp"}])
def test_data():
    yield {"timestamp_col": "2024-01-01T00:00:00Z"}

# Apply adapter with same column for partition and cluster
bigquery_adapter(
    test_data,
    partition="timestamp_col",
    cluster=["timestamp_col", "other_col"]
)

# Check the column hints - PARTITION_HINT is missing!
print(test_data.columns)
# Output: {'timestamp_col': {'name': 'timestamp_col', 'x-bigquery-cluster': True}, ...}
# Expected: {'timestamp_col': {'name': 'timestamp_col', 'x-bigquery-partition': True, 'x-bigquery-cluster': True}, ...}

# Verify partition hint is lost
assert PARTITION_HINT in test_data.columns.get("timestamp_col", {}), "PARTITION_HINT was overwritten!"
# AssertionError: PARTITION_HINT was overwritten!

Minimal reproducer script:

from dlt.destinations.impl.bigquery.bigquery_adapter import (
    bigquery_adapter, 
    PARTITION_HINT, 
    CLUSTER_HINT
)
import dlt

@dlt.resource
def data():
    yield {"ts": "2024-01-01"}

bigquery_adapter(data, partition="ts", cluster=["ts"])

# This will fail - proving the bug
assert PARTITION_HINT in data.columns["ts"], f"Bug! Hints: {data.columns['ts']}"

Operating system

macOS

Runtime environment

Local

Python version

3.11

dlt data source

sql_database (PostgreSQL via PyArrow backend)

dlt destination

Google BigQuery

Other deployment details

No response

Additional information

Suggested fix

# Option 1: Raise error
if partition and cluster:
    partition_col = partition if isinstance(partition, str) else partition.column_name
    if partition_col in cluster:
        raise ValueError(
            f"Column '{partition_col}' cannot be both partition and cluster column. "
            "The partition column already provides query optimization benefits similar to clustering."
        )

# Option 2: Merge hints (update instead of replace)
for column_name in cluster:
    if column_name in column_hints:
        column_hints[column_name][CLUSTER_HINT] = True
    else:
        column_hints[column_name] = {"name": column_name, CLUSTER_HINT: True}

Metadata

Metadata

Assignees

Labels

bugSomething isn't workingdestinationIssue with a specific destinationgood first issueGood for newcomers

Type

No type

Projects

Status

In Progress

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions