Description
dlt version
1.20.0
Describe the problem
When schema_contract: {"columns": "freeze"} is enabled, dlt raises a contract violation error if a column exists in both the imported schema and the source data but its column properties differ (specifically timezone).
The error message is misleading:
Contract on columns with contract_mode=freeze is violated.
Can't add table column created_on to table int_aga_key_bp because columns are frozen.
This implies the column is "new" and being added, whereas it actually exists but has a property mismatch.
Expected behavior
The error should state precisely where the mismatch is. It should also be reported as a data_type violation, not as a new-column violation.
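To illustrate the distinction being asked for, here is a minimal sketch in plain Python. It is not dlt's implementation: `classify_column_change` and the dictionary shapes are hypothetical and only mimic the column hints used in the reproduction below.

```python
from typing import Any, Dict, List, Tuple


def classify_column_change(
    existing_columns: Dict[str, Dict[str, Any]],
    column_name: str,
    incoming: Dict[str, Any],
) -> Tuple[str, List[str]]:
    """Hypothetical helper: tell a genuinely new column apart from an
    existing column whose properties differ from the frozen schema."""
    existing = existing_columns.get(column_name)
    if existing is None:
        # The column is not in the frozen schema at all.
        return "new_column", []
    # Compare only the properties present on both sides.
    details = [
        f"{prop}: {existing[prop]!r} -> {incoming[prop]!r}"
        for prop in sorted(incoming)
        if prop in existing and existing[prop] != incoming[prop]
    ]
    return ("property_mismatch", details) if details else ("unchanged", [])


# Frozen schema from the first run vs. hints declared in the second run.
frozen = {"created_at": {"data_type": "timestamp", "timezone": False}}
kind, details = classify_column_change(
    frozen, "created_at", {"data_type": "timestamp", "timezone": True}
)
print(kind, details)  # property_mismatch ['timezone: False -> True']
```

With a check along these lines, the message could name the offending property (`timezone`) instead of claiming the column is being added.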
Steps to reproduce
This bash script demonstrates the issue:
# Clean up from previous runs
rm -rf /tmp/test_dlt_timezone
rm -rf ~/.dlt/pipelines/timezone_contract_test
# Create the first Python script (establishes schema with timezone=False)
cat > /tmp/test_timezone_run1.py << 'EOF'
# /// script
# requires-python = ">=3.10"
# dependencies = [
# "dlt[duckdb]==1.19.1",
# ]
# ///
import dlt
from datetime import datetime


@dlt.resource(
    name="my_table",
    write_disposition="append",
    columns={
        "created_at": {
            "data_type": "timestamp",
            "timezone": False,  # Explicitly False
        }
    },
)
def my_resource():
    yield {"id": 1, "created_at": datetime(2024, 1, 1, 12, 0, 0)}


pipeline = dlt.pipeline(
    pipeline_name="timezone_contract_test",
    destination="duckdb",
    dataset_name="test_data",
    pipelines_dir="/tmp/test_dlt_timezone",
)

# Set contract mode to freeze after first run establishes the schema
pipeline.run(
    my_resource(),
    schema_contract={"columns": "freeze", "data_type": "freeze"},
)
print("First run completed - schema established with timezone=False")

# Now update the schema to freeze mode
schema = pipeline.default_schema
schema.tables["my_table"]["schema_contract"] = {"columns": "freeze"}
pipeline.schemas.save_schema(schema)
print("Schema contract set to freeze mode")
EOF
# Create the second Python script (tries to use timezone=True, should fail)
cat > /tmp/test_timezone_run2.py << 'EOF'
# /// script
# requires-python = ">=3.10"
# dependencies = [
# "dlt[duckdb]==1.19.1",
# ]
# ///
import dlt
from datetime import datetime, timezone


@dlt.resource(
    name="my_table",
    write_disposition="append",
    columns={
        "created_at": {
            "data_type": "timestamp",
            "timezone": True,  # Now True - this should trigger the mismatch
        }
    },
)
def my_resource():
    yield {"id": 2, "created_at": datetime(2024, 1, 2, 12, 0, 0, tzinfo=timezone.utc)}


pipeline = dlt.pipeline(
    pipeline_name="timezone_contract_test",
    destination="duckdb",
    dataset_name="test_data",
    pipelines_dir="/tmp/test_dlt_timezone",
)

# This run should fail with contract violation due to timezone mismatch
pipeline.run(
    my_resource(),
    schema_contract={"columns": "freeze", "data_type": "freeze"},
)
EOF
echo "=== First run (establishes schema with timezone=False, then sets freeze) ==="
uv run /tmp/test_timezone_run1.py
echo ""
echo "=== Second run (tries timezone=True, should trigger contract violation) ==="
uv run /tmp/test_timezone_run2.py
echo ""
echo "=== Notice: Error says 'Can't add table column' even though the column exists! ==="
echo "=== The error should say something like 'Column property mismatch: timezone' ==="
This gives output similar to:
=== First run (establishes schema with timezone=False, then sets freeze) ===
First run completed - schema established with timezone=False
Schema contract set to freeze mode
=== Second run (tries timezone=True, should trigger contract violation) ===
dlt.pipeline.exceptions.PipelineStepFailed: Pipeline execution failed at `step=extract` when processing package with `load_id=1765891903.8693728` with exception:
<class 'dlt.common.schema.exceptions.DataValidationError'>
In schema `timezone_contract_test`: In Table: `my_table` Column: `created_at` . Contract on `columns` with `contract_mode=freeze` is violated. Can't add table column `created_at` to table `my_table` because `columns` are frozen. Offending data item: _dlt_id: None
The misleading error message says the column "can't be added" when it already exists - the real issue is the timezone property mismatch between False (in the frozen schema) and True (in the new resource definition).
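For reference, the two yielded values also differ at the Python level: the first run yields a naive datetime and the second an aware one. A small standard-library check makes that visible; `is_tz_aware` is a hypothetical helper name, not part of dlt.

```python
from datetime import datetime, timezone


def is_tz_aware(value: datetime) -> bool:
    """A datetime is aware when tzinfo is set and yields a UTC offset."""
    return value.tzinfo is not None and value.utcoffset() is not None


# Same values as in the two reproduction scripts.
run1_value = datetime(2024, 1, 1, 12, 0, 0)
run2_value = datetime(2024, 1, 2, 12, 0, 0, tzinfo=timezone.utc)

print(is_tz_aware(run1_value), is_tz_aware(run2_value))  # False True
```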
Operating system
Windows, macOS
Runtime environment
Local
Python version
3.12
dlt data source
a (customized) sql_database source
dlt destination
Filesystem & buckets
Other deployment details
Additional information