schema_contract freeze on columns reports schema violation when column is not new #3490

@timvink

Description

dlt version

1.20.0

Describe the problem

When schema_contract={"columns": "freeze"} is enabled, dlt raises a contract violation error if a column exists in both the imported schema and the source data but its column properties differ (specifically timezone).

The error message is misleading:

Contract on columns with contract_mode=freeze is violated.
Can't add table column created_on to table int_aga_key_bp because columns are frozen.

This implies the column is "new" and being added, whereas it actually exists but has a property mismatch.

Expected behavior

The error should state exactly where the mismatch is: which column property differs and what the two values are. It should also be reported as a data_type error, not as an attempt to add a new column.
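To make the expected behavior concrete, here is a minimal sketch of the kind of check I have in mind. It is plain Python over dictionaries and does not use any dlt internals; the function name `explain_frozen_violation` and the message wording are hypothetical, not dlt's API. It only separates the "column is new" case from the "column exists but a property differs" case.

```python
from typing import Any, Dict, List

Column = Dict[str, Any]


def explain_frozen_violation(
    table: str, existing_columns: Dict[str, Column], incoming: Column
) -> List[str]:
    """Hypothetical helper: explain why an incoming column conflicts with a frozen table."""
    name = incoming["name"]
    existing = existing_columns.get(name)
    if existing is None:
        # Only in this case is the column really new.
        return [f"Can't add column `{name}` to table `{table}` because columns are frozen."]

    messages = []
    # The column exists: report every property whose value differs.
    for prop in sorted(set(existing) | set(incoming)):
        old, new = existing.get(prop), incoming.get(prop)
        if old != new:
            messages.append(
                f"Column `{name}` in table `{table}`: property `{prop}` is {old!r} "
                f"in the frozen schema but {new!r} in the incoming definition."
            )
    return messages


# Column definitions mirroring the reproduction in this issue.
frozen_columns = {
    "created_at": {"name": "created_at", "data_type": "timestamp", "timezone": False}
}
incoming_column = {"name": "created_at", "data_type": "timestamp", "timezone": True}

for line in explain_frozen_violation("my_table", frozen_columns, incoming_column):
    print(line)
```

With the two definitions above, the helper reports a single mismatch on `timezone` instead of claiming that `created_at` is being added.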

Steps to reproduce

This bash script demonstrates the issue:

# Clean up from previous runs
rm -rf /tmp/test_dlt_timezone
rm -rf ~/.dlt/pipelines/timezone_contract_test

# Create the first Python script (establishes schema with timezone=False)
cat > /tmp/test_timezone_run1.py << 'EOF'
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "dlt[duckdb]==1.19.1",
# ]
# ///
import dlt
from datetime import datetime

@dlt.resource(
    name="my_table",
    write_disposition="append",
    columns={
        "created_at": {
            "data_type": "timestamp",
            "timezone": False  # Explicitly False
        }
    }
)
def my_resource():
    yield {"id": 1, "created_at": datetime(2024, 1, 1, 12, 0, 0)}

pipeline = dlt.pipeline(
    pipeline_name="timezone_contract_test",
    destination="duckdb",
    dataset_name="test_data",
    pipelines_dir="/tmp/test_dlt_timezone",
)

# First run establishes the schema; the freeze contract is passed on this run as well
pipeline.run(my_resource(),
    schema_contract={"columns": "freeze", "data_type": "freeze"},
)
print("First run completed - schema established with timezone=False")

# Now update the schema to freeze mode
schema = pipeline.default_schema
schema.tables["my_table"]["schema_contract"] = {"columns": "freeze"}
pipeline.schemas.save_schema(schema)
print("Schema contract set to freeze mode")
EOF

# Create the second Python script (tries to use timezone=True, should fail)
cat > /tmp/test_timezone_run2.py << 'EOF'
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "dlt[duckdb]==1.19.1",
# ]
# ///
import dlt
from datetime import datetime, timezone

@dlt.resource(
    name="my_table",
    write_disposition="append",
    columns={
        "created_at": {
            "data_type": "timestamp",
            "timezone": True  # Now True - this should trigger the mismatch
        }
    }
)
def my_resource():
    yield {"id": 2, "created_at": datetime(2024, 1, 2, 12, 0, 0, tzinfo=timezone.utc)}

pipeline = dlt.pipeline(
    pipeline_name="timezone_contract_test",
    destination="duckdb",
    dataset_name="test_data",
    pipelines_dir="/tmp/test_dlt_timezone"
)

# This run should fail with contract violation due to timezone mismatch
pipeline.run(my_resource(),
    schema_contract={"columns": "freeze", "data_type": "freeze"},
)
EOF

echo "=== First run (establishes schema with timezone=False, then sets freeze) ==="
uv run /tmp/test_timezone_run1.py

echo ""
echo "=== Second run (tries timezone=True, should trigger contract violation) ==="
uv run /tmp/test_timezone_run2.py

echo ""
echo "=== Notice: Error says 'Can't add table column' even though the column exists! ==="
echo "=== The error should say something like 'Column property mismatch: timezone' ==="

This gives output similar to:

=== First run (establishes schema with timezone=False, then sets freeze) ===
First run completed - schema established with timezone=False
Schema contract set to freeze mode

=== Second run (tries timezone=True, should trigger contract violation) ===
dlt.pipeline.exceptions.PipelineStepFailed: Pipeline execution failed at `step=extract` when processing package with `load_id=1765891903.8693728` with exception:

<class 'dlt.common.schema.exceptions.DataValidationError'>
In schema `timezone_contract_test`: In Table: `my_table` Column: `created_at` . Contract on `columns` with `contract_mode=freeze` is violated. Can't add table column `created_at` to table `my_table` because `columns` are frozen. Offending data item: _dlt_id: None

The error message is misleading: it says the column "can't be added" even though the column already exists. The real issue is the mismatch in the timezone property, which is False in the frozen schema and True in the new resource definition.

Operating system

Windows, macOS

Runtime environment

Local

Python version

3.12

dlt data source

a (customized) sql_database source

dlt destination

Filesystem & buckets
