Skip to content

feat: Add column description support for BigQuery adapter #2179

@K-Oxon

Description

@K-Oxon

Feature description

I'd like to suggest adding support for column descriptions in the BigQuery adapter. This would allow users to define and manage column-level metadata when loading data into BigQuery tables. This feature would enhance the data documentation and governance capabilities within the dlt BigQuery integration.

Are you a dlt user?

Yes, I'm already a dlt user.

Use case

When loading data into BigQuery, having clear and accurate column descriptions is critical for data governance and documentation purposes. Currently, DLT doesn't provide a way to define column descriptions during data loading, making it difficult to maintain metadata and properly document data structures.

Problem solved:

  • Data teams must manually add column descriptions after data is loaded.
  • Risk of inconsistent or missing column documentation
  • Additional overhead of maintaining separate data dictionaries
  • Difficulty in tracking and versioning changes to column metadata

Proposed solution

Implement column description support in the BigQuery adapter by

Extending the BigQuery adapter interface to accept column descriptions during schema definition
Modify dlt/destinations/impl/bigquery/bigquery_adapter.py and dlt/destinations/impl/bigquery/bigquery.py to add column description parameters:

class BigQueryClient(SqlJobClientWithStagingDataset, SupportsStagingDestination):
    ...
    def _get_column_def_sql(self, column: TColumnSchema, table: PreparedTableSchema = None) -> str:
        column_def_sql = super()._get_column_def_sql(column, table)
        if column.get(ROUND_HALF_EVEN_HINT, False):
            column_def_sql += " OPTIONS (rounding_mode='ROUND_HALF_EVEN')"
        if column.get(ROUND_HALF_AWAY_FROM_ZERO_HINT, False):
            column_def_sql += " OPTIONS (rounding_mode='ROUND_HALF_AWAY_FROM_ZERO')"
        
        # adding this
        if "description" in column:
            column_def_sql += f" OPTIONS (description='{column['description']}')"
        
        return column_def_sql
    ...
def bigquery_adapter(
    data: Any,
    partition: TColumnNames = None,
    cluster: TColumnNames = None,
    round_half_away_from_zero: TColumnNames = None,
    round_half_even: TColumnNames = None,
    table_description: Optional[str] = None,
    table_expiration_datetime: Optional[str] = None,
    insert_api: Optional[Literal["streaming", "default"]] = None,
    autodetect_schema: Optional[bool] = None,
    partition_expiration_days: Optional[int] = None,
    column_descriptions: Optional[Dict[str, str]] = None,  # Added parameter
) -> DltResource:
    ...
    # adding column description
    if column_descriptions:
        for column_name, description in column_descriptions.items():
            if column_name in column_hints:
                column_hints[column_name]["description"] = description
            else:
                column_hints[column_name] = {"name": column_name, "description": description}
    
    if partition:
        ...

The implementation would match BigQuery's native support for column descriptions, while maintaining dlt's current functionality and performance.
An example usage could look like this

data = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]
column_descriptions = {"id": "User ID", "name": "User Name"}

pipeline = dlt.pipeline(
    pipeline_name="description_test",
    destination="bigquery",
    dataset_name="description_test",
    export_schema_path="dlt_schemas/export",
)
load_info = pipeline.run(
    bigquery_adapter(data, column_descriptions=column_descriptions),
    table_name="users",
    write_disposition="replace",
    refresh="drop_data",
)

print(load_info)

Related issues

No related issues found.

Metadata

Metadata

Assignees

Labels

Type

No type

Projects

Status

In Progress

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions