-
Notifications
You must be signed in to change notification settings - Fork 414
Description
Feature description
I'd like to suggest adding support for column descriptions in the BigQuery adapter. This would allow users to define and manage column-level metadata when loading data into BigQuery tables. This feature would enhance the data documentation and governance capabilities within the dlt BigQuery integration.
Are you a dlt user?
Yes, I'm already a dlt user.
Use case
When loading data into BigQuery, having clear and accurate column descriptions is critical for data governance and documentation purposes. Currently, DLT doesn't provide a way to define column descriptions during data loading, making it difficult to maintain metadata and properly document data structures.
Problem solved:
- Data teams must manually add column descriptions after data is loaded.
- Risk of inconsistent or missing column documentation
- Additional overhead of maintaining separate data dictionaries
- Difficulty in tracking and versioning changes to column metadata
Proposed solution
Implement column description support in the BigQuery adapter by
Extending the BigQuery adapter interface to accept column descriptions during schema definition
Modify dlt/destinations/impl/bigquery/bigquery_adapter.py and dlt/destinations/impl/bigquery/bigquery.py to add column description parameters:
class BigQueryClient(SqlJobClientWithStagingDataset, SupportsStagingDestination):
...
def _get_column_def_sql(self, column: TColumnSchema, table: PreparedTableSchema = None) -> str:
column_def_sql = super()._get_column_def_sql(column, table)
if column.get(ROUND_HALF_EVEN_HINT, False):
column_def_sql += " OPTIONS (rounding_mode='ROUND_HALF_EVEN')"
if column.get(ROUND_HALF_AWAY_FROM_ZERO_HINT, False):
column_def_sql += " OPTIONS (rounding_mode='ROUND_HALF_AWAY_FROM_ZERO')"
# adding this
if "description" in column:
column_def_sql += f" OPTIONS (description='{column['description']}')"
return column_def_sql
...def bigquery_adapter(
data: Any,
partition: TColumnNames = None,
cluster: TColumnNames = None,
round_half_away_from_zero: TColumnNames = None,
round_half_even: TColumnNames = None,
table_description: Optional[str] = None,
table_expiration_datetime: Optional[str] = None,
insert_api: Optional[Literal["streaming", "default"]] = None,
autodetect_schema: Optional[bool] = None,
partition_expiration_days: Optional[int] = None,
column_descriptions: Optional[Dict[str, str]] = None, # Added parameter
) -> DltResource:
...
# adding column description
if column_descriptions:
for column_name, description in column_descriptions.items():
if column_name in column_hints:
column_hints[column_name]["description"] = description
else:
column_hints[column_name] = {"name": column_name, "description": description}
if partition:
...The implementation would match BigQuery's native support for column descriptions, while maintaining dlt's current functionality and performance.
An example usage could look like this
data = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]
column_descriptions = {"id": "User ID", "name": "User Name"}
pipeline = dlt.pipeline(
pipeline_name="description_test",
destination="bigquery",
dataset_name="description_test",
export_schema_path="dlt_schemas/export",
)
load_info = pipeline.run(
bigquery_adapter(data, column_descriptions=column_descriptions),
table_name="users",
write_disposition="replace",
refresh="drop_data",
)
print(load_info)Related issues
No related issues found.
Metadata
Metadata
Assignees
Labels
Type
Projects
Status