Prevent users from setting the same name for final and staging dataset #3047

@sh-rp

Description

TLDR
We use staging datasets for transactional safety with the merge write_disposition as well as some variants of the replace write_disposition. The default behavior is to create a second dataset called "<dataset_name>_staging". Users can change this name, which can lead to a setup where the final and staging datasets have the same name. We should prevent this, or at least print a big fat warning if users try to do this, because the setup commands that should only truncate the staging dataset will then truncate the data in the final dataset.

ToDo

  • Learn about the staging dataset: https://dlthub.com/docs/dlt-ecosystem/staging#staging-dataset
  • Add a new method to the WithStagingDataset class, def create_dataset_names(self, schema: Schema, config: DestinationClientDwhConfiguration) -> Tuple[str, str], which creates the regular and the staging dataset names for a given schema and config. This method should also raise an exception if both names are the same. See the next point for the places where these normalized names are created.
  • Use this new method to create the normalized regular and staging dataset names for all destinations (including the filesystem destination). You can find all destination implementations under dlt/destinations/impl, or just search for all the places where normalize_staging_dataset_name() is used.
  • Write tests that demonstrate that this exception is raised if both datasets end up having the same name after normalization.
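
A rough sketch of the check described in the list above, not the actual dlt implementation. It is standalone and takes plain strings instead of a Schema and a DestinationClientDwhConfiguration. The exception name, the "%s" placeholder handling, and the lowercase normalizer are all assumptions made for illustration; the real method would use the destination's naming convention.

```python
from typing import Callable, Tuple


class StagingDatasetNameCollision(ValueError):
    """Hypothetical exception: final and staging dataset names are identical."""


def create_dataset_names(
    dataset_name: str,
    staging_layout: str = "%s_staging",
    normalize: Callable[[str], str] = str.lower,
) -> Tuple[str, str]:
    """Return (final, staging) normalized names; raise if they are the same.

    `staging_layout` and `normalize` are stand-ins for the configured staging
    name layout and the destination's naming convention (assumptions).
    """
    # Substitute the dataset name into the layout. A layout without the
    # placeholder is taken verbatim as the staging dataset name.
    staging_raw = staging_layout.replace("%s", dataset_name)

    final_name = normalize(dataset_name)
    staging_name = normalize(staging_raw)

    # Compare AFTER normalization: two different raw names can still
    # collapse to the same identifier in the destination.
    if final_name == staging_name:
        raise StagingDatasetNameCollision(
            f"Final and staging dataset are both named '{final_name}'. "
            "Truncating the staging dataset would truncate the final data."
        )
    return final_name, staging_name
```

With the default layout, create_dataset_names("my_data") yields two distinct names. A layout of "%s", or a fixed layout such as "MY_DATA" that only differs from the dataset name before normalization, raises the exception. Those two cases are the ones the tests from the last bullet would need to cover.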

Status

Done