@JBWilkie JBWilkie commented Oct 9, 2024

Problem

Recently, I discovered that it's possible to add slots to existing items. The ability to upload multi-file items with push was built on the assumption that this was impossible. Because we name slots differently depending on the merge mode, this leads to the following behaviour:

1: Upload some files using one merge mode --> No error
2: Upload the same files using a different merge mode --> No error
3: Upload the same files again using the same merge mode as in step 2 --> You get an error about skipping files

We've decided that darwin-py should block this type of scenario, as most users would expect deduplication validation to take place at the item-name level.

Solution

Add a check to the UploadHandler constructor that runs the following steps before the upload begins:

  • 1: Gets the complete list of full remote filepaths in the target dataset
  • 2: Checks each planned full remote path against this list. If a path matches, we remove that file from the files to be uploaded and print a warning to the console
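
For reviewers who want a concrete picture, the two steps above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `PlannedUpload`, `full_path`, and `skip_existing` are hypothetical names, and the sketch assumes remote paths are already available as normalised strings of the form `/{item_path}/{item_name}`.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class PlannedUpload:
    """Hypothetical stand-in for a local file queued for upload."""

    item_path: str  # remote folder, e.g. "/scans" or "/"
    item_name: str  # remote item name, e.g. "a.dcm"

    @property
    def full_path(self) -> str:
        # Build {item_path}/{item_name} with exactly one leading slash.
        folder = self.item_path.strip("/")
        return f"/{folder}/{self.item_name}" if folder else f"/{self.item_name}"


def skip_existing(
    planned: Iterable[PlannedUpload],
    remote_full_paths: Iterable[str],
) -> Tuple[List[PlannedUpload], List[PlannedUpload]]:
    """Split planned uploads into (to_upload, skipped).

    A file is skipped when its full remote path already exists in the
    dataset, regardless of merge mode or slot naming.
    """
    existing = set(remote_full_paths)  # set gives O(1) membership checks
    to_upload: List[PlannedUpload] = []
    skipped: List[PlannedUpload] = []
    for upload in planned:
        if upload.full_path in existing:
            print(f"WARNING: {upload.full_path} already exists in the dataset; skipping")
            skipped.append(upload)
        else:
            to_upload.append(upload)
    return to_upload, skipped
```

Comparing on the full remote path rather than on slot names is what makes the check independent of the merge mode used for an earlier upload.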

Changelog

Prevent upload of dataset items where the {item_path}/{item_name} already exists in the dataset


@JBWilkie JBWilkie merged commit acb371d into master Oct 10, 2024