Implement soft deletes for projects, releases and files #4440
Comments
I actually don't think it's complicated at all by our SQLa hooks, because assuming soft delete is implemented as a column on the object, then that'll trigger the hook into the SQLAlchemy modified event (which we already have) to purge the cache. It would only be complicated if we implemented soft delete as an additional table or something silly like that. |
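To illustrate the point above, here is a minimal sketch in plain Python. It is not SQLAlchemy's event API and not Warehouse's code: `on_modified`, `File`, `purge_cache`, and the key format are all hypothetical stand-ins. It only shows that if soft delete is an attribute on the object, then deleting and undeleting are ordinary modifications and reach the same hook as any other change.

```python
from datetime import datetime, timezone
from typing import Callable, List

# Hypothetical stand-in for an ORM "modified" event registry.
_modified_listeners: List[Callable[["File"], None]] = []


def on_modified(listener):
    """Register a callback that runs whenever a File attribute changes."""
    _modified_listeners.append(listener)
    return listener


class File:
    def __init__(self, filename):
        # Bypass __setattr__ so construction does not count as a modification.
        object.__setattr__(self, "filename", filename)
        object.__setattr__(self, "deleted_at", None)

    def __setattr__(self, name, value):
        object.__setattr__(self, name, value)
        for listener in _modified_listeners:
            listener(self)


purged_keys = []


@on_modified
def purge_cache(obj):
    # Hypothetical cache key format, for illustration only.
    purged_keys.append(f"file/{obj.filename}")


f = File("example-1.0.tar.gz")
f.deleted_at = datetime.now(timezone.utc)  # soft delete: a plain column update
f.deleted_at = None                        # undelete: another plain update
```

Both assignments trigger `purge_cache`, so no separate deletion hook is needed in this model.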
Right, of course. Should be pretty simple then, actually. |
The biggest question I have is what this should look like in practice.

Soft deletes for files are obvious: we treat them as immutable anyway, so there isn’t much question about them.

Releases are a bit trickier. Right now you can delete a release and upload a new release with the same version but different content. However, this only really works if you upload files that didn’t already exist (e.g. you could upload a .zip if you had previously uploaded a .tar.gz). It might be reasonable to treat this as closing another loophole, so that there is no longer a way to change a release’s metadata after it was first created. One edge case: what would we do if you tried to upload a file that had never existed to a soft-deleted release? Would we undelete implicitly? What if the metadata changed?

Projects are maybe the hardest bit here. Generally the reason you want to delete a project is to enable another project to take that name (either now or in the future). You probably don’t want those new people to be able to “undelete” a project. Maybe the answer is that we implement soft delete for projects and still show them on the original owners’ pages (or in a sub page?) so they can undelete them. However, if someone else comes along and tries to register that name, we treat it as a hard delete and actually remove the entire project, releases, files, etc. before we register it as a brand new project for the new user.

A few other questions:
|
Just thinking through these...
I think the easiest thing to do here would be to block the upload. If the user actually wanted the new distribution published, they would need to undelete the release, then the behavior would be the same as it is now (i.e., the metadata would not change).
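A rough sketch of that rule, assuming a soft-deleted release is marked by a `deleted_at` timestamp. The names `Release`, `UploadBlocked`, and `check_upload_allowed` are hypothetical and do not come from Warehouse.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Release:
    version: str
    deleted_at: Optional[datetime] = None  # None means the release is live


class UploadBlocked(Exception):
    """Raised when an upload targets a soft-deleted release."""


def check_upload_allowed(release: Optional[Release]) -> None:
    # A release that never existed, or a live one, accepts uploads as today.
    if release is None or release.deleted_at is None:
        return
    # No implicit undelete: the maintainer has to restore the release first.
    raise UploadBlocked(
        f"Release {release.version} is deleted; undelete it before uploading."
    )
```

Because the upload is rejected rather than triggering an implicit undelete, the question of changed metadata never arises in this model.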
I think it would be reasonable for a hard delete of a project (what we do now) to just never happen. This would make adding a new owner the "correct" way to transfer a project, and a new project owner would get visibility into all the previous releases that they can't reuse. The lack of this information is also a source of confusion when re-registering deleted projects (but that happens relatively infrequently right now).
I think we want to offer this via escalating to an admin (in the case that there is sensitive data in the release, perhaps?) but otherwise, no. Otherwise the confusion that this is trying to solve would still be possible, and there would be two different types of deletes for users to have to decide between. I'm generally looking at this from a user's perspective of "adding a feature to restore something I deleted" rather than "changing hard delete to a soft delete".
I think they should stick around, mostly for the reasons I mentioned about deleting/transferring projects above.
I think they should be able to see them; otherwise the permissions/views here are going to be based on when the owner was added, and that is going to make this more complicated to implement. I think generally if you're trusting someone with the ability to delete your project, you can also trust them with the ability to undelete parts of it.
I think we can get by here with just the existing delete/upload journals. Mirrors shouldn't need to keep track of soft deletes, PyPI can just abstract that away for them. This makes the assumption that they can handle a series of |
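One way to picture the journal idea, as a sketch under stated assumptions: the `"add"` and `"remove"` action names, the `Project` class, and `replay` are hypothetical and do not reflect Warehouse's actual journal schema. A soft delete is journaled as a removal and an undelete as an addition, so a mirror replaying the journal never needs to know that soft deletes exist.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class Project:
    live: Set[str] = field(default_factory=set)
    soft_deleted: Set[str] = field(default_factory=set)
    journal: List[Tuple[str, str]] = field(default_factory=list)

    def upload(self, filename: str) -> None:
        self.live.add(filename)
        self.journal.append(("add", filename))

    def soft_delete(self, filename: str) -> None:
        self.live.remove(filename)
        self.soft_deleted.add(filename)
        # Mirrors see an ordinary removal.
        self.journal.append(("remove", filename))

    def undelete(self, filename: str) -> None:
        self.soft_deleted.remove(filename)
        self.live.add(filename)
        # Mirrors see an ordinary addition.
        self.journal.append(("add", filename))


def replay(journal: List[Tuple[str, str]]) -> Set[str]:
    """What a mirror would hold after applying the journal in order."""
    mirrored: Set[str] = set()
    for action, filename in journal:
        if action == "add":
            mirrored.add(filename)
        else:
            mirrored.discard(filename)
    return mirrored


project = Project()
project.upload("example-1.0.tar.gz")
project.soft_delete("example-1.0.tar.gz")
project.undelete("example-1.0.tar.gz")
```

In this model `replay(project.journal)` always matches `project.live`, whatever sequence of deletes and undeletes took place.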
FWIW, the question about whether a new user gets access to the old soft deleted items came from #4457, which isn't exactly the same but is somewhat similar in that both deal with the edge case of a user gaining access to data that came from before they were added to a project. Particularly if the user soft deleted the project years ago, and a new person just now registered the name again. |
Here's a WIP branch where I started to implement this: https://github.com/pypa/warehouse/compare/master...di:soft-deletes?expand=1 I got hung up on the |
Per discussion today, we'll want this for the upcoming work on detecting & deleting malware. A soft delete feature will mitigate the possibility of erroneous deletions by admins. (Right now the actual thing we do is delete the database record, so a "deleted" file is still available and could be recovered, but that's a headache compared to the undelete button that a soft delete feature would provide.) |
To clarify: this issue #4440 is not within scope of the current contract with OTF; it's foundational work for the next funded work milestone which will probably start a few months from now (but I don't have a specific date). |
Now that PEP 592 is accepted and implemented #5837, what else do we need to do to resolve this issue and unblock #7421? |
We should determine whether soft deleting a release/file affects the total project size or not. Might be confusing for users to delete one or more releases but not see the total size decrease. |
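To make the two options concrete, here is a small sketch with hypothetical names (`File`, `project_sizes`); it is not how Warehouse computes project size. It reports both the size of visible files and the size of everything still stored, which are the two numbers the decision is between.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class File:
    filename: str
    size: int            # bytes
    deleted: bool = False  # True when the file is soft deleted


def project_sizes(files: Iterable[File]) -> Tuple[int, int]:
    """Return (visible_size, stored_size) in bytes."""
    files = list(files)
    visible = sum(f.size for f in files if not f.deleted)
    stored = sum(f.size for f in files)
    return visible, stored


files = [
    File("example-1.0.tar.gz", 1000),
    File("example-1.0.zip", 1500, deleted=True),
]
visible_size, stored_size = project_sizes(files)
```

If the displayed total is `stored_size`, deleting a release leaves the number unchanged, which is the confusing case described above; if it is `visible_size`, the number drops even though storage use does not.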
I have been working on this issue and have run into some challenges. I worked off of @di's WIP branch. We tried 2 approaches and discovered problems with both of them. Approach 1:
Approach 2:
We want to avoid changing existing queries while implementing this feature. Any advice/ideas on how to approach this issue generally or alter these approaches to make them work would be appreciated, thanks. |
@VikramJayanthi17 Approach 1 sounds most straightforward to me. I'm not aware of a cache being involved. Do you have some example code you could share? I'm wondering if perhaps the ObjectDeletedError could be transformed into the correct 404. |
This is the stack trace:
The code can be found on this branch: https://github.com/VikramJayanthi17/warehouse/tree/soft-deletes. Thanks for the help @ewdurbin, and let me know if there is any other information I can provide. |
The issue is with our cache-purging mechanism (https://github.com/pypa/warehouse/blob/master/warehouse/cache/origin/__init__.py). Specifically, the instance it's trying to generate keys for has just been modified in a way that causes it to be excluded from any query due to |
Ah! Thanks for context @di. @VikramJayanthi17 would you be OK opening a Draft PR from that branch? Easier to comment/track suggestions that way. |
What's the problem this feature will solve?
Currently there is some user frustration/friction about PyPI's policy of disallowing the reuse of filenames for distributions. I.e., once you delete a given distribution or release, it cannot be re-uploaded.
Describe the solution you'd like
One way we can improve this experience without reversing our policy is to allow for "soft" deletes. This would allow a maintainer to "delete" a file, release or project (same behavior as currently exists), but still see "deleted" files/releases/projects in the management UI and be able to reverse the deletion.
Because we are currently not removing the actual files from our storage service, this will not result in significantly increased disk space. And while this will result in a slight increase in data in our database, it should be negligible because deletes happen relatively infrequently.
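A minimal sketch of the soft delete described above, using the standard library's `sqlite3` rather than Warehouse's actual models. The `files` table, the `deleted_at` column, the `live_files` view, and the helper functions are all assumptions made for illustration.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE files (
        filename   TEXT PRIMARY KEY,
        deleted_at TEXT  -- NULL means the file is live
    );
    -- Public-facing reads go through this view, so they never see
    -- soft-deleted rows; the management UI can read the table directly.
    CREATE VIEW live_files AS
        SELECT filename FROM files WHERE deleted_at IS NULL;
    """
)


def soft_delete(filename):
    now = datetime.now(timezone.utc).isoformat()
    conn.execute(
        "UPDATE files SET deleted_at = ? WHERE filename = ?", (now, filename)
    )


def undelete(filename):
    conn.execute(
        "UPDATE files SET deleted_at = NULL WHERE filename = ?", (filename,)
    )


def live():
    rows = conn.execute("SELECT filename FROM live_files ORDER BY filename")
    return [row[0] for row in rows]


conn.execute("INSERT INTO files (filename) VALUES ('example-1.0.tar.gz')")
conn.execute("INSERT INTO files (filename) VALUES ('example-1.0.zip')")
soft_delete("example-1.0.zip")
```

The row for the deleted file stays in the table, which is what lets `undelete` restore it and what keeps the filename reserved against reuse.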
This feature is slightly complicated by the fact that currently we hook into the SQLAlchemy creation/deletion events to determine when to purge various pages from the cache. Removing true deletion of database objects will necessitate finding another way to initiate the purges.
Additional context
See also pypa/packaging-problems#74, where this was originally described in pypa/packaging-problems#74 (comment).