Skip to content

Implement soft deletes for projects, releases and files #4440

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
di opened this issue Aug 2, 2018 · 16 comments
Open

Implement soft deletes for projects, releases and files #4440

di opened this issue Aug 2, 2018 · 16 comments
Labels
feature request needs discussion a product management/policy issue maintainers and users should discuss UX/UI design, user experience, user interface

Comments

@di
Copy link
Member

di commented Aug 2, 2018

What's the problem this feature will solve?
Currently there is some user frustration/friction about PyPI's policy of disallowing the reuse of filenames for distributions. I.e., once you delete a given distribution or release, it cannot be re-uploaded.

Describe the solution you'd like
One way we can improve this experience without reversing our policy is to allow for "soft" deletes. This would allow a maintainer to "delete" a file, release or project (same behavior as currently exist), but have the ability to see "deleted" files/releases/projects in the management UI, and be able to reverse the deletion.

Because we are currently not removing the actual files from our storage service, this will not result in significantly increased disk space. And while this will result in a slight increase in data in our database, it should be negligible because deletes happen relatively infrequently.

This feature is slightly complicated by the fact that currently we hook into the SQLAlchemy creation/deletion events to determine when to purge various pages from the cache. Removing true deletion of database objects will necessitate finding another way to initiate the purges.

Additional context
See also pypa/packaging-problems#74, where this was originally described in pypa/packaging-problems#74 (comment).

@di di added feature request developer experience Anything that improves the experience for Warehouse devs labels Aug 2, 2018
@dstufft
Copy link
Member

dstufft commented Aug 4, 2018

I actually don't think it's complicated at all by our SQLa hooks, because assuming soft delete is implemented as a column on the object, then that'll trigger the hook into the SQLAlchemy modified event (which we already have) to purge the cache.

It would only be complicated if we implemented soft delete as an additional table or something silly like that.

@di
Copy link
Member Author

di commented Aug 4, 2018

Right, of course. Should be pretty simple then, actually.

@dstufft
Copy link
Member

dstufft commented Aug 4, 2018

I think the biggest questions I have is what should this look like in practice?

Soft deletes for files are obvious, we treat them as immutable anyways so there isn’t much question about them.

Releases are a bit trickier. Right now you can delete a release and upload a new release with the same version but different content. However this only really works if you are able to upload files that didn’t already exist (e.g. you could upload a .zip if you had previously uploaded a .tar.gz). However it might be reasonable to just treat this as closing another loophole to where there is no longer a way to change a release metadata after it was first created. One edge case about this is what would we do if you tried to upload a file that had never existed to a soft deleted release? Would we undelete implicitly? What if the metadata changed?

Projects are maybe the hardest bit here. Generally the reason you want to delete a project is to enable another project to take that name (either now or in the future). You probably don’t want the ability for those new people to “undelete” a project. Although maybe the answer here is that we can implement soft delete for projects, and still show them on original owners pages (or in a sub page?) to undelete them. However if someone else comes along and tries to register that name, then we treat it as a hard delete and we actually remove the entire project, releases, files, etc before we register that as a brand new project to the new user.

A few other questions:

  • Do we want to offer the ability to hard delete an item still? Or would soft deletes be the only thing available to end users?
  • Do we want to keep soft deleted items around forever, or should they “time out” after a period of time and get purged eventually?
  • Should someone who has just been added to the project be able to see all of the existing soft deleted items (and restore them) or should they only see a snapshot of what has been soft deleted since they’ve been added?
    • if we only show them a snapshot, we should probably garbage collect soft deleted items that nobody can see anymore.
  • How Do we record these events in the journal for mirrors to use? Do they just look like normal deletes and reuploads or do they look like soft deletes and restores?
    • If they look like deletes and uploads, do we need to separate out an audit log vs the journal structure that mirrors use to mirror?
    • If they look like soft deletes and restores, can existing mirror clients handle that?

@di
Copy link
Member Author

di commented Aug 5, 2018

Just thinking through these...

One edge case about this is what would we do if you tried to upload a file that had never existed to a soft deleted release? Would we undelete implicitly? What if the metadata changed?

I think the easiest thing to do here would be to block the upload. If the user actually wanted the new distribution published, they would need to undelete the release, then the behavior would be the same as it is now (i.e., the metadata would not change).

Generally the reason you want to delete a project is to enable another project to take that name (either now or in the future). You probably don’t want the ability for those new people to “undelete” a project. Although maybe the answer here is that we can implement soft delete for projects, and still show them on original owners pages (or in a sub page?) to undelete them. However if someone else comes along and tries to register that name, then we treat it as a hard delete and we actually remove the entire project, releases, files, etc before we register that as a brand new project to the new user.

I think it would be reasonable for a hard delete of a project (what we do now) to just never happen. This would make adding a new owner the "correct" way to transfer a project, and a new project owner would get visibility into all the previous releases that they can't reuse. The lack of this information is also a source of confusion when re-registering deleted projects (but that happens relatively infrequently right now).

Do we want to offer the ability to hard delete an item still? Or would soft deletes be the only thing available to end users?

I think we want to offer this via escalating to an admin (in the case that there is sensitive data in the release, perhaps?) but otherwise, no. Otherwise the confusion that this is trying to solve would still be possible, and there would be two different types of deletes for users to have to decide between.

I'm generally looking at this from a user's perspective of "adding a feature to restore something I deleted" rather than "changing hard delete to a soft delete".

Do we want to keep soft deleted items around forever, or should they “time out” after a period of time and get purged eventually?

I think they should stick around, mostly for the reasons I mentioned about deleting/transferring projects above.

Should someone who has just been added to the project be able to see all of the existing soft deleted items (and restore them) or should they only see a snapshot of what has been soft deleted since they’ve been added?

I think they should be able to see them, otherwise the permissions/views here are going to be based on when the owner was added and this is going to make it more complicated to implement. I think generally if you're trusting someone with the ability to delete your project, you can also trust with with the ability to undelete parts of it.

How Do we record these events in the journal for mirrors to use? Do they just look like normal deletes and reuploads or do they look like soft deletes and restores?

I think we can get by here with just the existing delete/upload journals. Mirrors shouldn't need to keep track of soft deletes, PyPI can just abstract that away for them.

This makes the assumption that they can handle a series of Upload -> Delete -> Upload for the same item, so we should probably check on that before releasing.

@di di removed the developer experience Anything that improves the experience for Warehouse devs label Aug 5, 2018
@dstufft
Copy link
Member

dstufft commented Aug 5, 2018

FWIW, the question about whether a new user gets access to the old soft deleted items came from #4457, which isn't exactly the same but is somewhat similar in that both deal with the edge case of a user gaining access to data that came from before they were added to a project. Particularly if the user soft deleted the project years ago, and a new person just now registered the name again.

@nlhkabu nlhkabu added the UX/UI design, user experience, user interface label Dec 25, 2018
@brainwane brainwane added the needs discussion a product management/policy issue maintainers and users should discuss label Mar 19, 2019
@di
Copy link
Member Author

di commented Mar 22, 2019

Here's a WIP branch where I started to implement this: https://github.com/pypa/warehouse/compare/master...di:soft-deletes?expand=1

I got hung up on the Project.releases field. Essentially, there's a number of places where we don't want soft-deleted releases included (public UI) but there are places where we do want all releases included (maintainer/admin UI). I wanted to find a way to ensure that we're not accidentally using one or the other, and to ensure that we have good-performing queries for both, but I didn't find this immediately easy to do with the ORM.

@brainwane
Copy link
Contributor

Per discussion today, we'll want this for the upcoming work on detecting & deleting malware. A soft delete feature will mitigate the possibility of erroneous deletions by admins. (Right now the actual thing we do is delete the database record, so a "deleted" file is still available and could be recovered, but that's a headache compared to the undelete button that a soft delete feature would provide.)

@brainwane
Copy link
Contributor

To clarify: this issue #4440 is not within scope of the current contract with OTF; it's foundational work for the next funded work milestone which will probably start a few months from now (but I don't have a specific date).

@brainwane
Copy link
Contributor

Now that PEP 592 is accepted and implemented #5837, what else do we need to do to resolve this issue and unblock #7421?

@di
Copy link
Member Author

di commented Jul 1, 2020

We should determine whether soft deleting a release/file affects the total project size or not. Might be confusing for users to delete one or more releases but not see the total size decrease.

@VikramJayanthi17
Copy link
Contributor

I have been working on this issue and have run into some challenges. I worked off of @di's WIP branch. We tried 2 approaches and discovered problems with both of them.

Approach 1:

  • Method : Make the File, Release, and Project class inherit a SoftDeletable class which adds a soft_deleted column. We then use SQL Alchemy's before_compile to automatically filter out projects, releases, and files that have soft_deleted == True. This would allow for existing queries to not have to worry about filtering soft deleted projects.
  • Problem : Warehouse's caching pattern causes this method to error as before_compile is called when the cache is attempting to find the project that's being soft-deleted and since the project has been soft deleted, before_compile filters it out. This causes the cache to not be able to find the soft-deleted project and it throws an "SQLAlchemy : ObjectDeletedError".

Approach 2:

  • Method : Create a DeletedProject table that is a copy of the Project table and add a trigger that is called when a Project is deleted which adds the deleted project to the DeletedProject table. I wasn't sure which SQL Alchemy inheritance method to use and decided to re-write the columns of Project into DeletedProject. This would theoretically allow for queries that want to get soft-deleted projects query an additional table to get them and existing queries not be affected by soft-deletes.
  • Problem : Because of Project's relationships with other tables this didn't work as there was no relationship between DeletedProject and the other tables. This makes me think that there's another way to copy all the columns of a table into another table via inheritance that maintains the relationships appropriately. I haven't worked with ORMs/SQL Alchemy before so I do think I am missing something here.

We want to avoid changing existing queries while implementing this feature. Any advice/ideas on how to approach this issue generally or alter these approaches to make them work would be appreciated, thanks.

@ewdurbin
Copy link
Member

ewdurbin commented Jul 9, 2020

@VikramJayanthi17 Approach 1 sounds most straightforward to me. I'm not aware of a cache being involved.

Do you have some example code you could share? I'm wondering if perhaps the ObjectDeletedError could be transformed into the correct 404.

@VikramJayanthi17
Copy link
Contributor

VikramJayanthi17 commented Jul 9, 2020

This is the stack trace:

Traceback (most recent call last):
  File "/opt/warehouse/lib/python3.8/site-packages/pyramid_debugtoolbar/panels/performance.py", line 58, in resource_timer_handler
     result = handler(request)
  File "/opt/warehouse/src/warehouse/raven.py", line 40, in raven_tween
     return handler(request)
  File "/opt/warehouse/lib/python3.8/site-packages/pyramid_tm/__init__.py", line 178, in tm_tween
     reraise(*exc_info)
  File "/opt/warehouse/lib/python3.8/site-packages/pyramid_tm/compat.py", line 36, in reraise
     raise value
  File "/opt/warehouse/lib/python3.8/site-packages/pyramid_tm/__init__.py", line 159, in tm_tween
     return _finish(request, manager.commit, response)
  File "/opt/warehouse/lib/python3.8/site-packages/pyramid_tm/__init__.py", line 103, in _finish
     reraise(*exc_info)
  File "/opt/warehouse/lib/python3.8/site-packages/pyramid_tm/compat.py", line 36, in reraise
     raise value
  File "/opt/warehouse/lib/python3.8/site-packages/pyramid_tm/__init__.py", line 83, in _finish
     finisher()
  File "/opt/warehouse/lib/python3.8/site-packages/transaction/_manager.py", line 135, in commit
     return self.get().commit()
  File "/opt/warehouse/lib/python3.8/site-packages/transaction/_transaction.py", line 282, in commit
     reraise(t, v, tb)
  File "/opt/warehouse/lib/python3.8/site-packages/transaction/_compat.py", line 45, in reraise
     raise value
  File "/opt/warehouse/lib/python3.8/site-packages/transaction/_transaction.py", line 273, in commit
     self._commitResources()
  File "/opt/warehouse/lib/python3.8/site-packages/transaction/_transaction.py", line 465, in _commitResources
     reraise(t, v, tb)
  File "/opt/warehouse/lib/python3.8/site-packages/transaction/_compat.py", line 45, in reraise
     raise value
  File "/opt/warehouse/lib/python3.8/site-packages/transaction/_transaction.py", line 437, in _commitResources
     rm.tpc_begin(self)
  File "/opt/warehouse/lib/python3.8/site-packages/zope/sqlalchemy/datamanager.py", line 112, in tpc_begin
     self.session.flush()
  File "/opt/warehouse/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2523, in flush
     self._flush(objects)
  File "/opt/warehouse/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2664, in _flush
     transaction.rollback(_capture_exception=True)
  File "/opt/warehouse/lib/python3.8/site-packages/sqlalchemy/util/langhelpers.py", line 68, in __exit__
     compat.raise_(
  File "/opt/warehouse/lib/python3.8/site-packages/sqlalchemy/util/compat.py", line 178, in raise_
     raise exception
  File "/opt/warehouse/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2628, in _flush
     self.dispatch.after_flush(self, flush_context)
  File "/opt/warehouse/lib/python3.8/site-packages/sqlalchemy/event/attr.py", line 320, in __call__
     fn(*args, **kw)
  File "/opt/warehouse/src/warehouse/cache/origin/__init__.py", line 42, in store_purge_keys
     purges.update(key_maker(obj).purge)
  File "/opt/warehouse/src/warehouse/cache/origin/__init__.py", line 122, in key_maker
     cache=[k.format(obj=obj) for k in cache_keys],
  File "/opt/warehouse/src/warehouse/cache/origin/__init__.py", line 122, in <listcomp>
     cache=[k.format(obj=obj) for k in cache_keys],
  File "/opt/warehouse/lib/python3.8/site-packages/sqlalchemy/orm/attributes.py", line 287, in __get__
     return self.impl.get(instance_state(instance), dict_)
  File "/opt/warehouse/lib/python3.8/site-packages/sqlalchemy/orm/attributes.py", line 718, in get
     value = state._load_expired(state, passive)
  File "/opt/warehouse/lib/python3.8/site-packages/sqlalchemy/orm/state.py", line 652, in _load_expired
     self.manager.deferred_scalar_loader(self, toload)
  File "/opt/warehouse/lib/python3.8/site-packages/sqlalchemy/orm/loading.py", line 1018, in load_scalar_attributes
     raise orm_exc.ObjectDeletedError(state) sqlalchemy.orm.exc.ObjectDeletedError: Instance '<Project at 0x7fc244bac880>' has been deleted, or its row is otherwise not present.

The code can be found on this branch : https://github.com/VikramJayanthi17/warehouse/tree/soft-deletes. before_compile in packaging/models.py and delete_project in manage/views.py contain the logic for filtering soft-deletes and soft-deleting projects respectively.

Thanks for the help @ewdurbin and let me know if there is any other information I can provide.

@di
Copy link
Member Author

di commented Jul 9, 2020

The issue is with our cache-purging mechanism (https://github.com/pypa/warehouse/blob/master/warehouse/cache/origin/__init__.py), specifically that the instance it's trying to generate keys for, which has just been modified, has just been modified in a way that causes it to be excluded from any query due to before_compile, so SQLAlchemy thinks something else has removed it from the DB.

@ewdurbin
Copy link
Member

ewdurbin commented Jul 9, 2020

Ah! Thanks for context @di.

@VikramJayanthi17 would you be OK opening a Draft PR from that branch? Easier to comment/track suggestions that way.

@VikramJayanthi17
Copy link
Contributor

@ewdurbin @di Created a draft PR #8237

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request needs discussion a product management/policy issue maintainers and users should discuss UX/UI design, user experience, user interface
Projects
None yet
Development

No branches or pull requests

6 participants