"Clean Up Files" Feature #1023

Open
3 tasks done
natanrolnik opened this issue Mar 14, 2016 · 62 comments · May be fixed by #8385
Labels
bounty:$100 (Bounty applies for fixing this issue, Parse Bounty Program)
type:feature (New feature or improvement of existing feature)

Comments

@natanrolnik
Contributor

One of the features I liked on hosted Parse was the Clean Up Files button in the settings. It deleted every file stored in S3, for example, that was no longer referenced by a PFFile. I especially liked it because it let us save on unused/unneeded resources.

Maybe a REST call using the master key would be enough initially? In the future, it could possibly be integrated with parse-dashboard.

I know it's lower priority compared to the features/fixes currently being developed, but it would be great to have.

@natanrolnik natanrolnik changed the title Allow deleting unused files "Clean Up Files" Mar 15, 2016
@natanrolnik natanrolnik changed the title "Clean Up Files" "Clean Up Files" Feature Mar 15, 2016
@gfosco
Contributor

gfosco commented Mar 17, 2016

This would be pretty difficult actually, and would need to be built for each specific Files adapter. Right now, there's no 'listing' of what files exist through the adapter.

@natario1

+1, agree with the need.

@ckarmy

ckarmy commented Jul 18, 2016

Is it possible to clean up the unused files stored in GridStore now?

@yorkwang

+1, it's a very useful feature.

@Lokiitzz

Lokiitzz commented Oct 3, 2016

+1, it would be nice.

@umair6

umair6 commented Oct 19, 2016

+1

@abdulwasayabbasi

+1 very much needed

@JoseVigil

+1

@natario1

natario1 commented Nov 23, 2016

Just asking: how many of you ever actually needed a file after deleting pointers to it?

I feel the most common use of files is “if I delete the pointer, I don’t need the file anymore”. If this is the case, why not make it the default in parse-server?

I mean that when any object is deleted, all of its file fields are processed after the delete and adapter.deleteFile() is called for each. This could be opt-in/opt-out in the ParseServer constructor, and it is much easier than a complete “clean up” feature.

Given how tricky the full task is, it would also be cool if parse-server kept a Files table with url and usage_count, to simplify the rest.
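
For illustration, a minimal Cloud Code sketch of this auto-delete idea, assuming a class named Photo with a single File field picture; the class and field names are examples only, and this is not default Parse Server behavior.

Parse.Cloud.afterDelete('Photo', async request => {
  // Hypothetical example: "Photo" and its File field "picture" stand in for
  // whatever class/field actually stores the file.
  const file = request.object.get('picture');
  if (!file) {
    return;
  }
  try {
    // Parse.File#destroy removes the stored file through the configured
    // files adapter; it requires the master key.
    await file.destroy({ useMasterKey: true });
  } catch (e) {
    console.error(`Could not delete file ${file.name()}: ${e.message}`);
  }
});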

@abdulwasayabbasi

@natario1
Just to answer your question: the reason I want to keep the files is the intermediate state.
For example, consider a mobile game using a Parse backend to host some zip packages used in the game.
While we delete/replace them with new packages in the Parse dashboard, there will be a period during which users who still have the old package URL locally in their app/game will have issues until the new configs/URLs are loaded.
But I also want to keep the size of my database as small as possible, so at the right time I will delete the old packages.

@natario1

@abdulwasayabbasi makes sense, thank you. Just wondering how frequent that is.

Your use case would not be affected by an ‘auto delete’ feature, since you are just updating the file field. To take advantage of it, you would have to create a new object with the new package file and delete the older object when you feel safe, so the old file gets auto-deleted.

@Lokiitzz

Lokiitzz commented Dec 5, 2016

I made my own "clean up files" script. Maybe it could help someone!

https://gist.github.com/Lokiitzz/6afbf0573665d3170ffb1e83565a0fef

Be careful :)

@davimacedo
Member

Why not a PR to Parse Server? :)

@flovilmart
Contributor

The code won't work on the server as it loads all objects in memory.

@davimacedo
Member

Yes, you're right. I didn't check it before.

@flovilmart
Contributor

For features like this, I'd love to see command line tools rather than just another endpoint that requires maintenance.

@mrmarcsmith
Contributor

Why not pass an "auto-delete-files" flag to the server on startup, so that when an individual file pointer is deleted or replaced, the file itself is deleted? This would help the 50% of people who only use PFFiles for profile pictures (files that won't be needed after deletion or replacement), while leaving the other 50% who want fine-grained control unaffected, because they didn't pass the flag. Would this be a valid solution?
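
For illustration, a sketch of how such a startup flag might look; an autoDeleteFiles option does not exist in Parse Server, so both the option name and its behavior are hypothetical.

const { ParseServer } = require('parse-server');

const api = new ParseServer({
  appId: 'myAppId',
  masterKey: 'myMasterKey',
  databaseURI: 'mongodb://localhost:27017/dev',
  // Hypothetical option: delete the stored file whenever the pointer
  // to it is deleted or replaced.
  autoDeleteFiles: true,
});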

@funkenstrahlen
Contributor

funkenstrahlen commented Feb 7, 2017

I also have this problem. I deleted a lot of rows in my MongoDB database with Parse Dashboard, including the references to many images. Now I am unable to find those files and clean them up. Is there any other (manual) way?

I expected Parse Dashboard to clean up the PFFiles before removing the references to them.

@respectTheCode

Any progress on this?

@flovilmart
Contributor

Not yet; this is not a feature that is being actively worked on, but a pull request or a separate project could take care of it.

@respectTheCode

Depending on how you look at it, this is either an undocumented "feature" or a huge bug. Either way, it has huge and expensive consequences that should at the very least be well documented.

@flovilmart
Contributor

flovilmart commented Mar 30, 2017

undocumented "feature" or a huge bug

What do you mean by that?

This is neither documented nor a bug, as it's simply not implemented: neither listing the missing files nor deleting an existing file through the file adapters is implemented.

Because a file could be referenced by multiple Objects, we don't keep a reference count on them.
Steps are relatively easy to describe:

  • list all files present in the DB
  • list all files present in the bucket/storage
  • for each file present in the storage, if it is missing from the list of files in the DB, delete it from storage.

However, it is not trivial to implement. A rough sketch of these steps follows.
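
A minimal sketch of those three steps, assuming hypothetical helpers listFilesInDatabase() and listFilesInStorage() that return file names; only adapter.deleteFile() is part of the actual files adapter interface.

async function cleanUpOrphanedFiles(adapter) {
  // Step 1: every file name referenced somewhere in the database.
  const referenced = new Set(await listFilesInDatabase());
  // Step 2: every file name present in the bucket/storage.
  const stored = await listFilesInStorage(adapter);
  // Step 3: delete stored files that no object references anymore.
  for (const filename of stored) {
    if (!referenced.has(filename)) {
      await adapter.deleteFile(filename);
    }
  }
}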

@dblythy
Member

dblythy commented Nov 11, 2020

What do you think of this approach @mtrezza?

FilesController.js

async cleanUpFiles(database) {
  // Only adapters that implement getFiles() can list their stored files.
  if (!this.adapter.getFiles) {
    return;
  }
  const files = await this.adapter.getFiles(this.config);
  if (files.length === 0) {
    return;
  }
  // Build a map of className -> [names of fields of type File] from the schema.
  const schema = await database.loadSchema();
  const allClasses = await schema.getAllClasses();
  const classQueries = {};
  for (const classSchema of allClasses) {
    const fields = classSchema.fields;
    for (const fieldName of Object.keys(fields)) {
      if (fields[fieldName].type === 'File') {
        const classData = classQueries[classSchema.className] || [];
        classData.push(fieldName);
        classQueries[classSchema.className] = classData;
      }
    }
  }
  if (Object.keys(classQueries).length === 0) {
    return;
  }
  // For each stored file, look for a reference in every File field of every class.
  for (const file of files) {
    try {
      const queries = [];
      for (const className of Object.keys(classQueries)) {
        const fieldQueries = classQueries[className].map(key => {
          const query = new Parse.Query(className);
          query.equalTo(key, file);
          return query;
        });
        const orQuery = Parse.Query.or(...fieldQueries);
        orQuery.select('objectId');
        queries.push(orQuery);
      }
      const results = await Promise.all(queries.map(query => query.first({ useMasterKey: true })));
      // Keep the file if any class returned at least one referencing object.
      if (results.some(obj => obj)) {
        continue;
      }
      await file.destroy({ useMasterKey: true });
    } catch (e) {
      // Ignore errors for this file and continue with the next one.
    }
  }
}

And then getFiles needs to be added to the adapter. For GridFS:

async getFiles(config) {
  const bucket = await this._getBucket();
  const files = [];
  // bucket.find() with no filter matches every file stored in GridFS.
  const fileDocuments = await bucket.find().toArray();
  fileDocuments.forEach(({ filename }) => {
    const file = new Parse.File(filename);
    file._url = this.getFileLocation(config, filename);
    files.push(file);
  });
  return files;
}

And then attached to a route in FilesRouter.js

Conceptually, this looks up the schema for all classes and figures out which fields are of type File. Then, for each file, it queries those fields in the respective classes, and if there is no reference, it deletes the file.

It takes about 2-3 minutes per 1,000 files. Tested on my servers and it works well. It could be faster, but I was conscious of query limits causing files to be removed by accident; I wanted to be 100% sure a file is unreferenced prior to deletion.

Related: #546, #6780

@davimacedo
Member

It is a good start, but there are cases in which the files are not stored in a field of type File. Sometimes people store references to files in arrays and objects. I've also seen people just uploading the files and never referencing them in any other object. So I'm afraid of having this kind of script running automatically.

@dblythy
Member

dblythy commented Nov 11, 2020

cases in which the files are not stored in a field of type File.

Hmmm, interesting. What do you think of:

Requiring the locations of the file references in the POST request to delete files, e.g.:

{
  '_User': [
    'photos' // if the schema says photos is an array, change equalTo to containedIn
  ],
  'Photos': [
    'photos.thumbnail'
  ]
}

Or perhaps add a callback in Parse.Cloud, or something similar, to decide whether a file should be deleted once it has been flagged for "cleanup".

The only other solution I can think of is to query every object and loop through its fields to check for the file, which would be quite intensive.

Either way, warnings about the caveats will have to be shown in the dashboard/docs prior to running the function.

@davimacedo
Member

Actually, the current approach of only searching the File fields is already very intensive, depending on the size of the collections and how many files the app has. This is probably a script that should not run in the parse-server process, but rather via a CLI.

@mtrezza
Member

mtrezza commented Nov 11, 2020

Sometimes people store references to files in arrays and objects.

I think if we can get to a PR that covers probably the most common case which is storing a file in a field of type File, we would already make many people happy. Maybe other creative ways of storing files can be addressed in a follow-up PR.

I've also seen people just uploading the files and never referencing them in any other object

Are these files still needed or should they be cleaned up?

I'm afraid of having this kind of script running automatically.

I agree. Such a script should not run automatically (without control of schedule and batch size anyway), because these mass queries can have a significant performance impact / cost implication on external resources.

Other thoughts:

  • How does this script scale, e.g. for an S3 bucket with 10 million files or a MongoDB collection with 10 million docs?
  • Do the queries in the script need any indices for efficiency? This can get quite complex when file references are searched in nested arrays and objects.
  • How is this script supposed to be invoked, e.g. via API trigger in a dedicated server instance?

@dblythy
Member

dblythy commented Nov 11, 2020

I think if we can get to a PR that covers probably the most common case which is storing a file in a field of type File, we would already make many people happy. Maybe other creative ways of storing files can be addressed in a follow-up PR.

I agree; the risks/caveats should be explicitly stated, so people who store files in more complex structures understand not to use the cleanup, or at least understand the risks associated with running /cleanupfiles.

I'm afraid of having this kind of script running automatically.

I'd imagine it would be a button in the dashboard (as with parse.com) that would be run once every month or so. I wouldn't propose running it unless the developer directly triggers it.

  • How does this script scale, e.g. for an S3 bucket with 10 million files and MongoDB collections with 5 million docs?

Honestly, I wouldn't imagine it would scale well, especially with configurations that have multiple File fields in their schemas, as it queries files and classes one by one. I'd previously written it to use containedIn, but again I was worried about query limits not returning all of the associated objects. I would imagine it would take a while and would be a background task (e.g. "we're now cleaning up your files").

  • Do the queries in the script need any indices for efficiency?

I would imagine that would speed up the cleanup time. Maybe we could recommend creating indexes on File fields if you're using a cleanup?

Would running the individual queries for the individual objects in parallel speed it up? Also, is it worth removing await from the destroy call, so the script can keep looping through the files?

  • How is this script supposed to be invoked, e.g. via API trigger in a dedicated server instance?

Via an API trigger:

router.post(
  '/files/cleanupfiles',
  Middlewares.handleParseHeaders,
  Middlewares.enforceMasterKeyAccess,
  this.cleanupHandler
);

@davimacedo
Member

I'd not go with an API route. This process should not run in the same process as Parse Server. It may cause the app to become unresponsive in the case of an app with a large number of files/objects.

I agree with a first simple version, but we do need to make sure that there is a big alert for developers before firing the script. If via dashboard, it should be something like what we currently have in place for deleting all rows in a class.

The caveat here is not only that files won't be deleted in a more complex structure, but that a lot of files may actually be deleted by accident in a more complex structure.

We need to keep in mind that the files feature is not only meant to be used with referenced files. It is a file repository, and those files may never be referenced by any object. We are building a feature that is conceptually the same as a feature to automatically delete all objects of a class that are not referenced by any other object. It is a valid feature, but we need to make sure that developers know what they are doing.

Also, let's first agree on the API and how this feature will work; I may have some code to share.

@davimacedo
Member

A lot of ideas can be seen in this project: https://github.com/parse-server-modules/parse-files-utils

It is an old project but it has some code in place to search for all files in all objects of an app.

@davimacedo
Member

@mtrezza I believe we should reopen this issue, right? What is the new procedure?

@mtrezza mtrezza reopened this Nov 11, 2020
@mtrezza
Member

mtrezza commented Nov 11, 2020

@davimacedo Yes, thanks. The procedure is to re-open the issue and remove the up-for-grabs label when someone is actively working on it.

@mtrezza
Member

mtrezza commented Nov 11, 2020

@davimacedo

I'd not go with an api route. This process should not run in the same process of Parse Server.

My first thought was that this script should not even be part of Parse Server, but an external tool. But then I thought we could make it part of Parse Server for convenience and advise developers to spin up a new, dedicated instance of Parse Server for this purpose that does not take any app requests. Like a LiveQuery server.

If via dashboard, it should something like we have currently in place for deleting all rows in a class.

Yes, it should definitely be more than a simple "Are you sure? Yes/No" dialog, with information such as:

  • it will delete any files that are not referenced in a File pointer field
  • it may have a severe performance impact on the running Parse Server instance
  • it may require creating indices in the DB (if the script can't do that)
  • it may have a severe performance impact on the DB
  • it may have a cost implication on external resources (such as a huge file list query, although I think S3 only charges for data traffic?)

It is a file repository and those files may never be referenced by any object. We are building a feature that conceptually is the same thing of building a feature to automatically delete all objects of a class that are not referenced by any other object.

Do you have any example use cases in mind for unreferenced files in a storage bucket, so we can get a better feel for how many deployments would be affected? I can only think of files like logs that are stored for manual retrieval, or maybe the files are processed automatically by a script of the storage bucket provider. All rare use cases I think.

I think the current script is more of a proof of concept. It is not scalable and would almost certainly crash/block the DB for an unacceptable amount of time on any seriously sized production system.

@davimacedo
Member

I think the current script is more of a proof of concept. It is not scalable and would almost certainly crash/block the DB for an unacceptable amount of time on any seriously sized production system.

That's why I'd not put the script in the API. It would only be a matter of time before people start complaining about the script not working. The same happened with the push notifications system: it took a long time to get a scalable process, because previously a single Parse Server instance was trying to handle all pushes.

For this to be scalable in the API, we'd need to take a similar approach to the one used for push notifications: break the files into small sets, put those sets on a queue, and run multiple processes consuming the sets and processing them one by one. Even then, we are talking about something that will be complex both to write and to deploy.
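
A minimal sketch of that batching idea, not actual Parse Server code: it reuses the getFiles adapter method proposed above, while queue, enqueueBatch/consumeBatch, and the isReferenced check are hypothetical helpers.

const BATCH_SIZE = 1000;

async function enqueueCleanupBatches(adapter, config, queue) {
  // List every file known to the files adapter and split the list into
  // fixed-size batches so no single worker has to hold all names in memory.
  const files = await adapter.getFiles(config);
  for (let i = 0; i < files.length; i += BATCH_SIZE) {
    await queue.enqueueBatch(files.slice(i, i + BATCH_SIZE));
  }
}

async function cleanupWorker(queue, isReferenced) {
  // Several worker processes can run this loop concurrently, each consuming
  // one batch at a time and deleting only files with no remaining reference.
  let batch;
  while ((batch = await queue.consumeBatch())) {
    for (const file of batch) {
      if (!(await isReferenced(file))) {
        await file.destroy({ useMasterKey: true });
      }
    }
  }
}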

@mtrezza
Member

mtrezza commented Nov 11, 2020

Good points. @dblythy can you find anything reusable in the files utils repo that has been mentioned before?

@dblythy
Member

dblythy commented Nov 11, 2020

I had a quick look through it and it seems to use a similar search algorithm to the one I wrote (look up the schema and look for "File" fields). I can have a more detailed look at that, and also at how the push notifications approach works, and work towards a cleanup feature along those lines.

@c0sm1cdus7

Was this ever implemented?

@dblythy
Member

dblythy commented May 12, 2021

The main reason this stalled is that figuring out whether a file is an "orphan" (as in, it is not associated with any parent object) is entirely dependent on the way that files are associated with objects. As a file can be set on object.field, on object.field[2].key.field[2], or be associated via relations, pointers, etc., it becomes rather difficult for a built-in algorithm to decide whether a file should be deleted or not.

If you're familiar with how your Parse Server determines file associations, you can do something similar to this:

// Assumes the GridFS files adapter and a running Parse Server in the same
// process; replace 'appId' with your application ID.
const Config = require('./node_modules/parse-server/lib/Config');
const app = Config.get('appId');
const bucket = await app.filesController.adapter._getBucket();
const files = [];
const fileDocuments = await bucket.find().toArray();
fileDocuments.forEach(({ filename }) => {
  const file = new Parse.File(filename);
  file._url = app.filesController.adapter.getFileLocation(app, filename);
  files.push(file);
});
// Loop through the files and check if they have any association. If not, delete.

@dblythy
Member

dblythy commented Aug 20, 2021

I think the main conceptual challenge here is:

  1. How to work out which files are orphans, which could be difficult with more complex schemas, such as nested files.
  2. How to make the feature backwards compatible so that references for existing files are tracked.

I'm thinking:

  1. We could let the developer write some sort of Parse.Cloud.checkFileParents handler, in which the developer is responsible for querying for the file in their existing data (see the sketch after this list).
  2. We could begin tracking file references from the new version onwards, and encourage developers to manually write queries and saves to build up the "FileObject" storage for existing files.
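
For illustration, a rough sketch of what such a hypothetical Parse.Cloud.checkFileParents handler could look like; this API does not exist, and the class/field names are examples only.

Parse.Cloud.checkFileParents(async file => {
  // The developer knows where files are referenced in their own schema,
  // so they run the queries themselves and report whether a parent exists.
  const query = new Parse.Query('Post');
  query.equalTo('attachment', file);
  const parent = await query.first({ useMasterKey: true });
  // Returning false marks the file as an orphan that is eligible for cleanup.
  return !!parent;
});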

@mtrezza
Member

mtrezza commented Aug 20, 2021

Good analysis @dblythy. Could we break this down into a minimum viable feature with some limitations?

  • The first version doesn't have to cover all special cases of how to reference a file, but only the basic way of file pointers.
  • Later on we can just add an API to manually change the counter (sketched below); it would be the developer's responsibility to manipulate the counter correctly, according to their use case; this also gives the most feature versatility.
  • Later on we could also cover some special reference cases (pointer arrays, or whatever).
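
A sketch of what such a manual counter API could look like if exposed as master-key-only Cloud Functions; neither the functions nor their parameters exist in Parse Server today, the names are illustrative only.

// Increment the counter when custom code creates its own reference to a file.
await Parse.Cloud.run(
  'incrementFileReferences',
  { fileName: 'photo.jpg', amount: 1 },
  { useMasterKey: true }
);

// Decrement it when that reference is removed; a count of zero would mark
// the file as eligible for cleanup.
await Parse.Cloud.run(
  'incrementFileReferences',
  { fileName: 'photo.jpg', amount: -1 },
  { useMasterKey: true }
);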

@dblythy
Member

dblythy commented Aug 20, 2021

I think that’s a good point. Perhaps for most users, having a collection of their files (uploaded by, view count, etc.) visible in the dashboard would already be an improvement.

We could document that the counter will only be accurate for simple data schemas, and leave the deletion of files up to the developer.

There could be the potential to bake in some common use cases, such as unique profile picture management.

@mtrezza
Member

mtrezza commented Aug 20, 2021

There could be the potential to bake in some common use cases, such as unique profile picture management.

I like that very much, e.g. using hashes to avoid uploading the same file multiple times and instead referencing the existing file? I think that would be a very practical use case. Yes, we can see this has a lot of potential, so getting a very basic first version of the feature released would be a good start.
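
For illustration, a minimal sketch of the hash-based deduplication idea, assuming a hypothetical FileObject class with a hash and a file column; this is not an existing Parse Server feature, and the class and function names are examples only.

const crypto = require('crypto');

async function saveFileDeduplicated(name, buffer) {
  const hash = crypto.createHash('sha256').update(buffer).digest('hex');
  // If a file with the same content hash was uploaded before, reuse it.
  const query = new Parse.Query('FileObject');
  query.equalTo('hash', hash);
  const existing = await query.first({ useMasterKey: true });
  if (existing) {
    return existing.get('file');
  }
  // Otherwise upload the new file and record its hash for future lookups.
  const file = new Parse.File(name, { base64: buffer.toString('base64') });
  await file.save({ useMasterKey: true });
  const record = new Parse.Object('FileObject');
  await record.save({ hash, file }, { useMasterKey: true });
  return file;
}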

@mtrezza mtrezza added the bounty:$100 label Oct 5, 2021
@mtrezza mtrezza added the type:feature label and removed the type:improvement label Dec 6, 2021