"Clean Up Files" Feature #1023
This would be pretty difficult actually, and would need to be built for each specific Files adapter. Right now, there's no 'listing' of what files exist through the adapter.
+1, agree with the need.
Is it possible to clean the unused files stored in GridStore now?
+1, it's a very useful feature.
+1, it would be nice.
+1
+1, very much needed
+1
Just asking: how many of you ever actually needed a file after deleting the pointers to it? I feel the most common use of files is "if I delete the pointer, I don't need the file anymore". If this is the case, why not make it the default in parse-server? I mean that when any object is deleted, its files are processed and deleted afterwards. Given how tricky the full task is, it would also be cool if parse-server kept a Files table with url and usage_count, to simplify all the rest.
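To make the reference-counting suggestion above concrete, here is a minimal in-memory sketch of the proposed Files table with a usage count. The class name `FileUsage` and its methods are hypothetical; in practice this would be a Parse class maintained from save/delete hooks rather than a `Map`.

```javascript
// Hypothetical sketch of a per-file usage counter, as proposed above.
// In a real deployment this state would live in a Parse class, not in memory.
class FileUsage {
  constructor() {
    this.counts = new Map(); // file url -> usage_count
  }

  // Called when an object starts referencing a file.
  addReference(url) {
    this.counts.set(url, (this.counts.get(url) || 0) + 1);
  }

  // Called when an object referencing the file is deleted.
  // Returns true when no references remain, i.e. the file
  // would be safe to delete from storage.
  removeReference(url) {
    const next = (this.counts.get(url) || 0) - 1;
    if (next <= 0) {
      this.counts.delete(url);
      return true;
    }
    this.counts.set(url, next);
    return false;
  }
}
```

With such a counter in place, "clean up files" reduces to deleting every file whose count has reached zero, instead of scanning every class.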
@natario1 |
@abdulwasayabbasi makes sense, thank you. Just wondering how frequent that is. Your use case would not be affected by an 'auto delete' feature, since you are just updating the file field. To take advantage of it, you would have to create a new object with the new package file and delete the older object when you feel safe, so the old file gets auto-deleted.
I made my own "clean files" script. Maybe it could help someone! https://gist.github.com/Lokiitzz/6afbf0573665d3170ffb1e83565a0fef Be careful :)
Why not a PR to Parse Server? :)
The code won't work on the server as it loads all objects into memory.
Yes, you're right. I didn't check it before.
For features like this, I'd love to see command line tools rather than just another endpoint that requires maintenance.
Why not pass an "auto-delete-files" flag to the server on startup, so that when an individual file pointer is deleted or replaced, the file is deleted? This feature would help the 50% of people who only use PFFiles for profile pictures (files that won't be needed after deletion or replacement), while leaving the other 50% who want fine-grained control unaffected, because they didn't pass the flag. Would this be a valid solution?
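The auto-delete idea above can be sketched as a pure helper that compares an object's file fields before and after a save or delete, and reports which files lost their pointer. The function name `filesToAutoDelete` is hypothetical, not a Parse Server API; it would be wired into `afterSave`/`afterDelete` triggers only when the proposed flag is set.

```javascript
// Hypothetical helper for the "auto-delete-files" flag idea:
// given the file fields of an object before and after a change,
// return the file names whose pointers were removed or replaced.
// Pass `after = null` for a deleted object.
function filesToAutoDelete(before, after) {
  const doomed = [];
  for (const [field, oldFile] of Object.entries(before)) {
    if (!oldFile) continue;
    const newFile = after ? after[field] : undefined;
    // Pointer was cleared, the object was deleted, or the file was replaced.
    if (!newFile || newFile !== oldFile) {
      doomed.push(oldFile);
    }
  }
  return doomed;
}
```

For the profile-picture case this returns the old picture whenever it is replaced, which is exactly the behavior the flag would opt into.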
I also have this problem. I deleted a lot of rows in my mongodb database with parse-dashboard, including the references to many images. Now I am unable to find them and clean them up. Is there any other (manual) way? I expected parse-dashboard to clean up PFFiles before removing the references to them.
Any progress on this?
Not yet. This is not a feature that is actively worked on, but a pull request or a separate project could take care of it.
Depending on how you look at it, this is either an undocumented "feature" or a huge bug. Either way, it has huge and expensive consequences that should at the very least be well documented.
What do you mean by that? This is neither documented nor a bug, as it's just not implemented: neither listing the missing files nor deleting an existing file through the file adapters. Because a file could be referenced by multiple objects, we don't keep a reference count on them. It is, however, not trivial to implement.
What do you think of this approach @mtrezza?

`FilesController.js`:

```js
async cleanUpFiles(database) {
  if (!this.adapter.getFiles) {
    return;
  }
  const files = await this.adapter.getFiles(this.config);
  if (files.length == 0) {
    return;
  }
  const schema = await database.loadSchema();
  const all = await schema.getAllClasses();
  const classQueries = {};
  // Map each class to its fields of type 'File'.
  for (const field of all) {
    const fields = field.fields;
    for (const fname of Object.keys(fields)) {
      const fl = fields[fname];
      if (fl.type == 'File') {
        const classData = classQueries[field.className] || [];
        classData.push(fname);
        classQueries[field.className] = classData;
      }
    }
  }
  if (Object.keys(classQueries).length == 0) {
    return;
  }
  for (const file of files) {
    try {
      const promises = [];
      // For each class, build an OR query over all its File fields.
      for (const className of Object.keys(classQueries)) {
        const keys = classQueries[className];
        const queries = [];
        for (const key of keys) {
          const query = new Parse.Query(className);
          query.equalTo(key, file);
          queries.push(query);
        }
        let orQuery = new Parse.Query(className);
        orQuery = Parse.Query.or.apply(orQuery, queries);
        orQuery.select('objectId');
        promises.push(orQuery);
      }
      const data = await Promise.all(promises.map(query => query.first({ useMasterKey: true })));
      let remove = true;
      for (const obj of data) {
        if (obj) {
          remove = false;
          break;
        }
      }
      if (!remove) {
        continue;
      }
      // No class references this file, so it is safe to delete.
      await file.destroy({ useMasterKey: true });
    } catch (e) {
      // ** //
    }
  }
}
```

And then:

```js
async getFiles(config) {
  const bucket = await this._getBucket();
  const files = [];
  const fileNamesIterator = await bucket.find().toArray();
  fileNamesIterator.forEach(({ filename }) => {
    const file = new Parse.File(filename);
    file._url = this.getFileLocation(config, filename);
    files.push(file);
  });
  return files;
}
```

And then attached to a route.

Conceptually, this looks up the schema for all classes and figures out which fields are of type File. Next, for each file, it queries those fields in the respective classes, and if there is no reference, it removes the file. It takes about 2-3 min per 1000 files. Tested on my servers and works well. It could be faster, but I was conscious of query limits removing files by accident; I wanted to be 100% sure the file is unreferenced prior to deletion.
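The first step of the approach above, mapping classes to their File fields, can be isolated as a pure function and tested without a server. This is a standalone sketch; the shape of the schema array (`className` plus a `fields` object with `type` entries) is assumed from the snippet above.

```javascript
// Standalone sketch of the schema scan in cleanUpFiles: given the array
// returned by schema.getAllClasses(), return a map from class name to
// the names of its fields of type 'File'.
function collectFileFields(allClasses) {
  const classQueries = {};
  for (const cls of allClasses) {
    for (const [name, field] of Object.entries(cls.fields)) {
      if (field.type === 'File') {
        if (!classQueries[cls.className]) {
          classQueries[cls.className] = [];
        }
        classQueries[cls.className].push(name);
      }
    }
  }
  return classQueries;
}
```

Classes without File fields are simply absent from the result, which is what lets the cleanup skip them entirely.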
It is a good start, but there are cases in which files are not stored in a field of type File. Sometimes people store references to files in arrays and objects. I've also seen people just upload files and never reference them in any other object. So I'm afraid of having this kind of script run automatically.
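A way to address the arrays-and-objects concern above would be a deep scan of each object's JSON for the file name, rather than querying only File-typed fields. This is purely illustrative (a string-match heuristic, not Parse Server code), and it still cannot catch files that are never referenced by any object at all.

```javascript
// Illustrative deep reference check: recursively scan a plain JSON value
// for a given file name, so references hidden in arrays and nested
// objects are also found.
function referencesFile(value, fileName) {
  if (typeof value === 'string') {
    return value.includes(fileName);
  }
  if (Array.isArray(value)) {
    return value.some(v => referencesFile(v, fileName));
  }
  if (value && typeof value === 'object') {
    return Object.values(value).some(v => referencesFile(v, fileName));
  }
  return false;
}
```

The trade-off is cost: this requires fetching and walking every object, which is far more expensive than a handful of indexed equality queries.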
Hmmm, interesting. What do you think of requiring the locations of the files in the POST request to delete files, e.g.:
Or perhaps add a callback in Parse.Cloud for whether a file should be deleted once it's been flagged for "cleanup". The only other solution I can think of is to query every object and loop through its fields to check for the file, which would be quite intensive. Either way, warnings about the caveats will have to be shown in the dashboard / docs prior to running the function.
Actually, the current approach of only searching the File fields is already very intensive, depending on the size of the collections and how many files the app has. This is probably a script to run not in the parse-server process but via a CLI.
I think if we can get to a PR that covers probably the most common case which is storing a file in a field of type File, we would already make many people happy. Maybe other creative ways of storing files can be addressed in a follow-up PR.
Are these files still needed or should they be cleaned up?
I agree. Such a script should not run automatically (without control of schedule and batch size anyway), because these mass queries can have a significant performance impact / cost implication on external resources. Other thoughts:
I agree, the risks and caveats should be explicitly stated, so people who store files in more complex structures understand not to use the cleanup, or the risks associated with running /cleanupfiles.
I'd gather it would be a button in the dashboard (as with parse.com) that would be run once every month or so. I wouldn't propose running it unless the developer directly triggers it.
Honestly, I wouldn't imagine performance would be great, especially with configurations that have multiple File fields in their schemas, as it queries files and classes one by one. I'd previously written it to use containedIn, but again I was worried about query limits not returning all the associated objects. I would imagine it would take a while and would be a background task (e.g. "we're now cleaning up your files").
I would imagine that would speed up the cleanup time. Maybe we could recommend creating indexes on File fields if you're using the cleanup? Would running all the individual queries of the individual objects in parallel speed it up? Also, is it worth removing
Via an API trigger:
I'd not go with an API route. This process should not run in the same process as Parse Server; it may cause the app to be unresponsive for an app with a large amount of files and objects.

I agree with a first simple version, but we do need to make sure that there is a big alert for developers before firing the script. If via the dashboard, it should be something like what we currently have in place for deleting all rows in a class. The caveat here is not only files not being deleted in a more complex structure; a lot of files could actually be deleted by accident in a more complex structure.

We need to keep in mind that the files feature is not only meant for referenced files. It is a file repository, and those files may never be referenced by any object. We are building a feature that is conceptually the same as a feature to automatically delete all objects of a class that are not referenced by any other object. It is a valid feature, but we need to make sure that developers know what they are doing.

Also, let's first agree on the API and how this feature will work; I may have some code to share.
A lot of ideas can be found in this project: https://github.com/parse-server-modules/parse-files-utils It is an old project, but it has some code in place to search for all files in all objects of an app.
@mtrezza I believe we should reopen this issue, right? What is the new procedure?
@davimacedo Yes, thanks, the procedure is to re-open and remove the
My first thought was that this script should not even be part of Parse Server, but an external tool. But then I thought we could make it part of Parse Server for convenience and advise developers to spin up a new, dedicated instance of Parse Server that does not take any app requests for this purpose. Like a LiveQuery server.
Yes, it should definitely be more than a simple "Are you sure? Yes/No" dialog, with the info:
Do you have any example use cases in mind for unreferenced files in a storage bucket, so we can get a better feel for how many deployments would be affected? I can only think of files like logs that are stored for manual retrieval, or maybe files that are processed automatically by a script of the storage bucket provider. All rare use cases, I think. I think the current script is more of a proof of concept. It is not scalable and would almost certainly crash or block the DB for an unacceptable amount of time on any seriously sized production system.
That's why I'd not go with the script in the API. It will be only a matter of time before people start complaining about the script not working. The same happened with the push notifications system. It took a long time to have a scalable process because previously it was a single parse-server instance trying to handle all pushes. For this to be scalable in the API, we'd need to take an approach similar to the one in push notifications: break the files into small sets, put those sets on a queue, and run multiple processes consuming the sets and processing them one by one. Even then, we are talking about something that will be complex to write and also to deploy.
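The batching part of the queue approach described above is straightforward to sketch. This is a minimal illustration; the batch size of 100 is an arbitrary assumption, and in a real deployment the batches would be pushed onto a queue (as with push notifications) rather than returned as an array.

```javascript
// Sketch of the first half of the queue approach: split the full file
// list into small sets that independent worker processes could consume.
function toBatches(files, batchSize = 100) {
  const batches = [];
  for (let i = 0; i < files.length; i += batchSize) {
    batches.push(files.slice(i, i + batchSize));
  }
  return batches;
}
```

Each worker would then run the reference check and deletion for its own batch, so a slow or crashed worker only stalls one set of files instead of the whole cleanup.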
Good points. @dblythy, can you find anything reusable in the files utils repo that was mentioned before?
I had a quick look through it, and it seems to use a search algorithm similar to what I wrote (look up the schema and look for "File" fields). I can have a more detailed look at that, and also at how the push notifications approach is done, and work towards a cleanup feature similar to that.
Was this ever implemented? |
The main reason this stalled was that figuring out whether a file is an "orphan" (as in, it is not associated with any parent object) is entirely dependent on the way that files are associated with objects, since a file can be referenced from any field or nested structure. If you're familiar with how your Parse Server determines file associations, you can do something similar to this:

```js
const Config = require('./node_modules/parse-server/lib/Config');

const config = Config.get('appId');
const bucket = await config.database.adapter._getBucket();
const files = [];
const fileNamesIterator = await bucket.find().toArray();
fileNamesIterator.forEach(({ filename }) => {
  const file = new Parse.File(filename);
  file._url = config.filesController.adapter.getFileLocation(config, filename);
  files.push(file);
});
// loop through files and check if they have any association. If not, delete.
```
I think the main conceptual challenge here is:
I'm thinking:
Good analysis @dblythy. Could we break this down into a minimum viable feature with some limitations?
I think that's a good point. Perhaps for most users, being able to have a collection of their files (uploaded by, view count, etc.) visible in the dashboard would already be an improvement. We could note that the counter will only be accurate for simple data schemas, and leave the deleting of files up to the developer. There is also potential to bake in some common use cases, such as unique profile picture management.
I like that very much. Like with hashes, to avoid uploading the same file multiple times and reference the existing file instead? I think that would be a very practical use case. Yes, we can see this has a lot of potential, so getting a very basic first feature version released would be a good start.
One of the features that I liked on the hosted Parse was, in the settings, the Clean Up Files button. This way, every file stored in S3, for example, that was no longer referenced from a PFFile would be deleted. I liked it especially because it allowed us to save on unused/unneeded resources. Maybe a REST call using the master key would initially be enough? In the future, possibly with integration into parse-dashboard?
I know it's lower priority compared to the features/fixes that are being developed, but that would be great to have.