[ADD] fs_attachment_s3_migration: enable S3 migration #534
base: 16.0
Conversation
Module for migrating existing filestore attachments into the S3 backend, with queue_job orchestration.
| ("res_field", "=", False), | ||
| ("res_field", "!=", False), |
Why do we need such a domain leaf? Can't it simply be omitted?
Good catch! This is a required workaround for Odoo's ir.attachment._search method
which automatically adds ('res_field', '=', False) to any domain that doesn't
already contain 'res_field'. Without this tautology, only attachments where
res_field=False would be found, missing all field-linked attachments.
I'll add an explicit comment explaining this.
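For reference, a minimal sketch of such a comment (the helper name here is hypothetical; the domain itself matches the diff above):

```python
def _s3_migration_res_field_domain(self):
    # ir.attachment._search() silently appends ("res_field", "=", False) to
    # any domain that does not already mention "res_field". This tautological
    # leaf keeps field-linked attachments (res_field set) in the result set.
    return ["|", ("res_field", "=", False), ("res_field", "!=", False)]
```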
if not checksum:
    return None

rf_domain = ["|", ("res_field", "=", False), ("res_field", "!=", False)]
Same question about the domain here. If this is intentional, there should be an explicit comment explaining the reason behind it.
Same reason as above.
This module lets users move existing `ir.attachment` files from the
standard filestore or database into an Amazon S3-backed `fs.storage`, using a
wizard directly on the storage form.
Migrations run in background batches, skipping attachments that are already stored in S3 or that must remain in PostgreSQL.
This makes it possible to run the process repeatedly without creating duplicates.
* Cetmix (cetmix.com)
* Ivan Sokolov
* George Smirnov
Cetmix (cetmix.com)
- Ivan Sokolov
- George Smirnov
This setting is neither documented nor explained, yet it's a very important one.
Good point. I'll add a comment in the XML explaining the channel's purpose.
storage_id = fields.Many2one("fs.storage", required=True)
storage_code = fields.Char(required=True)
batch_size = fields.Integer(default=500)
channel = fields.Char(default="root.s3_migration")
I don't think it's a good idea to hardcode it this way, especially taking into account the fact that you are adding it via a data file.
You should assign the default value using a function that reads it from env.ref("fs_attachment_s3_migration.queue_channel_s3_migration").
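Something like the following could work (a sketch only; it assumes the data-file record keeps that XML ID and that `queue.job.channel` exposes `complete_name`):

```python
def _default_channel(self):
    # Resolve the channel created by this module's data file instead of
    # hardcoding its name; fall back if the record has been removed.
    channel = self.env.ref(
        "fs_attachment_s3_migration.queue_channel_s3_migration",
        raise_if_not_found=False,
    )
    return channel.complete_name if channel else "root.s3_migration"

channel = fields.Char(default=lambda self: self._default_channel())
```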
    default=0,
)

@api.onchange("storage_id")
I would recommend avoiding @onchange and using computed stored fields with readonly=False instead.
This will simplify usage as part of automation, including external calls.
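For illustration, a hypothetical sketch of that approach (it assumes `fs.storage` exposes a `code` field, as the wizard fields above suggest):

```python
# Inside the migration wizard model: replace the @api.onchange handler with a
# stored, editable computed field.
storage_code = fields.Char(
    compute="_compute_storage_code",
    store=True,
    readonly=False,  # still editable in the form, but derived by default
    required=True,
)

@api.depends("storage_id")
def _compute_storage_code(self):
    # Unlike @api.onchange, this also runs for records created or written
    # through the ORM (automation, external API calls), not only in forms.
    for wizard in self:
        wizard.storage_code = wizard.storage_id.code
```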
_logger.info(
    "Completed batch migration: %d/%d attachments to storage %s (%d skipped). "
    "Old files will be cleaned by garbage collection.",
by the garbage collector
    max_batches or "unlimited",
)

while True:
I still have some concerns regarding this approach.
Pros
- Multiple jobs can run in parallel -> faster.
Cons
- Jobs are created instantly from the current DB snapshot, so as soon as the migration is started no further changes are taken into account. E.g. if some attachments were removed, they will still be processed even though this is no longer needed.
- When run in parallel, several batches can access the same physical files (same checksum). If we modify the data in one of those batches, the modification will not be saved in the database until the transaction is committed. So we either need to commit these changes explicitly (questionable) or prepare the batches grouped by checksum, so that attachments with the same checksum are always in the same batch (sketched below).
Conclusion
Personally I would prefer to have those batches enqueued one after another, to avoid potential issues and to always work with an up-to-date database snapshot. However, I may be missing something here, so let's research this part better.
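To make the checksum concern concrete, a hypothetical sketch of checksum-grouped batch preparation (method and variable names are illustrative, not taken from the module):

```python
from collections import defaultdict


def _prepare_checksum_grouped_batches(self, attachments, batch_size):
    # Group attachments by checksum so that all records sharing the same
    # physical file always land in the same batch; parallel jobs then never
    # touch the same file, and no explicit commit is needed between them.
    by_checksum = defaultdict(list)
    for attachment in attachments:
        by_checksum[attachment.checksum].append(attachment.id)

    batches, current = [], []
    for ids in by_checksum.values():
        # Keep a checksum group intact even if it overflows the batch size.
        if current and len(current) + len(ids) > batch_size:
            batches.append(current)
            current = []
        current.extend(ids)
    if current:
        batches.append(current)
    return batches
```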
# Avoid empty writes if source is temporarily unreadable
if attachment.file_size and not file_data:
    resolved = self._s3_resolve_migration_bytes(attachment, storage_code)
This leads to the query being run N times, which significantly degrades performance, especially on databases with a large number of attachments.
I would suggest fetching the data for the entire batch (or even the entire DB 🤔) using read_group and then iterating over it.
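For example, something along these lines (a sketch only; variable names are illustrative):

```python
# One read_group call for the whole batch instead of one query per attachment:
# it returns one row per checksum together with the aggregated record count.
batch_ids = attachments.ids  # attachments: recordset of the current batch
groups = self.env["ir.attachment"].read_group(
    domain=[("id", "in", batch_ids)],
    fields=["checksum"],
    groupby=["checksum"],
)
count_by_checksum = {g["checksum"]: g["checksum_count"] for g in groups}
```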
2611c77 to ec02e7a
Helper module that migrates existing attachments to S3.