Skip to content

Added hash based check to prevent image duplication on product import #21146

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

erfanimani
Copy link
Contributor

@erfanimani erfanimani commented Feb 12, 2019

Description

I've added a MD5 based hash check to prevent image duplication upon successive product CSV imports.

Note that this is a proof of concept, even though it works — sort of.

The problem is that the Add/Update product import method should be declarative, and thus idempotent — it should not cause side effects when importing the same CSV multiple times.

As it stands now, this is not the case (confirmed), running the import multiple times causes images to be added. The "Replace" import method is beyond silly for production systems as it trashes the old IDs (and takes quotes, reports, wish lists, comparison lists, and so on with it).

We also can't replace just the images as it causes the different filenames, thus realistically a filename + hash comparison is the way to go. But filename checks are hard because we're not keeping track of the file metadata upon import (and as the file is renamed when Magento moves it into media/catalog/product, any reference to the old filename is lost) — so the easiest solution that kinda works is a hash based check.

I want to know whether this is the way to implement it or not. It's already a messy, mostly undocumented, class and no straight forward way to implement it properly, but any guidance would be helpful.

  • I haven't fully tested this yet
  • I'm not happy with code quality yet
  • Doesn't delete images, only skips addition of existing images
  • Doesn't rename the files — only compares file hash (to do this, I believe, would require Magento to store file metadata)
  • Does not work when images are referenced using HTTP
  • Crashes if image was not found (Warning: md5_file: failed to open stream..)

Fixed Issues

Manual testing scenarios (*)

  1. Import product CSV using Add/Update method multiple times
  2. Check if image duplication happens or not

There's many testing scenario's actually. Some of them being:

  • Use HTTP references in images
  • Same filename different file
  • Different filename same file
  • Same file path for multiple products
  • Change image labels/titles, ensure they're updated
  • Manipulate thumbnail, small image and base image tags, and ensure it's updated properly after import
  • etc.

Please let me know how to proceed and I can amend my commit message and implement it properly.

@magento-engcom-team
Copy link
Contributor

Hi @erfanimani. Thank you for your contribution
Here is some useful tips how you can test your changes using Magento test environment.
Add the comment under your pull request to deploy test or vanilla Magento instance:

  • @magento-engcom-team give me test instance - deploy test instance based on PR changes
  • @magento-engcom-team give me 2.3-develop instance - deploy vanilla Magento instance

For more details, please, review the Magento Contributor Assistant documentation

@raivisdejus
Copy link

@erfanimani I have used your PR and added image deletion to it #21855

Thanks for the PR, saved me quite a lot of time

*
* @param array $images
*/
public function addImageHashes(&$images) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, make a method private and add strict types ti the method

*
* @return string
*/
protected function getImportDir()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, make a method private and add strict types ti the method

@ghost ghost assigned sidolov Aug 23, 2019
@sidolov
Copy link
Contributor

sidolov commented Sep 6, 2019

@erfanimani , I am closing this PR now due to inactivity.
Please reopen and update if you wish to continue.
Thank you for the collaboration!

@sidolov sidolov closed this Sep 6, 2019
@m2-assistant
Copy link

m2-assistant bot commented Sep 6, 2019

Hi @erfanimani, thank you for your contribution!
Please, complete Contribution Survey, it will take less than a minute.
Your feedback will help us to improve contribution process.

@cptX
Copy link

cptX commented Mar 3, 2025

As I just realized that magento uses different copies of images for the same images when uploaded manually during new product creation I want to ask if the attempt here to solve the issue would work also when creating a new product and reupload the same images. Would this also use the hash check or the check runs only during csv import?

@erfanimani
Copy link
Contributor Author

@cptX it's been a while, but I believe it just affects the CSV import.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants