Added 'retry_reads_on_master' option for when read slaves are used #134
I recently supported the infrastructure launch on a Magento 2 site which sees high enough traffic levels to require the use of a read slave for the object cache. Without the slave, cache reads will saturate the 1Gbps interface between the web servers and the master Redis instance.
During pre-launch load testing, everything worked great and we reached proper traffic levels by simply adding a read slave to each of the web servers and configuring the `load_from_slave` option on the cache backend. This eliminated the network interface as a bottleneck and lowered overall application response times by around 50-60 ms on average (because there is practically no latency to Redis on cache hits now). But this was without a long list of heavy-hitting integrations that cause lots of cache purging activity for short periods of time (integrations which are out of our control).

Once live, we kept experiencing instances of application response times spiking. They seemed very random and didn't always correlate with said integrations, but they did tend to coincide with these integrations slamming the API and/or admins editing products in the backend.
When a cache flush occurs, there is always a cache flood, especially under high traffic. The net effect of this is compounded when read slaves are used. Under the circumstances, the performance degradation we experienced was severe enough to grind things to a halt a couple of times during a flash sale, and the impact on Redis replication had a toppling effect, resulting in floods of `LOADING Redis is loading the dataset in memory` errors as the slaves kept attempting full re-synchronizations because the cache floods were filling replication buffers over 4 GB in size.

The solution to this problem was to add support for a `retry_reads_on_master` setting, which makes it configurable to retry a read on the master when the read comes back empty from the slave. This has been in production for 5 days now and the error has not reoccurred.

Below is a screenshot of what happens to the object cache during these steady and repeated cache floods. The giant spike in the purple graph is not an increase in data, but a replication buffer filling up with over 4 GB of writes (from the cache floods).
Simply retrying empty read responses on the master brings the impact of a cache flood back to where it was when only a master Redis instance was used, and allows us to put the replication buffer back down to 768 MB (it could probably go even lower without issue, but we're stable now, so I'm not changing it).
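Conceptually, the read path with the new option enabled looks something like the sketch below. This is a simplified illustration of the idea rather than the actual patch; the property and constant names are approximations of the backend's internals.

```php
<?php
// Simplified sketch only -- not the actual diff from this PR.
// $this->_slave is the local read slave, $this->_redis is the master connection,
// and $this->_retryReadsOnMaster mirrors the new retry_reads_on_master option.
protected function _loadData($id)
{
    // Normal load_from_slave behaviour: read the cache entry from the local slave.
    $data = $this->_slave->hGet(self::PREFIX_KEY . $id, self::FIELD_DATA);

    if ($data === false && $this->_retryReadsOnMaster) {
        // The slave returned nothing -- for example because it is still loading
        // the dataset after a full resync -- so retry once against the master
        // before treating this as a cache miss.
        $data = $this->_redis->hGet(self::PREFIX_KEY . $id, self::FIELD_DATA);
    }

    return $data;
}
```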
These are the settings we used on the replicating redis instances:
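In short, the important change was giving the slave client output buffer enough headroom that a cache flood no longer blows through it and forces a full resync. Something along these lines (the hard limit matches the 768 MB mentioned above; the soft limit and window are illustrative rather than our exact values):

```
# redis.conf on the master (and any instance feeding a replica).
# Hard limit of 768mb; the soft limit and window here are examples -- tune to your own write volume.
client-output-buffer-limit slave 768mb 256mb 60
```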
Cache backend configuration in the `env.php` file now looks like this (hosts, ports, and database numbers below are placeholders for our actual servers):
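```php
<?php
// app/etc/env.php (excerpt) -- hosts, ports, and database numbers are placeholders.
return [
    'cache' => [
        'frontend' => [
            'default' => [
                'backend' => 'Cm_Cache_Backend_Redis',
                'backend_options' => [
                    // Master instance used for writes (and retried reads).
                    'server' => 'redis-master.internal',
                    'port' => '6379',
                    'database' => '0',
                    // Local read slave so cache hits never cross the network.
                    'load_from_slave' => 'tcp://127.0.0.1:6379',
                    // New option from this PR: retry empty slave reads on the master.
                    'retry_reads_on_master' => '1',
                ],
            ],
        ],
    ],
];
```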
Let me know if there are questions. I'd love to see this merged in so it can trickle its way down into the core and allow us to eventually remove our package override in the composer.json for this site. :)