Added 'retry_reads_on_master' option for when read slaves are used #134

davidalger · 2018-09-21T14:00:20Z

I recently supported the infrastructure launch on a Magento 2 site which sees high enough traffic levels to require the use of a read slave for the object cache. Without the slave, cache reads will saturate the 1Gbps interface between the web servers and the master Redis instance.

During pre-launch load testing, everything worked great and we reached proper traffic levels by simply adding a read slave to each of the web servers and configuring the load_from_slave option on the cache backend. This eliminated the network interface as a bottleneck as well as lowered overall application response times by around 50-60 ms on average (because there is practically no latency to redis on cache hits now). But this was without a long-list of heavy hitting integrations that result in lots of cache purging activity for short periods of time (integrations which are out of our control).

Once live we kept experiencing instances of application responses spiking. They were seemingly very random and didn't always correlate to said integrations, but did seem to be either timing wise around these integrations slamming the API and/or admins editing products in the backend.

When a cache flush occurs, there is always a cache flood, especially when under high traffic. The net effect of this is compounded when read slaves are used. Under the circumstances we experienced the performance degradation severe enough to grind things to a halt a couple times during a flash sale and the impact to the redis replication was a toppling effect resulting in floods of LOADING Redis is loading the dataset in memory errors as they continued trying to perform full re-synchronizations due to the cache floods filling replication buffers of over 4 GB in size.

The solution to this problem was to add support for the retry_reads_on_master setting to make configurable the retry of reads on master when the read comes back from the slave as empty. This has been in production for 5 days now and the error has not reoccurred.

The below is a screenshot of what happens to the object cache during these steady and repeated cache floods. The giant spike in the purple graph is not an increase in data, but a replication buffer filling up to over 4 GB of writes (from the cache floods)

The solution of simply retrying empty read responses on master has the effect of bringing the impact of a cache flood back to where it was when only a master redis instance was used, allows us to put the replication buffer back down to 768 mb (it could probably go even lower without issue, but we're stable now, so I'm not changing it).

These are the settings we used on the replicating redis instances:

repl-diskless-sync yes
repl-backlog-size 256mb
client-output-buffer-limit slave 768mb 768mb 0

Cache backend configuration in the env.php file now looks like this:

  'cache' => 
  array (
    'frontend' => 
    array (
      'default' => 
      array (
        'backend' => 'Cm_Cache_Backend_Redis',
        'backend_options' => 
        array (
          'server' => '940365-masterdb',
          'port' => '6379',
          'load_from_slave' => 'tcp://127.0.0.1:6379',
          'retry_reads_on_master' => '1',
        ),
      ),
    ),
  ),

Let me know if there are questions. I'd love to see this merged in so it can trickle it's way down into the core and allow us to eventually remove our package override in the composer.json for this site. :)

…on master when slave returns empty result

colinmollenhour · 2018-09-21T21:40:17Z

Hi David, have you tried this?
https://gist.github.com/colinmollenhour/7a91c4a92ccfd2adaeb6

It makes a huge difference for stability but with a high-traffic site you still want to avoid ever clearing the config cache but instead replace it.

The PR looks good though so no objection to merging it.

davidalger · 2018-09-22T15:31:50Z

@colinmollenhour This site is running Magento 2 not Magento 1 so I haven't tried that. I may look through it though and see how difficult it would be to port it to M2 if we continue running into issues here, so thanks for the tip nonetheless!

colinmollenhour · 2018-09-24T16:02:02Z

Ahh yes, you did say M2. No idea what the config cache loading looks like in M2 but the three major issues I've seen with M1 config cache are:

Clearing config cache should be avoided by using atomic update (e.g. for M1 my cleanCache.php script or n98-magerun command)
Config cache should be set to never expire (as long as invalidation is correct there is no need to expire it)
Code should be patched to prevent cold cache from causing stampede (like this patch - sleep and retry until cache is warm)

davidalger · 2018-09-27T14:31:58Z

@colinmollenhour Thanks for merging this. Would it be possible to tag a new release version so I can send a PR updating the composer.json dependency on the Magento 2 project? Something tells me specifying a commit hash or dev-master won't fly.

colinmollenhour · 2018-09-27T14:35:12Z

Tagged 1.10.6

See colinmollenhour/Cm_Cache_Backend_Redis#134 for details on updates to library

boldhedgehog · 2018-10-03T08:31:10Z

@davidalger thanks for your work.
We still experience the "LOADING" issue, however. You fix only checks if the response is a FALSE, but in our case Credis library throws and Exception, and it is not handled, and the read is not retried from master.

                if($this->isMulti || $this->usePipeline) {
                    $response = FALSE;
                } else if ($name == 'evalsha' && substr($reply,0,9) == '-NOSCRIPT') {
                    $response = NULL;
                } else {
                    throw new CredisException(substr($reply,0,4) == '-ERR' ? substr($reply, 5) : substr($reply,1));
                }

As you may see here, if the request is neither pipelined nor multi an exception is thrown, and what I saw from the code hGet() call does not match the condition.

Do you have any advice for me?

Added 'retry_reads_on_master' option to support optional read replay …

326adc3

…on master when slave returns empty result

colinmollenhour merged commit cc941a5 into colinmollenhour:master Sep 24, 2018

davidalger added a commit to davidalger/magento2 that referenced this pull request Sep 27, 2018

Update colinmollenhour/cache-backend-redis from version 1.10.5 to 1.10.6

32058c7

See colinmollenhour/Cm_Cache_Backend_Redis#134 for details on updates to library

davidalger mentioned this pull request Sep 27, 2018

Update colinmollenhour/cache-backend-redis from version 1.10.5 to 1.10.6 magento/magento2#18294

Merged

4 tasks

davidalger deleted the verify-slave-load-from-master branch October 3, 2018 13:19

colinmollenhour mentioned this pull request Dec 29, 2022

Handle LOADING state as if key was not found. #171

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Added 'retry_reads_on_master' option for when read slaves are used #134

Added 'retry_reads_on_master' option for when read slaves are used #134

Uh oh!

davidalger commented Sep 21, 2018

Uh oh!

colinmollenhour commented Sep 21, 2018

Uh oh!

davidalger commented Sep 22, 2018

Uh oh!

colinmollenhour commented Sep 24, 2018

Uh oh!

davidalger commented Sep 27, 2018 •

edited

Loading

Uh oh!

colinmollenhour commented Sep 27, 2018

Uh oh!

boldhedgehog commented Oct 3, 2018

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Added 'retry_reads_on_master' option for when read slaves are used #134

Added 'retry_reads_on_master' option for when read slaves are used #134

Uh oh!

Conversation

davidalger commented Sep 21, 2018

Uh oh!

colinmollenhour commented Sep 21, 2018

Uh oh!

davidalger commented Sep 22, 2018

Uh oh!

colinmollenhour commented Sep 24, 2018

Uh oh!

davidalger commented Sep 27, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

colinmollenhour commented Sep 27, 2018

Uh oh!

boldhedgehog commented Oct 3, 2018

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

davidalger commented Sep 27, 2018 •

edited

Loading