Document failover support #818

Closed
sammefford opened this issue Sep 18, 2017 · 18 comments
Comments

@sammefford
Contributor

No description provided.

@vivekmuniyandi
Contributor

vivekmuniyandi commented Sep 21, 2017

This is documentation on Failover in DMSDK.

Failover Support

 
To support failover, DMSDK relies on two main listeners: the HostAvailabilityListener, which retries the requests that fetch URIs from the server, and the RetryListener, which retries the QueryBatchListener after the batch has been retrieved from the server. The HostAvailabilityListener is automatically registered with the QueryBatcher and WriteBatcher when the instances are created. It has a few configurable settings: minHosts (the minimum number of hosts required for the job to run), suspendTimeForHostUnavailable (how long an unavailable host stays suspended), and hostUnavailableExceptions (the list of Throwable types that indicate a host is unavailable). The defaults are:

  • minHosts set to 1
  • suspendTimeForHostUnavailable set to 10 minutes
  • hostUnavailableExceptions set to SocketException, SSLException, UnknownHostException

These defaults can be customized through the HostAvailabilityListener instance that the static getInstance method returns for a batcher. For example:

HostAvailabilityListener.getInstance(batcher)
    .withSuspendTimeForHostUnavailable(Duration.ofMinutes(60))
    .withMinHosts(2);

 
This would set the minHosts to 2 and the suspendTimeForHostUnavailable to 60 minutes for the HostAvailabilityListener associated with the Batcher.
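To make the three settings concrete, here is a minimal stdlib-only sketch of how they might be modeled. FailoverSettingsSketch and all of its members are hypothetical stand-ins, not the DMSDK HostAvailabilityListener API, and walking the cause chain when matching exceptions is an assumption rather than documented behavior.

```java
import java.net.SocketException;
import java.net.UnknownHostException;
import java.time.Duration;
import java.util.List;
import javax.net.ssl.SSLException;

/** Hypothetical stand-in for the three settings; not the DMSDK class. */
final class FailoverSettingsSketch {
    // Defaults as listed in the write-up above.
    int minHosts = 1;
    Duration suspendTime = Duration.ofMinutes(10);
    List<Class<? extends Throwable>> hostUnavailableExceptions =
        List.of(SocketException.class, SSLException.class, UnknownHostException.class);

    FailoverSettingsSketch withMinHosts(int n) { minHosts = n; return this; }

    FailoverSettingsSketch withSuspendTime(Duration d) { suspendTime = d; return this; }

    /** True if the throwable, or any of its causes (an assumption), is a listed type. */
    boolean indicatesHostUnavailable(Throwable t) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            for (Class<? extends Throwable> type : hostUnavailableExceptions) {
                if (type.isInstance(c)) return true;
            }
        }
        return false;
    }
}
```

With this model, `new FailoverSettingsSketch().withMinHosts(2).withSuspendTime(Duration.ofMinutes(60))` mirrors the customization shown above, and a RuntimeException wrapping a SocketException would be classified as a host-unavailable failure.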
 
HostAvailabilityListener is registered automatically with both the QueryBatcher and WriteBatcher so that batches which fail with one of the exceptions listed in hostUnavailableExceptions are retried. When a host is detected to be down (that is, a batch fails with one of those exceptions), we check whether enough hosts remain to satisfy minHosts. If so, the job continues; if not, we stop the job.
 
If the minHosts criterion is satisfied, the existing ForestConfiguration is modified to account for the host that went down. The host remains suspended for the time specified in suspendTimeForHostUnavailable. We schedule an asynchronous task to read the ForestConfiguration from the server again once that time has elapsed. Until then, we use the modified ForestConfiguration without the host that went down.
 
The unavailable host is blacklisted in the ForestConfiguration and replaced with a randomly chosen available host that is up and reachable. Thus, the list of forests obtained from the modified ForestConfiguration no longer references the unavailable host, which avoids further connection failures against it. This configuration stays in effect for the time specified in suspendTimeForHostUnavailable, after which we read the ForestConfiguration from the server and update ours accordingly.
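The suspend-and-refresh flow described above can be sketched with the standard library alone. This is only an illustrative model: HostSuspensionSketch, its forest-to-host map, and every method name are hypothetical and not part of DMSDK, and the exact boundary of the minHosts check (stopping when removing the host would leave fewer than minHosts) is an assumption.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Stdlib-only model of the flow; all names here are hypothetical. */
final class HostSuspensionSketch {
    private final Map<String, String> forestToHost;   // forest name -> host used for it
    private final Set<String> availableHosts;
    private final int minHosts;
    private final Random random = new Random();

    HostSuspensionSketch(Map<String, String> serverView, int minHosts) {
        this.forestToHost = new HashMap<>(serverView);
        this.availableHosts = new TreeSet<>(serverView.values());
        this.minHosts = minHosts;
    }

    /** Returns false when the job should stop instead of continuing without the host. */
    synchronized boolean suspend(String downHost) {
        if (!availableHosts.contains(downHost)) return true;  // already suspended
        // Assumption: stop if removing the host would leave fewer than minHosts
        // (and always keep at least one host to route requests to).
        if (availableHosts.size() - 1 < Math.max(1, minHosts)) return false;
        availableHosts.remove(downHost);
        List<String> candidates = new ArrayList<>(availableHosts);
        // Route every forest of the suspended host through a random available host.
        forestToHost.replaceAll((forest, host) ->
            host.equals(downHost) ? candidates.get(random.nextInt(candidates.size())) : host);
        return true;
    }

    /** Stands in for re-reading the ForestConfiguration from the server. */
    synchronized void refresh(Map<String, String> serverView) {
        forestToHost.clear();
        forestToHost.putAll(serverView);
        availableHosts.clear();
        availableHosts.addAll(serverView.values());
    }

    /** Schedules the asynchronous re-read once the suspend time has elapsed. */
    ScheduledFuture<?> scheduleRefresh(ScheduledExecutorService executor,
                                       Supplier<Map<String, String>> readFromServer,
                                       Duration suspendTime) {
        return executor.schedule(() -> refresh(readFromServer.get()),
                                 suspendTime.toMillis(), TimeUnit.MILLISECONDS);
    }

    synchronized Set<String> hostsInUse() { return new TreeSet<>(forestToHost.values()); }
}
```

In this model, a caller would invoke suspend(host) when a host-unavailable exception is seen, stop the job if it returns false, and otherwise call scheduleRefresh with the configured suspend time.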

Provided Listeners like ApplyTransformListener

These listeners are registered with a QueryBatcher to perform actions such as applying a transform or deleting documents. HostAvailabilityListener handles retries of failed batches by attempting to fetch the batch of URIs from the server again. However, there may be scenarios where the batch of URIs was retrieved from the server but applying the listener (the transform or delete) to that batch failed. To enable retries in that scenario, the listener must initialize a RetryListener (a nested class of HostAvailabilityListener) and add it to its list of failure listeners. The RetryListener takes care of re-applying the listener to the failed batch.
 

Create Custom Listeners that can handle Failover

For a custom listener to handle failover, it has to override initializeListener(QueryBatcher), a default method in QueryBatchListener that does nothing. The custom listener should override it with the following implementation:

  @Override
  public void initializeListener(QueryBatcher queryBatcher) {
    HostAvailabilityListener hostAvailabilityListener = HostAvailabilityListener.getInstance(queryBatcher);
    if ( hostAvailabilityListener != null ) {
      BatchFailureListener<QueryBatch> retryListener = hostAvailabilityListener.initializeRetryListener(this);
      if( retryListener != null ) onFailure(retryListener);
    }
  }

This method is called on every QueryBatchListener registered with the QueryBatcher when the job starts. Because initializeListener(QueryBatcher) is a default method that does nothing, calling it on listeners that keep the default implementation has no effect. Listeners that have overridden it as shown above initialize the HostAvailabilityListener's RetryListener so that it can handle failover.

HostAvailabilityListener's initializeRetryListener(QueryBatchListener) creates a RetryListener for the QueryBatchListener, if one has not already been created, and returns it. If one has already been created, it returns null. The RetryListener needs to be added to the listener's failure listeners only once. Hence, if initializeRetryListener(QueryBatchListener) returns a RetryListener, we add it to our failure listeners, and if it returns null, we ignore it.
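The "return a listener once, then null" contract can be illustrated with a small stdlib-only model. RetryRegistrySketch and FailureCallback are hypothetical names invented for this sketch; tracking listeners by identity is an assumption, and the real DMSDK implementation may differ.

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

/** Hypothetical model of "create a retry listener once per listener, then return null". */
final class RetryRegistrySketch {
    /** Stand-in for a failure callback; not the DMSDK BatchFailureListener. */
    interface FailureCallback {
        void onFailure(Object batch, Throwable cause);
    }

    // Listeners that already received a retry callback, compared by identity (assumption).
    private final Set<Object> initialized =
        Collections.newSetFromMap(new IdentityHashMap<>());

    /** Returns a new callback the first time a listener is seen, and null afterwards. */
    synchronized FailureCallback initializeRetryListener(Object listener) {
        if (!initialized.add(listener)) {
            return null;   // already created, so the caller must not register it again
        }
        return (batch, cause) ->
            System.out.println("retrying " + batch + " after " + cause);
    }
}
```

A caller that registers the callback only when the result is non-null, as in the initializeListener implementation above, ends up with exactly one retry callback per listener no matter how often initialization runs.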

@vivekmuniyandi
Contributor

Kim,

Sam had a comment on the priority of a section in the document, which I sent you by email. Please take that into account when you document this. Many thanks.

Vivek

@kcoleman-marklogic
Contributor

Thanks, @vivekmuniyandi, I will. Is Sam's comment reflected in the version of your writeup that is shown here in this issue, or should I look for his feedback elsewhere?

I will take up this topic in earnest as soon as we're done with 9.0-3. Thank you for making the details available so promptly.

@jmakeig
Contributor

jmakeig commented Sep 21, 2017

@vivekmuniyandi, this is a great write-up. Really helpful.

Is the failover behavior the default?

HostAvailabilityListener is automatically registered with the QueryBatcher and WriteBatcher when the instances are created.

That seems to imply so. Presumably then this will change the behavior of existing code (albeit generally for the better). Or am I misunderstanding?

@vivekmuniyandi
Contributor

@kcoleman-marklogic The following section of the writeup is low priority. His feedback is in the document I sent you by email. Sorry, I should have changed it.

There are a few configurations which can be set with the HostAvailabilityListener such as minHosts (the minimum number of hosts required for the job to run), suspendTimeForHostUnavailable (the time for which the host which is unavailable should be suspended), hostUnavailableExceptions (the list of Throwable types which indicate that the host is unavailable). They come with a default setting - minHosts set to 1, suspendTimeForHostUnavailable set to 10 minutes, hostUnavailableExceptions set to SocketException, SSLException, UnknownHostException. However, these default settings can be customized according to user needs by the static method available within the HostAvailabilityListener. You can customize it by the following code:
HostAvailabilityListener.getInstance(Batcher)
.withSuspendTimeForHostUnavailable(Duration.ofMinutes(60))
.withMinHosts(2)
This would set the minHosts to 2 and the suspendTimeForHostUnavailable to 60 minutes for the HostAvailabilityListener associated with the Batcher.

Thanks

@kcoleman-marklogic
Contributor

It's fine. I just wanted to be sure I knew where to look. Thanks for the clarification, Vivek.

@vivekmuniyandi
Contributor

Is the failover behavior the default?

@jmakeig Yes, the failover behaviour is default.

this will change the behavior of existing code

I don't understand. I believe this has been the behavior right from the start, and we are just documenting it now. Let me know if I have misunderstood the question. Thanks.

@kcoleman-marklogic
Contributor

@vivekmuniyandi I have a question about this section of your most excellent writeup:

Provided Listeners like ApplyTransformListener

These Listeners are registered with QueryBatcher to do some actions like applying some transform or delete documents. ... there might be scenarios where the batch of URIs is got from the server but when attempting to apply the listener (do a transform or delete) the batch had failed. In order to enable retry processing the listener in that scenario, the listener must have initialized a RetryListener ...

Do ApplyTransformListener and DeleteListener have such a retry listener attached by default, or is it something the user must do?

The ApplyTransformListener javadoc talks only about the empty response case, so that didn't clear it up for me, either.

@kcoleman-marklogic
Contributor

kcoleman-marklogic commented Oct 5, 2017

Actually, @vivekmuniyandi, now that I've given it more thought, it would be very helpful to have an example of how to do what you said with a RetryListener. I looked through the tests and could not find any use of this class. I would like to include an example in the documentation.

For example, if I had a block of code like the following, how would I change to enable retry of the transform in the event of failover?

    batcher.withConsistentSnapshot()
           .onUrisReady(
               new ApplyTransformListener()
                  .withTransform(txform)) 
           .onQueryFailure( exception -> exception.printStackTrace() );

@kcoleman-marklogic
Contributor

@vivekmuniyandi Is it true that custom listeners that do not go through a QueryBatcher do not need to do anything special wrt failover handling? In your writeup, you only mentioned custom listeners invoked via a QueryBatcher.

@vivekmuniyandi
Contributor

Do ApplyTransformListener and DeleteListener have such a retry listener attached by default, or is it something the user must do?

With NoResponseListener, we have two types of RetryListeners.

  1. RetryListener associated with HostAvailabilityListener and
  2. RetryListener associated with NoResponseListener.

For DeleteListener, we have initialized both RetryListeners, so the customer doesn't need to do anything, since deletion is an idempotent operation.

But for ApplyTransformListener, since transforms cannot be assumed to be idempotent, we have initialized 1 and not 2. The empty response case falls under 2: we can't be sure whether the transform has been applied on the server. Hence it is not ideal to retry automatically, and we leave it to the customer to decide how to retry the failed batches.

Please see this comment #813 (comment) for a detailed explanation.

@vivekmuniyandi
Contributor

@vivekmuniyandi Is it true that custom listeners that do not go through a QueryBatcher do not need to do anything special wrt failover handling? In your writeup, you only mentioned custom listeners invoked via a QueryBatcher.

Currently, all the custom listeners we have are associated with the QueryBatcher, and retries only become complicated with the QueryBatcher. With WriteBatcher, we keep retrying until all the documents have been successfully written, since writes are idempotent.

@vivekmuniyandi
Contributor

For example, if I had a block of code like the following, how would I change to enable retry of the transform in the event of failover?

For listeners that perform idempotent operations, like DeleteListener, we have overridden the initializeListener(QueryBatcher) method like this:

  @Override
  public void initializeListener(QueryBatcher queryBatcher) {
    HostAvailabilityListener hostAvailabilityListener = HostAvailabilityListener.getInstance(queryBatcher);
    if ( hostAvailabilityListener != null ) {
      BatchFailureListener<QueryBatch> retryListener = hostAvailabilityListener.initializeRetryListener(this);
      if ( retryListener != null )  onFailure(retryListener);
    }
    NoResponseListener noResponseListener = NoResponseListener.getInstance(queryBatcher);
    if ( noResponseListener != null ) {
      BatchFailureListener<QueryBatch> noResponseRetryListener = noResponseListener.initializeRetryListener(this);
      if ( noResponseRetryListener != null )  onFailure(noResponseRetryListener);
    }
  }

This takes care of all the retries, and the customer need not do anything. If they are implementing a custom listener that performs an idempotent operation, they have to override initializeListener with the implementation shown above, and that takes care of failover.

But in the case of ApplyTransformListener, if the transform is idempotent, we need code something like this:

ApplyTransformListener listener = new ApplyTransformListener()
    .withTransform(transform)
    .withApplyResult(ApplyResult.REPLACE)
    .onSuccess(batch -> success.addAndGet(batch.getItems().length))
    .onSkipped(batch -> skipped.addAndGet(batch.getItems().length));
QueryBatcher batcher = dmManager
    .newQueryBatcher(new StructuredQueryBuilder().collection("XmlTransform"))
    .onUrisReady(listener)
    .withBatchSize(10)
    .withThreadCount(5);
// Attach the NoResponseListener's RetryListener; only safe for idempotent transforms.
NoResponseListener noResponseListener = NoResponseListener.getInstance(batcher);
if (noResponseListener != null) {
  BatchFailureListener<QueryBatch> retryListener = noResponseListener.initializeRetryListener(listener);
  if (retryListener != null) {
    listener.onFailure(retryListener);
  }
}

For non-idempotent transforms, we shouldn't initialize the NoResponseListener's RetryListener as shown above. Instead, the customer should write their own BatchFailureListener that checks whether the batch has been transformed and retries only if it has not. Hope that helps.
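As a rough illustration of such a conditional failure handler, here is a stdlib-only sketch. ConditionalRetrySketch, its alreadyTransformed check, and its retry hook are all hypothetical; in real code the handler would implement the DMSDK BatchFailureListener interface, and how to detect that a batch was already transformed depends entirely on the application.

```java
import java.util.function.Consumer;
import java.util.function.Predicate;

/** Stdlib-only sketch of a conditional retry; every name here is hypothetical. */
final class ConditionalRetrySketch<B> {
    private final Predicate<B> alreadyTransformed; // e.g. a marker lookup in the database
    private final Consumer<B> retry;               // e.g. hands the batch back to the batcher

    ConditionalRetrySketch(Predicate<B> alreadyTransformed, Consumer<B> retry) {
        this.alreadyTransformed = alreadyTransformed;
        this.retry = retry;
    }

    /** Called when applying the transform to a batch failed. */
    void processFailure(B batch, Throwable failure) {
        if (alreadyTransformed.test(batch)) {
            // The transform reached the server; retrying could apply it twice.
            return;
        }
        retry.accept(batch);
    }
}
```

The design choice is to keep the "was it applied?" decision outside the handler, so the same handler can be reused with whatever check suits a particular non-idempotent transform.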

@kcoleman-marklogic
Contributor

OK, let me say this back in my own words to see if I understand what you said with respect to custom listeners.

Case 1: You want to retry the custom operation if there is a failover

Create a class that implements QueryBatchListener and override the initializeListener method to do the following:

  • Obtain the HostAvailabilityListener from the batcher
  • Call HostAvailabilityListener.initializeRetryListener to obtain a RetryListener
  • Register the RetryListener as an onFailure listener for your custom listener
  • Obtain a NoResponseListener from the batcher
  • Call NoResponseListener.initializeRetryListener to obtain a RetryListener
  • Register the RetryListener as an onFailure listener for your custom listener

Case 2: You do not want to retry the custom operation if there is a failover

  • Obtain a NoResponseListener from the batcher
  • Call NoResponseListener.initializeRetryListener to obtain a RetryListener
  • Register the RetryListener as an onFailure listener for your listener

Case 3: You want to conditionally retry

  • Implement your own BatchFailureListener capable of determining whether
    or not to retry a batch
  • Attach an instance of your BatchFailureListener as an onFailure listener
    of your custom listener

To retry a batch, call QueryBatcher.retry from your BatchFailureListener.

@vivekmuniyandi
Contributor

vivekmuniyandi commented Oct 6, 2017

Cases 1 and 3 look good to me. If we don't want to retry, we don't need to do anything. The steps you listed in Case 2 also handle failover scenarios: they cover the situations where we get no response from the server during failover, which is what NoResponseListener is for. So you don't need to do anything if you don't want to retry at all.

@kcoleman-marklogic
Contributor

Perfect. Thank you, @vivekmuniyandi.

@sammefford
Contributor Author

@kcoleman-marklogic can we ship this issue?

@kcoleman-marklogic
Contributor

Yes, I suppose. I am not convinced what we have makes sufficient sense, but I guess we'll just have to see.
