Durable backend for Distributed Data collections #2490


Merged — 23 commits merged into akkadotnet:dev on Apr 7, 2017

Conversation

@Horusiath Horusiath commented Jan 30, 2017

This PR introduces a durable, persistent backend for ddata. It allows users to specify a list of keys whose CRDTs should not only be gossiped among cluster nodes, but also persisted using a durable store. Just like in the JVM case, the default implementation uses LMDB (through the LightningDB driver for .NET).

TODO list - my goal here is to make all multinode tests for ddata pass before this PR gets merged:

  • ReplicatorSpec
  • ReplicatorPrunningSpec
  • ReplicatorChaosSpec
  • JepsenInspiredInsertSpec
  • DurablePrunningSpec
  • DurableDataSpec

@Horusiath Horusiath added the WIP label Jan 30, 2017
@Horusiath Horusiath force-pushed the ddata-durable branch 2 times, most recently from 1735539 to b7689fe on March 3, 2017 at 22:17
using Akka.Serialization;
using LightningDB;

namespace Akka.DistributedData.LightningDB
@Horusiath (author) commented:

The JVM version uses an LMDB backend for storage by default. I think it's better to move that dependency into a separate package, however.

@Horusiath commented Mar 6, 2017:

At this point I've already fixed more than 5 different bugs while trying to make ReplicatorSpec pass. Some of them are critical. Right now I'm struggling to find the next one: for some reason it looks like not all keys get replicated back to the reconnecting node (scenario: disconnect → update while disconnected → reconnect and wait for replicas to converge). The good news is that it's always the same key that seems to be missing during replication (12 of 30).

UPDATE: it turned out that the bug lies in the original JVM implementation. I've already reported it on their tracker.

var n = i;
var keydn = new GCounterKey("D" + n);
_replicator.Tell(Dsl.Update(keydn, GCounter.Empty, WriteLocal.Instance, x => x.Increment(_cluster, n)));
ExpectMsg(new UpdateSuccess(keydn, null));
}
}, _config.First);
@Horusiath (author) commented:

I'm not sure if this is 100% reproducible in every case, but it looks like this pattern triggers a bug in the MNTK (multi-node test kit):

EnterBarrier("after-1");

RunOn(() => {
    // this one gets called
}, firstRole, secondRole);

RunOn(() => {
    // this one never gets called
}, firstRole);

EnterBarrier("after-2");

@Horusiath Horusiath force-pushed the ddata-durable branch 2 times, most recently from a64ca44 to 93663cb on March 9, 2017 at 06:12
@Aaronontheweb (Member) commented:
@Horusiath what will it take to get this PR into a state where we can include the bug fixes in 1.2?

@Horusiath Horusiath changed the title from "[WIP] Durable backend for Distributed Data collections" to "Durable backend for Distributed Data collections" on Apr 6, 2017

EnterBarrier("passThrough-third");

RunOn(() =>
{
_replicator.Tell(Dsl.Get(KeyE, _readMajority));
var c155 = ExpectMsg<GetSuccess>(g => Equals(g.Key, KeyE)).Get(KeyE);
@Horusiath (author) commented:
This is where the spec fails. Basically, what we've done up to this point is:

  1. Establish a 3-node cluster with a replicator instance on each node.
  2. Perform some updates on each node.
  3. Blackhole the 3rd node from the rest of the cluster, making it unreachable.
  4. Perform more CRDT updates on each node.
  5. Let traffic pass through to the previously unreachable 3rd node again.
  6. While the 3rd node is up, the replicators should exchange updates as part of a Get request with read majority and finally converge, but an exception occurs instead, causing node disassociation.

What I've managed to find is that upon marking the 3rd node as reachable again - when all nodes are in the Up state - replicators try to send messages using Context.ActorSelection(Context.Parent.Path.ToStringWithAddress(address)) (Context.Parent is the replicator instance here). But even after I confirmed that the node under that address is acknowledged as up, the message never reaches its target. ResolveOne also throws an exception in this case. I believe this may be a problem in the remoting/cluster layer after an unreachable node becomes reachable again.
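
The failing resolution path described above can be sketched roughly like this (a hedged illustration, not code from the PR; `address` and `message` are assumed local variables, and the snippet requires the Akka NuGet package):

```csharp
// Build a selection for the peer replicator by rewriting this actor's
// path against the remote node's address, as the replicator does.
var selection = Context.ActorSelection(
    Context.Parent.Path.ToStringWithAddress(address));

// Fire-and-forget send: per the report above, after the node transitions
// unreachable -> reachable this message never arrives.
selection.Tell(message);

// Explicit resolution fails too: ResolveOne completes with
// ActorNotFoundException when the selection cannot be resolved in time.
try
{
    IActorRef peer = await selection.ResolveOne(TimeSpan.FromSeconds(3));
}
catch (ActorNotFoundException)
{
    // Resolution failed even though the node is marked Up,
    // pointing at the remoting/cluster layer rather than ddata itself.
}
```

If both the plain Tell and the explicit ResolveOne fail while cluster membership reports the node as Up, that is consistent with the suspicion that the problem sits below Distributed Data, in remoting.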

/cc @Aaronontheweb

Another contributor commented:
I have the same problem in SurviveNetworkInstabilitySpec.

@Aaronontheweb Aaronontheweb merged commit 43f2a6f into akkadotnet:dev Apr 7, 2017
@Aaronontheweb Aaronontheweb added this to the 1.2.0 milestone Apr 7, 2017