Routing not properly working in DC/OS #233

Closed
neurofoo opened this issue Apr 20, 2017 · 13 comments

@neurofoo

Hello all. I was asked in the neo4j-dcos Slack channel to raise this as an issue.

tl;dr routing doesn't appear to work using the dcos neo4j ee universe package.

Below is the longer writeup.

We have an issue that we hope that you might be able to help us solve. The main issue is that the javascript driver's bolt routing doesn't appear to work. It is our understanding that by specifying the scheme "bolt+routing" in the javascript driver we can connect to any node and the driver will take care of discovering the network and selecting which nodes to use for different operations.

We have neo4j-ee running in a Mesosphere DC/OS cluster in Google Cloud.

We used the neo4j Mesosphere Universe package to launch a basic three node cluster (just core nodes). The neo4j cluster is using the standard private dcos network (9.0.0.0/8).

To confirm that we are running the EE version, we can check the version:

http://localhost:7474/db/manage/server/version
{
  "edition" : "enterprise",
  "version" : "3.1.2"
}

The cluster also looks like it was installed and booted correctly. The output logs show that each node found all the other nodes. E.g.,

<snip>
Discovering cluster with initial members: [9.0.3.130:5000, 9.0.5.130:5000, 9.0.4.130:5000]
<snip>

We have a docker node container in which we are running a few tests. In the container we load the neo4j js drivers as:

$ npm install neo4j-driver@next

(We have also tried the neo4j-driver and neo4j-driver@latest)

We have a simple node app for checking connectivity (write/read):

var neo4j = require('neo4j-driver').v1;
const neo4j_user = "a_user"
const neo4j_pass = "a_pass"
const neo4j_ip = "9.0.3.130"
var driver = neo4j.driver(`bolt+routing://${neo4j_ip}`, neo4j.auth.basic(neo4j_user, neo4j_pass))
var session = driver.session()

session
  .run( "CREATE (a:Person {name:'Arthur', title:'King'})" )
  .then( function()
  {
    return session.run( "MATCH (a:Person) WHERE a.name = 'Arthur' RETURN a.name AS name, a.title AS title" )
  })
  .then( function( result ) {
    console.log( result.records[0].get("title") + " " + result.records[0].get("name") );
    session.close();
    driver.close();
  }).catch(function(err){
    session.close();
    driver.close();
    console.log(err);
  });

This works so long as the ip specified is that for the leader. If we give it an ip for one of the followers, we get the following error message:

{ Error: Could not perform discovery. No routing servers available.
    at new Neo4jError (/home/node/node_modules/neo4j-driver/lib/v1/error.js:67:132)
    at newError (/home/node/node_modules/neo4j-driver/lib/v1/error.js:57:10)
    at /home/node/node_modules/neo4j-driver/lib/v1/internal/connection-providers.js:222:35
    at process._tickDomainCallback (internal/process/next_tick.js:135:7) code: 'ServiceUnavailable' }

If we use just the 'bolt' scheme instead of 'bolt+routing', we can't write to non-leader nodes and receive a 'not a leader' error message.

As a sanity check, we checked to make sure that all the nodes have route roles. They do:

CALL dbms.cluster.routing.getServers()
[
addresses	[9.0.3.130:7687]
role	WRITE
,
addresses	[9.0.4.130:7687, 9.0.5.130:7687]
role	READ
,
addresses	[9.0.4.130:7687, 9.0.3.130:7687, 9.0.5.130:7687]
role	ROUTE
]
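
For readers unfamiliar with the three roles, here is a minimal sketch of how a client could group that output into routers, readers and writers. The function name `buildRoutingTable` and the input shape (an array of `{ role, addresses }` objects) are assumptions made for illustration; this is not the driver's internal code.

```javascript
// Hypothetical helper: groups getServers()-style entries by role.
// Assumed input shape: [{ role: 'WRITE', addresses: ['host:port', ...] }, ...]
function buildRoutingTable(servers) {
  const table = { routers: [], readers: [], writers: [] };
  const keyByRole = { ROUTE: 'routers', READ: 'readers', WRITE: 'writers' };
  for (const server of servers) {
    const key = keyByRole[server.role];
    if (!key) continue; // skip roles this sketch does not know about
    for (const address of server.addresses) {
      // keep each address once per role
      if (!table[key].includes(address)) table[key].push(address);
    }
  }
  return table;
}

// Values copied from the output above.
const table = buildRoutingTable([
  { role: 'WRITE', addresses: ['9.0.3.130:7687'] },
  { role: 'READ', addresses: ['9.0.4.130:7687', '9.0.5.130:7687'] },
  { role: 'ROUTE', addresses: ['9.0.4.130:7687', '9.0.3.130:7687', '9.0.5.130:7687'] },
]);
console.log(table.writers); // [ '9.0.3.130:7687' ]
```

With a table like this, writes would go to the single writer and any of the three routers could serve as an entry point, which is the behaviour expected from bolt+routing.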

We also checked to make sure that the general cluster routes were correct (from inside the neo4j running containers):

$ dig core-neo4j.marathon.containerip.dcos.thisdcos.directory
<snip>
;; ANSWER SECTION:
core-neo4j.marathon.containerip.dcos.thisdcos.directory. 5 IN A	9.0.5.130
core-neo4j.marathon.containerip.dcos.thisdcos.directory. 5 IN A	9.0.4.130
core-neo4j.marathon.containerip.dcos.thisdcos.directory. 5 IN A	9.0.3.130
<snip>

We also ran the above node app during an interactive session to look at the routing tables of the driver/session.

When we use the leader ip, we get:

> session._writeConnectionHolder._connectionProvider._routingTable.routers
RoundRobinArray { _items: [ '9.0.3.130' ], _offset: 0 }

When we use either of the follower ips, we get:

> session._writeConnectionHolder._connectionProvider._routingTable.routers
RoundRobinArray { _items: [ '9.0.4.130' ], _offset: 0 }

and the ip just changes based upon the one that we used. It appears that the routing table hasn't been properly loaded.

So, at this point we are a little stuck and not sure what the issue is. Everything appears to be configured correctly, but no dice on connecting to a non-leader node and using the routing, which is a feature we would love to use.

@lutovich
Contributor

Hello @neurofoo, thanks for reporting this problem.

The attached code works fine for me against a local cluster. I'm able to execute both queries, and the routing table looks properly updated.

You are right about the bolt+routing scheme. It should be possible to specify any core member URI there, and the driver should figure things out. The ServiceUnavailable error that you observe most likely means that the driver was not able to connect to the specified seed router URI. The observed routing table is thus correct: it contains the seed URI as the single known router.

Unfortunately, we do not have debug logging in the JS driver right now. Are you able to modify the driver's source code in node_modules? Could you please add a couple of printouts:

I hope with this logging we can get more insight into what is wrong.

@neurofoo
Author

Hello @lutovich,

Here's the console output:

{ Error: Connection was closed by server
    at new Neo4jError (/home/node/node_modules/neo4j-driver/lib/v1/error.js:67:132)
    at newError (/home/node/node_modules/neo4j-driver/lib/v1/error.js:57:10)
    at NodeChannel._handleConnectionTerminated (/home/node/node_modules/neo4j-driver/lib/v1/internal/ch-node.js:330:41)
    at emitNone (events.js:91:20)
    at TLSSocket.emit (events.js:188:7)
    at endReadableNT (_stream_readable.js:975:12)
    at _combinedTickCallback (internal/process/next_tick.js:80:11)
    at process._tickCallback (internal/process/next_tick.js:104:9) code: 'SessionExpired' }
[ '9.0.4.130' ]
{ Error: Could not perform discovery. No routing servers available.
    at new Neo4jError (/home/node/node_modules/neo4j-driver/lib/v1/error.js:67:132)
    at newError (/home/node/node_modules/neo4j-driver/lib/v1/error.js:57:10)
    at /home/node/node_modules/neo4j-driver/lib/v1/internal/connection-providers.js:221:35
    at process._tickCallback (internal/process/next_tick.js:109:7) code: 'ServiceUnavailable' }

@neurofoo
Author

@lutovich I don't think I specified this above: the DC/OS cluster is 1.9 EE. Is your local cluster 1.8 or 1.9?

@lutovich
Contributor

lutovich commented Apr 21, 2017

@neurofoo so it looks like DNS resolution did not do anything wrong with the provided IP address, and the connection was simply closed by the server. Could you please attach the neo4j.log and debug.log of the Neo4j database you are trying to connect to? I'm interested in the logs for the period when these errors happen; hopefully the database recorded the reason for closing the connection.

Sorry, I do not understand your question about versions. My local cluster is 3.1.2 EE, JS driver 1.2.0, node v6.7.0.

Update: now I understand what you mean by 1.8 and 1.9. I did not use DC/OS locally, I just started 3 separate processes. I have never used DC/OS before, actually. Do you have a script or tool to set up such a DC/OS & Neo4j cluster locally?

One more thing you could try is to turn off encryption. This can be done like:

neo4j.driver('...', neo4j.auth.basic('...', '...'),  {encrypted: 'ENCRYPTION_OFF'})

@neurofoo
Author

@lutovich turning encryption off didn't work

I haven't installed dcos locally in quite a while, but this guide should work: https://dcos.io/docs/1.9/installing/local/

I'm working with unterstein on the slack #neo4j-dcos channel right now.

Looks like it might be related to having upgraded from dcos 1.8 to 1.9.

Working through a few issues. Will post back here with updates.

@neurofoo
Author

@lutovich I can confirm that on a new DC/OS cluster with a clean installation of neo4j from the Mesosphere universe, the driver appears to work as expected. But, I haven't finished all tests.

I'm investigating what differences there are between the new cluster (1.9; no upgrades) and the others that were upgraded from 1.8 to 1.9 that had the driver issues.

@lutovich
Contributor

@neurofoo thanks for the update!

@neurofoo
Author

neurofoo commented Apr 21, 2017

@lutovich welcome.

it looks like the problem was due to user roles.

In the examples that produced the errors above, I was using a user that had a publisher role.

If I switch to an admin user, then I don't have the issues.

I would have expected, based upon this: https://neo4j.com/docs/operations-manual/current/security/authentication-authorization/native-user-role-management/native-roles/

that I ought to be able to use a user with the publisher role, which seems to be best practice. I don't want my webapps running around with admin privileges.

Can you confirm that if you use a user with just a publisher role that you get the errors that I describe?

@neurofoo
Author

@lutovich okay. I think I have the source of the error and this issue can be closed.

The original user I was using had the publisher role, but that user was not properly propagated through the cluster (it only showed up on the leader). So, when I tried connecting to a non-leader, I got the errors described above.

However, those errors are pretty cryptic because there was no indication that the user didn't exist on the node.

Perhaps the error message could say that the user wasn't found?
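
One way to spot this kind of mismatch is to try the same credentials against every core member over plain bolt (no routing). The sketch below is only an illustration under stated assumptions: `checkCredentialsOnEachMember` is a hypothetical helper, the driver's v1 API is passed in as the `neo4j` argument, and the exact error a member returns for an unknown user depends on the driver version.

```javascript
// Hypothetical helper: runs a trivial query on each member with the same
// credentials, so a user that is missing on one member shows up as a failure.
// `neo4j` is assumed to be require('neo4j-driver').v1, passed in by the caller.
function checkCredentialsOnEachMember(neo4j, hosts, user, password) {
  const checks = hosts.map(function (host) {
    // Plain bolt:// pins the connection to this one member (no routing).
    const driver = neo4j.driver('bolt://' + host, neo4j.auth.basic(user, password));
    const session = driver.session();
    return session
      .run('RETURN 1')
      .then(function () {
        return { host: host, ok: true };
      })
      .catch(function (err) {
        return { host: host, ok: false, code: err.code, message: err.message };
      })
      .then(function (outcome) {
        // Always release resources, whether the query succeeded or failed.
        session.close();
        driver.close();
        return outcome;
      });
  });
  return Promise.all(checks);
}

// Example call (commented out because it needs the driver and a live cluster):
// var neo4j = require('neo4j-driver').v1;
// checkCredentialsOnEachMember(neo4j, ['9.0.3.130', '9.0.4.130', '9.0.5.130'], 'a_user', 'a_pass')
//   .then(console.log);
```

A member that reports `ok: false` while the leader reports `ok: true` would point at the credentials on that member rather than at routing.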

@lutovich
Contributor

@neurofoo yes, I can reproduce the same issue when using the publisher role. Unfortunately, users are currently not replicated in a causal cluster. This is definitely on our list of priorities.

This is the error I see in security.log on followers:

2017-04-21 17:40:39.071+0000 ERROR [AsyncLog @ 2017-04-21 17:40:39.071+0000]  [user-publisher]: failed to log in: invalid principal or credentials

You are right, error messages should be much better. I'll keep this issue open until we decide how to improve error messages. Thanks a lot for tracking this issue this far!

@neurofoo
Author

@lutovich fantastic! many thanks for the confirmation!

Might I also suggest noting this in the docs (http://neo4j.com/docs/operations-manual/current/clustering/causal-clustering/) as a giant !!NB!! for users.

Again, thanks!

@lutovich
Contributor

@neurofoo it is mentioned in the manual here but it is not really giant :)

Anyway, a good exception message would be really helpful here. Thank you!

@lutovich
Contributor

Hi @neurofoo,

Authentication error propagation was released with the 1.3.0 driver.
I'll close this issue.
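
As a rough sketch of how calling code might tell the failure modes in this thread apart: `hintForError` below is a hypothetical helper, not part of the driver. The 'ServiceUnavailable' and 'SessionExpired' codes are taken from the errors quoted above; treating 'Neo.ClientError.Security.Unauthorized' as the code for rejected credentials is an assumption about what the server reports once authentication errors are propagated.

```javascript
// Hypothetical helper: maps an error's `code` to a troubleshooting hint.
function hintForError(err) {
  const code = err && err.code;
  switch (code) {
    case 'Neo.ClientError.Security.Unauthorized': // assumed auth-failure code
      return 'Credentials rejected: check that the user exists on this cluster member.';
    case 'ServiceUnavailable':
      return 'No router reachable: check the seed address and the credentials on that member.';
    case 'SessionExpired':
      return 'Connection closed by the server: check security.log on that member.';
    default:
      return 'Unrecognized error code: ' + String(code);
  }
}

// Example use in a catch handler:
// session.run('RETURN 1').catch(function (err) { console.log(hintForError(err)); });
```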
