All connections from pool are suddenly closed or never released #446
Update in this issue: I was checking the rest of the logs, and I saw several errors like this some hours before all the connections were suddenly closed.
Could this be the cause of all the connections from the driver suddenly being closed?
Update in this issue: I changed the configuration of the Neo4j database, setting
Update in this issue: Even with the new configuration of
Changing the configuration only seems to delay this error regarding connection acquisition. Please help; I don't know whether I am using the sessions incorrectly. This is an example of how I am using them in every call:
isPortConnected: portId => {
return new Promise((resolve, reject) => {
var neo4jsession = driver.session();
var connectedPortCountPromise = neo4jsession.writeTransaction(tx =>
tx.run(`
MATCH (p:Port)-[:DISTANCE]-()
WHERE p.port_id = ${portId}
RETURN count(p)`
)
);
connectedPortCountPromise.then(connectedPortCount => {
resolve(!(
connectedPortCount &&
Array.isArray(connectedPortCount.records) &&
connectedPortCount.records.length > 0 &&
connectedPortCount.records[0].get("count(p)").isZero()
));
}).catch(e => {
console.error(e);
reject(e);
}).then(() => {
neo4jsession.close();
});
});
},
I am following the docs on how to use the sessions, so I am expecting that the connections are released as soon as I do a session.close().
Hi @elielr01, Your usage of the driver looks correct, so there might be a bug in the driver that does not release the connection properly. It might relate to the
We will keep you updated.
Hi @elielr01, We identified one JS bug that could cause your problem. We will do a patch release soon. It would be great if you could verify whether the fix addresses your issue in the coming release. Cheers,
We have encountered exactly the same problem in our deployment. I have tried deploying 1.7.4 to our containers, but within two hours the same issue came back.
Hi @GlacianNex & @elielr01, Have you tried the latest driver release 1.7.5? Do you still experience the same problem? Thanks.
Hi @ali-ince Unfortunately, we changed our service and stopped using neo4j some months ago, so I'm not sure I can replicate my issue. I did check the patch in the PR, though, and it looked as though it would solve the problem.
We're doing something similar and getting essentially the same results as @elielr01, while using 1.7.5. If we exceed our thread pool limits, we get errors, and understandably so. However, once this event occurs, a chunk of our driver's connection pool is never released. I was digging into the internals to try and understand how a session's lifecycle works, in case we might be missing something. Under connectionHolders, it seems that if there's an error in releasing the connection, the promise is caught and the error is suppressed? Should any errors that occur here get logged, rather than ignored? https://github.com/neo4j/neo4j-javascript-driver/blob/4.0/src/internal/connection-holder.js#L129 (edit: I was on the 1.7 branch, but the error handler is still there)
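For context, the release pattern being described boils down to something like the sketch below. This is a paraphrase of the linked connection-holder code, not the driver source verbatim; method names such as resetAndFlush and _release are taken from the driver internals visible elsewhere in this thread.
// Paraphrased sketch of the release path, not the driver's actual code.
// The point is the empty catch: a failure while resetting or releasing the
// connection is swallowed instead of being logged or surfaced to the caller.
function releaseConnection (connectionPromise) {
  return connectionPromise
    .then(connection =>
      connection
        .resetAndFlush()                      // ask the server to reset the connection
        .catch(() => {})                      // error suppressed here, nothing logged
        .then(() => connection._release())    // hand the connection back to the pool
    )
    .catch(() => {})                          // earlier failures are swallowed too
}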
As another data point we just hit this today also using 1.7.5. As far as we can see it never recovers.
Yeah, likewise; we ended up coding around it (resetting the connection where it enters that state).
@GlacianNex - any chance you can share how you worked around it?
In our case the application is stateless and runs in a Docker container. Once we start experiencing the connection pool issues we just stop the entire application and restart the container. When the container comes back, the problem seems to be resolved. I am REALLY not happy with the solution, but we have spent too much time on this problem internally and this allowed us to move forward. I hope this helps.
@GlacianNex Thanks! That's basically what we did. We recorded the driver's internals to get an idea of when/how often/why we got into this state, and then scheduled rolling deployments to reset the driver state at intervals. Hopefully @ali-ince @zhenlineo and the rest of the team can figure this one out.
Running into this issue as well under heavy load. Driver: 1.7.6
Is there a version of the driver that doesn't have this issue?
This issue is severe enough for our team that we are not going to develop our application based on this driver; rather, we are going to use: :(
@fedevela I'd be wary of that, it's:
I get your frustrations. Our team is struggling to handle recoveries with this driver too, and there don't seem to be any updates from the neo team around this either. We've got a new working solution: our team moved from scheduled redeploys to rotating out the active driver with a new one.
It's a complicated workaround solution, for sure, and it won't prevent threadpool problems. But it does mean we can recover now without a redeploy and stay in a stable state. I'd share the exact code, but it's locked down as proprietary.
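Since the exact code above is proprietary, the following is only an illustrative sketch of the general rotate-the-driver idea, not the commenter's implementation. The URI, credentials, helper names (getDriver, rotateDriver) and the grace period are all assumptions; the trigger for rotation (for example, repeated connection acquisition timeouts) is left to the application.
// Illustrative sketch only (neo4j-driver 1.x style API); not the commenter's code.
const neo4j = require('neo4j-driver').v1

const uri = 'bolt://localhost:7687'                 // assumption: adjust for your deployment
const auth = neo4j.auth.basic('neo4j', 'password')  // assumption

let activeDriver = neo4j.driver(uri, auth)

// Callers ask for the currently active driver instead of holding a reference to one instance.
function getDriver () {
  return activeDriver
}

// When the pool is detected as wedged, swap in a fresh driver and close the old one
// after a grace period so in-flight work can drain.
function rotateDriver () {
  const oldDriver = activeDriver
  activeDriver = neo4j.driver(uri, auth)
  setTimeout(() => oldDriver.close(), 30000) // 30 s grace period is an arbitrary choice
}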
@matthewoden thank you for the insight, you have shed a lot of light on the subject. It is frustrating for sure; the old driver has been working for us without problems, and I had been considering changing to this, the official driver :/ but will not do so for the near future. We have not yet experienced high loads, nor tested them, but the systems have been quite stable for now. Thank you very much again!
Hi guys, Sorry for not getting back on this problem recently. I will have a closer look at this issue tomorrow and will keep you updated. Cheers,
Hi, We've been looking into this issue. Most of our investigation is based on @elielr01's logs and problem description. For his original issue, we suspect that his log is missing a
Here might be what happened in his case (single community server + bolt 1.7.5 driver):
However, as so many others also have the same problem, we would like to ask for more information from everyone who has this issue:
Thanks in advance.
Hi @matthewoden, May I ask if you could help us with this issue by giving us more logs?
When collecting driver logs, you can also set up a periodic logger that prints the driver connection pool using the following code, to help us better determine whether there are any connection leaks:
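(The snippet itself is not preserved in this thread; the sketch below only illustrates the idea of periodically dumping pool state. The internal fields it reads, _connectionProvider, _connectionPool and _activeResourceCounts, are assumptions about driver internals, not a public API, and they differ between driver versions.)
// Hypothetical periodic pool logger; internal field names are assumptions.
setInterval(() => {
  const provider = driver._connectionProvider
  const pool = provider && provider._connectionPool
  if (pool) {
    // Active-connection counts per server address, as plain numbers
    console.log('neo4j pool state:', JSON.stringify(pool._activeResourceCounts))
  }
}, 60000) // once a minute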
We've been looking at @elielr01's logs and code, but unfortunately we cannot find a good reason to fully explain the cause of the problem. So we would like to ask everyone here to give us more information about their running environment and logs. Thanks in advance.
@zhenlineo Fantastic - I'll see what I can do. We currently have a solution that keeps things stable for our users, but I think I understand how we get into this state enough to recreate it on an isolated instance. Edit: I can say that prior to my driver-rotation solution, we never closed the driver. So unless the driver is closing itself...
Ok, so I can recreate the state.
neo4j.conf required to recreate the issue locally:
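(The exact conf lines are not preserved here. As an illustration only, a configuration that shrinks the bolt thread pool to force the problem would look something like the following; the setting names are real Neo4j 3.4/3.5 options, but the values are assumptions chosen purely to make the thread pool trivially exhaustible.)
# Illustration only: not the values from the original comment.
# A tiny bolt thread pool makes the threadpool-exhausted error easy to trigger.
dbms.connector.bolt.thread_pool_min_size=1
dbms.connector.bolt.thread_pool_max_size=2
dbms.connector.bolt.thread_pool_keep_alive=1m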
(Yes, I know this is impossibly small. But the issue starts for us when the threadpool is exceeded, and this makes that happen pretty much immediately.)
Server logs from running neo in console mode, up to the point of the problem. Small redactions for proprietary nonsense.
Snippet of the driver logs on debug right before the error:
At this point, the logs go quiet. Nothing happens. The driver hangs indefinitely. No queries proceed. Now, I can't log verbatim what you posted there. We've got circular references that need cleaning up, and some pretty deep objects. For other folks running these logs, I used the following to do that:
const clean = (obj) => {
let i = 0
return (key, value) => {
if (i !== 0 && typeof obj === 'object' && typeof value === 'object' && obj == value)
return '[Circular]'
// you can't go deeper than 30 nested layers, either.
if (i >= 29) return '[Unknown]'
++i
return value
}
}
const stringify = (obj) => JSON.stringify(obj, clean(obj))
If you need more specific information, let me know what fields. Anyway, now that I can stringify, it produced the following:
That last state is when the driver hangs up - it never changes. I dropped the maxConnectionLifetime to 30 seconds, then let it run for a minute. Here's that event in isolation, for easier consumption.
Digging deeper into that connection on a second pass, without using stringify:
Connection {
id: 2,
address: ServerAddress {
_host: 'localhost',
_resolved: null,
_port: 7687,
_hostPort: 'localhost:7687',
_stringValue: 'localhost:7687'
},
server: { address: 'localhost:7687', version: 'Neo4j/3.4.7' },
creationTimestamp: 1574357130760,
_errorHandler: ConnectionErrorHandler {
_errorCode: 'ServiceUnavailable',
_handleUnavailability: [Function: noOpHandler],
_handleWriteFailure: [Function: noOpHandler]
},
_disableLosslessIntegers: false,
_pendingObservers: [],
_currentObserver: {
onError: [Function: onError],
onCompleted: [Function: NO_OP],
onNext: [Function: NO_OP]
},
_ch: NodeChannel {
id: 2,
_pending: null,
_open: true,
_error: null,
_handleConnectionError: [Function: bound _handleConnectionError],
_handleConnectionTerminated: [Function: bound _handleConnectionTerminated],
_connectionErrorCode: 'ServiceUnavailable',
_conn: [TLSSocket],
onerror: [Function: bound _handleFatalError],
onmessage: [Function]
},
_dechunker: Dechunker {
_currentMessage: [],
_partialChunkHeader: 0,
_state: [Function: AWAITING_CHUNK],
onmessage: [Function],
_chunkSize: 3
},
_chunker: Chunker {
position: 7492,
length: 0,
_bufferSize: 1400,
_ch: [NodeChannel],
_buffer: [NodeBuffer],
_currentChunkStart: 0,
_chunkOpen: false
},
_log: Logger { _level: 'error', _loggerFunction: [Function: logger] },
_dbConnectionId: undefined,
_protocol: BoltProtocol {
_connection: [Circular],
_packer: [Packer],
_unpacker: [Unpacker]
},
_currentFailure: null,
_isBroken: false,
_release: [Function]
}
It seems that when the threadpool is exceeded, it's a service availability error? The connection is left open, and not marked as an error. At this point, I dug into the driver to add some extra logging. In there, I found a noOp handler that just returns the error without running any checks. That doesn't seem great. The connection-holder also has an ... so yeah, I put some logging in the
Well, hot damn. The threadpool error! Now, I had been running on a hot instance. So I shut everything down, and tried to recreate from scratch. Turns out, this error doesn't get returned to the client until the neo server exceeds the threadpool limit 10 times. Edit: Clarity
Hi @elielr01 ,
The bug was found in 3.5 and then fixed in 3.5.6. From your logs, it looks like
I just verified that 3.4 servers all suffer from this bug. I would suggest upgrading to the latest 3.5 server. If you cannot upgrade to the latest 3.5, and you have a support contract with Neo4j, you should consider raising a support case there. Thanks again for the detailed logging that helped us understand this issue!
@zhenlineo Fantastic! I just ran a test upgrade to 3.5.12, and blew straight past the hangup issue and immediately returned an error. (Obviously avoiding the threadpool issues is on us)
@matthewoden We are having neo4j return custom JSON, so we are not running into any issues at all with Neo4j 3.5.12.
@zhenlineo - looks like this can be closed now, can you confirm?
mark |
I have an Express server which does some queries to our Neo4j database. At the very beginning of the server lifetime, I create the neo4j driver singleton as recommended in the docs. After some days of normal use, the driver suddenly closes all the connections of the pool (there is no driver.close(), no reassignment of the driver, nothing of the sort; only getting and closing sessions and writeTransactions in the code).
Neo4j Version: 3.4.12 Community
Neo4j Mode: Single instance
Driver version: JS driver 1.7.1
Operating System of DB: CentOS 7 on GCP
Operating System of Server using neo4j-driver: Debian GNU/Linux 9 (stretch) on GCP
Pre-requisites
Having a Neo4j DB instance running and the server correctly configured (which I might be doing wrong, maybe?) to communicate with the neo4j instance.
Steps to reproduce
Neo4jError: Connection acquisition timed out in 60000 ms.
Expected behavior
All the connections are open during the whole driver's lifecycle (which should be the whole application lifetime according to the docs)
Actual behavior
The connections are all closed and I cannot get more connections from the driver.
This is the driver configuration:
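(The configuration block is not preserved in this text. As an illustration only, a neo4j-driver 1.7 instantiation with the kind of pool and logging options discussed in this thread looks roughly like the sketch below; the address, credentials and values are assumptions, not the reporter's actual settings.)
// Illustration only: values are assumptions, not the reporter's configuration.
const neo4j = require('neo4j-driver').v1

const driver = neo4j.driver(
  'bolt://my-neo4j-host:7687',              // hypothetical address
  neo4j.auth.basic('neo4j', 'secret'),      // hypothetical credentials
  {
    maxConnectionPoolSize: 100,
    connectionAcquisitionTimeout: 60000,    // matches the 60000 ms timeout in the error above
    maxConnectionLifetime: 3600000,         // 1 hour
    logging: {
      level: 'debug',
      logger: (level, message) => console.log(`[${level}] ${message}`)
    }
  }
)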
I also implemented the logger, as suggested in the answers you gave in another issue. Since it was running for a while, it generated a log of about 210 MB. I split it into 25 log files; the first 24 are useless and the last one is where the error can be seen. I'm just quoting the important parts of the log, but I'll also attach the last log file if that helps. The connection closing starts at line 2,700 of the file.
Lastly, I just wanted to say thanks for the nice effort put into this module! I really appreciate the work done here. I just want some help with my issue.
xay.txt