Insert samples using the native client #7013
Conversation
Force-pushed from 9acd3a6 to 0e03d95 (compare)
- Remove unneeded and error-prone `n_columns` and `n_rows` fields from the `Block` type. These are now methods that delegate to the actual column arrays, which are the source of truth anyway (see the sketch below).
- Add methods for extracting fields from a sample as a set of data blocks, one per field table.
- Add methods to extract measurements from a sample as a block destined for one table. Take care to correctly construct missing samples, especially for histograms. Also implement the `FromBlock` trait for samples, to ensure we can extract the raw data as well. This is only used in tests, in this commit.
- Insert fields and measurements from a sample as a data block, using the native interface.
- Fix serde of UUIDs in native format, which doesn't match the documentation, at least for our current version of ClickHouse.
- Remove code serializing fields and measurements into JSON.
- Closes #6884
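For context, here is a minimal sketch (not the actual `oximeter` db code) of the pattern described in the first bullet, where row and column counts are derived from the column data rather than stored as separate fields:

```rust
use indexmap::IndexMap;

/// Stand-in for the real column type; only its length matters in this sketch.
pub struct Column(Vec<u64>);

impl Column {
    fn len(&self) -> usize {
        self.0.len()
    }
}

/// A data block whose counts are derived from the columns themselves.
pub struct Block {
    /// Columns by name; the column data is the single source of truth.
    pub columns: IndexMap<String, Column>,
}

impl Block {
    /// Number of columns in the block.
    pub fn n_columns(&self) -> usize {
        self.columns.len()
    }

    /// Number of rows, taken from any column (all columns must have the same
    /// length), or zero for an empty block.
    pub fn n_rows(&self) -> usize {
        self.columns.values().next().map(Column::len).unwrap_or(0)
    }
}
```

With no stored counts there is nothing to keep in sync, so the two kinds of inconsistency the old fields allowed simply can't be represented.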
Force-pushed from 0e03d95 to 14fe026 (compare)
Interesting. The test failure here is the integration test that tries to insert / select various data on a replicated cluster, while killing and restarting both data replicas and keeper nodes. I've run this test locally a few times, and it seems to work sometimes, but more often than not hangs here:
That shows we've killed the first keeper successfully, and also killed the second keeper. I can confirm that at least one of the PIDs is indeed gone, but
That process doesn't seem to be doing anything though. The log file has messages from a few minutes ago:
What does appear to be moving is the other keeper. I see a spew of logs like this:
It looks like keeper-3 is still trying to send messages to one of the other keepers, which I think is now dead. In the CI run that failed, these errors start appearing around here. My local run that's stalled seems to be trying to connect to
I'm not sure if it's relevant, but the last thing we're attempting to do in the test itself is insert some samples. I don't know why this would have changed when we start inserting data with the native client.
Also potentially interesting is that this wayward keeper doesn't seem to shut down when I interrupt the stalled test locally. It's still trying its best to make a connection to
I know very little about the Raft protocol, but I do see this error in the keeper that refuses to die:
It seems like we're trying to stop the cluster leader, which might be a problem. I find it unlikely that we've never tried to stop the leader, given there are only 3 nodes, but maybe something about the native connections we're using makes this more obvious. The second thing I see that's interesting is this doc page from ClickHouse. That describes manually recovering a Raft quorum, which might be needed in some cases. In particular:
This seems to be exactly what we're doing (intentionally) -- killing 2 of the 3 nodes to destroy the quorum. I imagine the "without any chance of starting them again" bit of that doc is important, but it seems possible we're somehow in this dead state.
Ok, well this part probably explains why the failure occurred after changing to the native client for inserting data:
omicron/oximeter/db/tests/integration_test.rs, lines 325 to 345, at 5660966
We're setting a shorter timeout on the insertion request, but that only applies to the HTTP client itself. The TCP client does not have any explicit timeouts set at all, so the insertion request should just hang forever, since both sides of the connection are still alive. I'll see what happens if I set a shorter timeout on the use of the native connection.
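For reference, a minimal sketch of what a client-side deadline on the native insert could look like; `NativeClient` and `insert_samples` are illustrative names here, not the actual oximeter API:

```rust
use std::time::Duration;
use tokio::time::timeout;

/// Hypothetical sketch: wrap an insert over the native connection in a
/// client-side deadline, so a quorum-less cluster can't hang the caller forever.
async fn insert_with_deadline(
    client: &mut NativeClient,
    samples: &[Sample],
    deadline: Duration,
) -> anyhow::Result<()> {
    match timeout(deadline, client.insert_samples(samples)).await {
        // The insert completed (successfully or not) before the deadline.
        Ok(result) => result.map_err(Into::into),
        // The deadline elapsed: the TCP connection is still alive on both
        // sides, so without this the request would simply never return.
        Err(_elapsed) => anyhow::bail!("insert timed out after {deadline:?}"),
    }
}
```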
Using the shorter request timeout of 2s does indeed cause the tests to pass pretty reliably. I need to think a bit more about whether that's really what we want to do, but some kind of timeout on the connections seems useful. It's pretty heavy-handed to set a
Stopping the leader is totally fine. In this case though, it looks like the leader is the one that wouldn't stop, and a protocol optimization has determined that the leader doesn't yield in 2-node clusters.
We definitely do not need to do this. This is for the case when quorum is permanently gone. Think 2 out of 3 disks catch fire. In that case we essentially promote the remaining node into a single node cluster. This virtually guarantees data loss, but is the only thing you can do in such a case.
We are intentionally losing keeper quorum temporarily to test that we can still perform queries from ClickHouse, as queries don't require keeper operation. We are also checking that we cannot insert when we don't have quorum, as the cluster becomes read-only in this case. Later in the test we bring a keeper back up, restoring a 2-node quorum, to ensure we can write again.
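A rough sketch of the flow described above (helper names like `stop_keeper` and `insert_samples` are illustrative, not the actual integration-test API):

```rust
/// Illustrative sketch of the quorum-loss portion of the test.
async fn check_readonly_without_quorum(
    deployment: &mut Deployment,
    client: &Client,
    samples: &[Sample],
) -> anyhow::Result<()> {
    // Intentionally destroy keeper quorum by stopping 2 of the 3 keepers.
    deployment.stop_keeper(1).await?;
    deployment.stop_keeper(2).await?;

    // Reads don't need the keepers, so selecting existing data still works.
    client.select_all_samples().await?;

    // Writes do need quorum: the replicas become read-only, and the only
    // client-visible failure is a timeout, since the servers keep retrying
    // the keepers in the background.
    assert!(client.insert_samples(samples).await.is_err());

    // Bring one keeper back to restore quorum; inserts should succeed again.
    deployment.start_keeper(2).await?;
    client.insert_samples(samples).await?;
    Ok(())
}
```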
Thanks @andrewjstone, that's very helpful. It makes sense that the manual process is for a cluster that can never recover on its own, though that wasn't really clear from the docs themselves.
Ok, so this test checks that, with no quorum, we can still read data, but cannot write it? I'm a little surprised the replica doesn't time out or fail the insertion itself, and that we need to rely on a client-side timeout. I'm a bit loath to do that, since the only way I can see is the blunt instrument that is
This is just a short timeout for the test to make it go faster. We are checking to see if we fail to insert when we don't have quorum. It's a negative test, in which I was looking for an error. The only way to get an error without quorum is to have the client time out. If we were directly making a request to the keeper cluster, we'd fail with a timeout or immediately (depending upon the implementation) when there is no leader. But we aren't doing that. We are talking to the ClickHouse servers, which are still up and talk to the keepers in the background. My guess is that they presume errors like this to be transient (due to leader rotation or yield) or due to leader failure, and don't want to immediately fail client requests because of that by default. There may be a setting we can add to make the request fail immediately in this case. That would be my preference as well, frankly.
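If such a setting exists, it would presumably be attached to the insert statement itself. A hedged sketch, assuming `insert_keeper_max_retries` is the relevant ClickHouse setting (that name would need to be verified against the server version we actually run):

```rust
/// Hedged sketch: build the INSERT statement with a per-query SETTINGS clause
/// that disables keeper retries, so the insert fails fast without quorum.
/// The setting name is an assumption and should be checked against the
/// ClickHouse documentation for the version in use.
fn insert_statement(table: &str) -> String {
    format!("INSERT INTO {table} SETTINGS insert_keeper_max_retries = 0 FORMAT Native")
}
```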
I'm looking back at the settings now, to see if there's one we can apply to inserts that occur without a keeper quorum. @andrewjstone set me straight in chat, and I wanted to record my understanding here. I was under the impression that the keeper was somehow acting differently in the original PR commit (which did not have a query timeout) than it does on That's exactly the same thing that the second commit in this PR does -- apply the same timeout to the interactions via the native client too. When that insertion times out, the test code moves on to the next step, which is bringing the quorum back. The keepers are thus happier, and things continue.
Looks good!