Insert samples using the native client #7013

Merged
3 commits merged into main from insert-samples-with-native-client on Nov 10, 2024

Conversation

@bnaecker (Collaborator) commented Nov 7, 2024

  • Remove unneeded and error-prone n_columns and n_rows fields from the Block type. These are now methods that delegate to the actual column arrays, which are the source of truth anyway (a rough sketch of this shape follows the list).
  • Add methods for extracting fields from a sample as a set of data blocks, one per field table.
  • Add methods to extract measurements from a sample as a block destined for one table. Take care to correctly construct missing samples, especially for histograms. Also implement the FromBlock trait for samples, to ensure we can extract the raw data as well. This is only used in tests, in this commit.
  • Insert fields and measurements from a sample as a data block, using the native interface.
  • Fix serde of UUIDs in native format, which doesn't match the documentation, at least for our current version of ClickHouse.
  • Remove code serializing fields and measurements into JSON.
  • Closes #6884 (Use native ClickHouse protocol to insert samples)
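
A rough sketch of the Block change mentioned in the first bullet, to make the shape concrete. This is a simplified, hypothetical version of the type, not the actual oximeter-db definitions:

// Hypothetical, simplified Block: row and column counts are derived from the
// column data itself instead of being stored (and kept in sync) as separate
// fields.
use std::collections::BTreeMap;

pub enum Column {
    UInt64(Vec<u64>),
    String(Vec<String>),
    // ... other ClickHouse column types elided
}

impl Column {
    fn len(&self) -> usize {
        match self {
            Column::UInt64(values) => values.len(),
            Column::String(values) => values.len(),
        }
    }
}

pub struct Block {
    // The columns themselves are the single source of truth.
    columns: BTreeMap<String, Column>,
}

impl Block {
    /// Number of columns, delegated to the column map.
    pub fn n_columns(&self) -> usize {
        self.columns.len()
    }

    /// Number of rows, delegated to any one column; all columns are required
    /// to have the same length.
    pub fn n_rows(&self) -> usize {
        self.columns.values().next().map(Column::len).unwrap_or(0)
    }
}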

@bnaecker force-pushed the insert-samples-with-native-client branch 2 times, most recently from 9acd3a6 to 0e03d95 on November 8, 2024 00:34

@bnaecker commented Nov 8, 2024

Interesting. The test failure here is the integration test that tries to insert / select various data on a replicated cluster, while killing and restarting both data replicas and keeper nodes. I've run this test locally a few times; it works sometimes, but more often than not it hangs here:

bnaecker@flint : ~/file-cabinet/oxide/omicron/oximeter/db $ cargo nextest run test_cluster --nocapture
info: experimental features enabled: setup-scripts
    Finished `test` profile [unoptimized + debuginfo] target(s) in 0.30s
------------
 Nextest run ID 00ce8fb1-176a-4d65-8d5f-5b5b9e4b0119 with nextest profile: default
    Starting 1 test across 3 binaries (229 tests skipped)
       START             oximeter-db::integration_test test_cluster

running 1 test
log file: /var/folders/67/2tlym22x1r3d2kwbh84j298w0000gn/T/integration_test-f999b8c0e7b2a5bc-test_cluster.9889.0.log
note: configured to log to "/var/folders/67/2tlym22x1r3d2kwbh84j298w0000gn/T/integration_test-f999b8c0e7b2a5bc-test_cluster.9889.0.log"
Deploying keeper: /var/folders/67/2tlym22x1r3d2kwbh84j298w0000gn/T/integration_test-f999b8c0e7b2a5bc-test_cluster.9889.1-oximeter-clickward-test/keeper-1
Deploying keeper: /var/folders/67/2tlym22x1r3d2kwbh84j298w0000gn/T/integration_test-f999b8c0e7b2a5bc-test_cluster.9889.1-oximeter-clickward-test/keeper-3
Deploying keeper: /var/folders/67/2tlym22x1r3d2kwbh84j298w0000gn/T/integration_test-f999b8c0e7b2a5bc-test_cluster.9889.1-oximeter-clickward-test/keeper-2
Deploying clickhouse server: /var/folders/67/2tlym22x1r3d2kwbh84j298w0000gn/T/integration_test-f999b8c0e7b2a5bc-test_cluster.9889.1-oximeter-clickward-test/clickhouse-1
Deploying clickhouse server: /var/folders/67/2tlym22x1r3d2kwbh84j298w0000gn/T/integration_test-f999b8c0e7b2a5bc-test_cluster.9889.1-oximeter-clickward-test/clickhouse-2
Updating config to include new replica: 3
Deploying clickhouse server: /var/folders/67/2tlym22x1r3d2kwbh84j298w0000gn/T/integration_test-f999b8c0e7b2a5bc-test_cluster.9889.1-oximeter-clickward-test/clickhouse-3
Stopping clickhouse server clickhouse-1: pid - 9902, child pid - 9906
Deploying clickhouse server: /var/folders/67/2tlym22x1r3d2kwbh84j298w0000gn/T/integration_test-f999b8c0e7b2a5bc-test_cluster.9889.1-oximeter-clickward-test/clickhouse-1
Updating config to remove keeper: 2
Stopping keeper: /var/folders/67/2tlym22x1r3d2kwbh84j298w0000gn/T/integration_test-f999b8c0e7b2a5bc-test_cluster.9889.1-oximeter-clickward-test/keeper-2 at pid 9899
Stopping keeper: /var/folders/67/2tlym22x1r3d2kwbh84j298w0000gn/T/integration_test-f999b8c0e7b2a5bc-test_cluster.9889.1-oximeter-clickward-test/keeper-1 at pid 9892
        SLOW [> 60.000s] oximeter-db::integration_test test_cluster
test test_cluster has been running for over 60 seconds
        SLOW [>120.000s] oximeter-db::integration_test test_cluster
        SLOW [>180.000s] oximeter-db::integration_test test_cluster
        SLOW [>240.000s] oximeter-db::integration_test test_cluster

That shows we've killed the first keeper successfully, and also killed the second keeper. I can confirm that at least one of the PIDs is indeed gone, but keeper-1 still appears alive (PID 9989):

bnaecker@flint : ~/file-cabinet/oxide/omicron/oximeter/db $ pgrep clickhouse
9895
9905
9907
9944
9945
9989
9990

That process doesn't seem to be doing anything though. The log file has messages from a few minutes ago:

bnaecker@flint : ~/file-cabinet/oxide/omicron/oximeter/db $ tail /var/folders/67/2tlym22x1r3d2kwbh84j298w0000gn/T/integration_test-f999b8c0e7b2a5bc-test_cluster.9889.1-oximeter-clickward-test/keeper-2/logs/clickhouse-keeper.log
2024.11.07 20:05:14.808814 [ 18720048 ] {} <Trace> RaftInstance: commit_cv_ notify (local thread)
2024.11.07 20:05:14.808817 [ 18720048 ] {} <Trace> RaftInstance: cancel existing timer
2024.11.07 20:05:14.808821 [ 18720048 ] {} <Trace> RaftInstance: re-schedule election timer
2024.11.07 20:05:14.808823 [ 18720052 ] {} <Trace> RaftInstance: commit_cv_ wake up
2024.11.07 20:05:14.808826 [ 18720048 ] {} <Trace> RaftInstance: batch size hint: 0 bytes
2024.11.07 20:05:14.808834 [ 18720048 ] {} <Debug> RaftInstance: Response back a append_entries_response message to 3 with Accepted=1, Term=1, NextIndex=626
2024.11.07 20:05:14.808828 [ 18720052 ] {} <Debug> RaftInstance: commit upto 625, current idx 624
2024.11.07 20:05:14.808854 [ 18720052 ] {} <Trace> RaftInstance: commit upto 625, current idx 625
2024.11.07 20:05:14.808863 [ 18720052 ] {} <Debug> RaftInstance: DONE: commit upto 625, current idx 625
2024.11.07 20:05:14.808866 [ 18720052 ] {} <Trace> RaftInstance: commit_cv_ sleep
bnaecker@flint : ~/file-cabinet/oxide/omicron/oximeter/db $ date
Thu Nov  7 20:12:12 PST 2024

What does appear to be moving is the other keeper. I see a spew of logs like this:

bnaecker@flint : ~/file-cabinet/oxide/omicron/oximeter/db $ tail -f /var/folders/67/2tlym22x1r3d2kwbh84j298w0000gn/T/integration_test-f999b8c0e7b2a5bc-test_cluster.9889.1-oximeter-clickward-test/keeper-3/logs/clickhouse-keeper.log
2024.11.07 20:12:41.279210 [ 18720275 ] {} <Error> RaftInstance: 1 nodes (out of 2, 2 including learners) are not responding longer than 10000 ms, at least 2 nodes (including leader) should be alive to proceed commit
2024.11.07 20:12:41.279471 [ 18720275 ] {} <Warning> RaftInstance: we cannot yield the leadership of this small cluster
2024.11.07 20:12:41.279589 [ 18720275 ] {} <Debug> RaftInstance: heartbeat timeout for 1
2024.11.07 20:12:41.279632 [ 18720275 ] {} <Trace> RaftInstance: (update) new target priority: 1
2024.11.07 20:12:41.279708 [ 18720275 ] {} <Debug> RaftInstance: reset RPC client for peer 1
2024.11.07 20:12:41.280211 [ 18720275 ] {} <Trace> RaftInstance: asio client created: 0x14ed8fc18
2024.11.07 20:12:41.280302 [ 18720275 ] {} <Trace> RaftInstance: 0x14ed8fc18 reconnect peer 1
2024.11.07 20:12:41.280368 [ 18720275 ] {} <Trace> RaftInstance: new rpc for peer 1 is created, reset next log idx (685) and matched log idx (0)
2024.11.07 20:12:41.280403 [ 18720275 ] {} <Trace> RaftInstance: send request to 1
2024.11.07 20:12:41.280434 [ 18720275 ] {} <Trace> RaftInstance: last_log_idx: 684, starting_idx: 1, cur_nxt_idx: 685
2024.11.07 20:12:41.280500 [ 18720275 ] {} <Debug> RaftInstance: append_entries for 1 with LastLogIndex=684, LastLogTerm=1, EntriesLength=0, CommitIndex=682, Term=1, peer_last_sent_idx 685
2024.11.07 20:12:41.280534 [ 18720275 ] {} <Trace> RaftInstance: EMPTY PAYLOAD
2024.11.07 20:12:41.280585 [ 18720275 ] {} <Trace> RaftInstance: send req 3 -> 1, type append_entries_request
2024.11.07 20:12:41.280647 [ 18720275 ] {} <Debug> RaftInstance: socket 0x14ed8fc18 to ::1:19101 is not opened yet
2024.11.07 20:12:41.280747 [ 18720275 ] {} <Trace> RaftInstance: sent
2024.11.07 20:12:41.281509 [ 18720267 ] {} <Trace> RaftInstance: resp of req 3 -> 1, type append_entries_request, failed to connect to peer 1, ::1:19101, error 61, Connection refused
2024.11.07 20:12:41.281605 [ 18720267 ] {} <Trace> RaftInstance: asio client destroyed: 0x14ed8fc18
2024.11.07 20:12:41.438169 [ 18720619 ] {} <Information> KeeperDispatcher: Found dead session 5, will try to close it
2024.11.07 20:12:41.438424 [ 18720619 ] {} <Information> KeeperDispatcher: Dead session close request pushed
2024.11.07 20:12:41.438508 [ 18720619 ] {} <Information> KeeperDispatcher: Found dead session 4, will try to close it
2024.11.07 20:12:41.438588 [ 18720619 ] {} <Information> KeeperDispatcher: Dead session close request pushed
2024.11.07 20:12:41.438686 [ 18720619 ] {} <Information> KeeperDispatcher: Found dead session 7, will try to close it
2024.11.07 20:12:41.438741 [ 18720619 ] {} <Information> KeeperDispatcher: Dead session close request pushed
2024.11.07 20:12:41.438785 [ 18720619 ] {} <Information> KeeperDispatcher: Found dead session 6, will try to close it
2024.11.07 20:12:41.438830 [ 18720619 ] {} <Information> KeeperDispatcher: Dead session close request pushed
2024.11.07 20:12:41.438867 [ 18720619 ] {} <Information> KeeperDispatcher: Found dead session 9, will try to close it
2024.11.07 20:12:41.438901 [ 18720619 ] {} <Information> KeeperDispatcher: Dead session close request pushed
<SNIP>

It looks like keeper-3 is still trying to send messages to one of the other keepers, which I think is now dead. In the CI run that failed, these errors start appearing around here. My local run that's stalled seems to be trying to connect to [::1]:19101. It claims the socket is "not opened yet", which I don't really understand. But there is indeed no one listening on that port, so far as I can tell:

bnaecker@flint : ~/file-cabinet/oxide/omicron/oximeter/db $ lsof -i 6tcp@localhost:19101
bnaecker@flint : ~/file-cabinet/oxide/omicron/oximeter/db $

I'm not sure if it's relevant, but the last thing we're attempting to do in the test itself is insert some samples. I don't know why this would have changed when we start inserting data with the native client.

@bnaecker commented Nov 8, 2024

Also potentially interesting is that this wayward keeper doesn't seem to shut down when I interrupt the stalled test locally. It's still trying its best to make a connection to [::1]:19101, which of course will never work.

@bnaecker commented Nov 8, 2024

I know very little about the Raft protocol, but I do see this error in the keeper that refuses to die:

2024.11.07 20:26:30.302443 [ 18720267 ] {} <Error> RaftInstance: 1 nodes (out of 2, 2 including learners) are not responding longer than 10000 ms, at least 2 nodes (including leader) should be alive to proceed commit
2024.11.07 20:26:30.302599 [ 18720267 ] {} <Warning> RaftInstance: we cannot yield the leadership of this small cluster

It seems like we're trying to stop the cluster leader, which might be a problem. I find it unlikely that we've never tried to stop the leader, given there are only 3 nodes, but maybe something about the native connections we're using makes this more apparent.

The second thing I see that's interesting is this doc page from ClickHouse. That describes manually recovering a Raft quorum, which might be needed in some cases. In particular:

> Cluster configuration can be dynamically configured, but there are some limitations. Reconfiguration relies on Raft also so to add/remove a node from the cluster you need to have a quorum. If you lose too many nodes in your cluster at the same time without any chance of starting them again, Raft will stop working and not allow you to reconfigure your cluster using the conventional way.

This seems to be exactly what we're doing (intentionally): killing 2 of the 3 nodes to destroy the quorum. I imagine the "without any chance of starting them again" bit of that doc is important, but it seems possible we're somehow in this dead state.

@bnaecker commented Nov 8, 2024

Ok, well this part probably explains why the failure occurred after changing to the native client for inserting data:

info!(log, "Attempting to insert samples without keeper quorum");
let samples = oximeter_test_utils::generate_test_samples(
    input.n_projects,
    input.n_instances,
    input.n_cpus,
    input.n_samples,
);
// We are expecting a timeout here. Typical inserts take on the order of a
// few hundred milliseconds. To shorten the length of our test, we create a
// new client with a shorter timeout.
let client1_short_timeout = Client::new_with_request_timeout(
    deployment.http_addr(1.into()),
    deployment.native_addr(1.into()),
    log,
    Duration::from_secs(2),
);
// We have lost quorum and should not be able to insert.
client1_short_timeout
    .insert_samples(&samples)
    .await
    .expect_err("insert succeeded without keeper quorum");

We're setting a shorter timeout on the insertion request, but that only applies to the HTTP client itself. The native TCP client does not have any explicit timeouts set at all, so the insertion request should just hang forever, since both sides are alive. I'll see what happens if I set a shorter timeout on the use of the native connection.
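
A minimal sketch of setting such a timeout, written against a generic insert future rather than the actual native-client API; the wrapper and the error type here are made up for illustration:

use std::{future::Future, time::Duration};
use tokio::time::timeout;

// Illustrative error type, not part of the real client.
#[derive(Debug)]
enum TimeoutOrError<E> {
    // The deadline elapsed before the insert finished.
    TimedOut,
    // The insert finished, but with its own error.
    Inner(E),
}

// Apply a hard deadline to any insert future. When the deadline elapses, the
// future is dropped, abandoning whatever work it was doing.
async fn with_request_timeout<T, E>(
    limit: Duration,
    insert: impl Future<Output = Result<T, E>>,
) -> Result<T, TimeoutOrError<E>> {
    match timeout(limit, insert).await {
        Ok(result) => result.map_err(TimeoutOrError::Inner),
        Err(_elapsed) => Err(TimeoutOrError::TimedOut),
    }
}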

@bnaecker commented Nov 8, 2024

Using the shorter request timeout of 2s does indeed cause the tests to pass pretty reliably. I need to think a bit more about whether that's really what we want to do, but some kind of timeout on the connections seems useful. It's pretty heavy-handed to set a tokio::time::timeout() though: I really want to time the request out if no work is being done, and not do so if the request is proceeding but just taking a while. I don't know how to do that in the code we have today, or in general with Tokio.
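
For reference, one pattern for that kind of idle timeout is a resettable deadline that gets pushed out whenever some progress signal arrives. This is only a sketch; nothing in the current code produces such a progress signal, so the channel here is an assumption:

use std::time::Duration;
use tokio::sync::mpsc;
use tokio::time::{sleep, Instant};

// Sketch of an idle timeout: fail only if no progress event arrives within
// `idle`. The `progress` channel is hypothetical; it would need to be fed by
// whatever reads packets off the native connection.
async fn wait_with_idle_timeout(
    mut progress: mpsc::Receiver<()>,
    idle: Duration,
) -> Result<(), &'static str> {
    let deadline = sleep(idle);
    tokio::pin!(deadline);
    loop {
        tokio::select! {
            event = progress.recv() => match event {
                // The sender hung up: treat that as the request completing.
                None => return Ok(()),
                // Some work happened; push the deadline out again.
                Some(()) => deadline.as_mut().reset(Instant::now() + idle),
            },
            // No progress for `idle`: give up on the request.
            () = &mut deadline => return Err("request made no progress before the idle timeout"),
        }
    }
}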

@andrewjstone (Contributor) commented:

> I know very little about the Raft protocol, but I do see this error in the keeper that refuses to die:
>
> 2024.11.07 20:26:30.302443 [ 18720267 ] {} <Error> RaftInstance: 1 nodes (out of 2, 2 including learners) are not responding longer than 10000 ms, at least 2 nodes (including leader) should be alive to proceed commit
> 2024.11.07 20:26:30.302599 [ 18720267 ] {} <Warning> RaftInstance: we cannot yield the leadership of this small cluster
>
> It seems like we're trying to stop the cluster leader, which might be a problem. I find it unlikely that we've never tried to stop the leader, given there are only 3 nodes, but maybe something about the native connections we're using makes this more apparent.

Stopping the leader is totally fine. In this case, though, it looks like the leader is the one that wouldn't stop, and a protocol optimization has determined that the leader doesn't yield in two-node clusters.

> The second thing I see that's interesting is this doc page from ClickHouse. That describes manually recovering a Raft quorum, which might be needed in some cases. In particular:
>
> Cluster configuration can be dynamically configured, but there are some limitations. Reconfiguration relies on Raft also so to add/remove a node from the cluster you need to have a quorum. If you lose too many nodes in your cluster at the same time without any chance of starting them again, Raft will stop working and not allow you to reconfigure your cluster using the conventional way.

We definitely do not need to do this. This is for the case when quorum is permanently gone. Think 2 out of 3 disks catching fire. In that case we essentially promote the remaining node into a single-node cluster. This virtually guarantees data loss, but it's the only thing you can do in such a case.

> This seems to be exactly what we're doing (intentionally): killing 2 of the 3 nodes to destroy the quorum. I imagine the "without any chance of starting them again" bit of that doc is important, but it seems possible we're somehow in this dead state.

We are intentionally losing keeper quorum temporarily to test that we can still perform queries from ClickHouse, as queries don't require keeper operation. We are also checking that we cannot insert when we don't have quorum, as the cluster becomes read-only in this case. Later in the test we bring the keeper in the two-node cluster back up to ensure we can write again.

@bnaecker commented Nov 8, 2024

Thanks @andrewjstone, that's very helpful. It makes sense that the manual process is for a cluster that cannot ever recover on its own, though that wasn't really clear from the docs themselves.

> We are intentionally losing keeper quorum temporarily to test that we can still perform queries from ClickHouse, as queries don't require keeper operation. We are also checking that we cannot insert when we don't have quorum, as the cluster becomes read-only in this case.

Ok, so this test checks that, with no quorum, we can still read data but cannot write it? I'm a little surprised the replica doesn't time out or fail the insertion itself, and that we need to rely on a client-side timeout. I'm a bit loath to do that, since the only way I can see is the blunt instrument that is tokio::time::timeout() around the query future itself. Might there be a setting we can use to ask the replica to fail an insertion if there's no quorum? I remember a lot of settings that seem related, but nothing that is exactly what we're looking for.

@andrewjstone (Contributor) commented:

> Using the shorter request timeout of 2s does indeed cause the tests to pass pretty reliably. I need to think a bit more about whether that's really what we want to do, but some kind of timeout on the connections seems useful. It's pretty heavy-handed to set a tokio::time::timeout() though: I really want to time the request out if no work is being done, and not do so if the request is proceeding but just taking a while. I don't know how to do that in the code we have today, or in general with Tokio.

This is just a short timeout for the test, to make it go faster. We are checking that we fail to insert when we don't have quorum. It's a negative test, in which I was looking for an error. The only way to get an error without quorum is to have the client time out.

If we were directly making a request to the keeper cluster, we'd fail with a timeout or immediately (depending upon the implementation) when there is no leader. But we aren't doing that. We are talking to the ClickHouse servers, which are still up and talk to the keepers in the background. My guess is that they presume errors like this to be transient (due to leader rotation or yield) or leader failure, and don't want to immediately fail client requests because of that by default. There may be a setting we can add to make the request fail immediately in this case. That would be my preference as well, frankly.

@bnaecker commented Nov 8, 2024

I'm looking back at the settings now, to see if there's one we can apply to inserts that occur without a keeper quorum.

@andrewjstone set me straight in chat, and I wanted to record my understanding here. I was under the impression that the keeper was somehow acting differently in the original PR commit (which did not have a query timeout) than it does on main (which uses the HTTP client with a timeout). That's not correct. The keeper is doing exactly what it has always been doing: trying to replicate its log to the other nodes, which don't exist, causing the insertion request to stall too. We just timed that out at 2s before.

That's exactly the same thing that the second commit in this PR does: apply the same timeout to the interactions via the native client too. When that insertion times out, the test code moves on to the next step, which is bringing the quorum back. The keepers are thus happier, and things continue.

@andrewjstone (Contributor) left a review comment:

Looks good!

@bnaecker enabled auto-merge (squash) on November 9, 2024 19:51
@bnaecker merged commit 53b77fd into main on Nov 10, 2024
17 checks passed
@bnaecker deleted the insert-samples-with-native-client branch on November 10, 2024 03:27