Insert samples using the native client #7013

Merged
3 commits merged into main from insert-samples-with-native-client on Nov 10, 2024

Conversation

@bnaecker (Collaborator) commented Nov 7, 2024

  • Remove unneeded and error-prone n_columns and n_rows fields from the Block type. These are now methods that delegate to the actual column arrays, which are the source of truth anyway (a rough sketch of this shape follows the list).
  • Add methods for extracting fields from a sample as a set of data blocks, one per field table.
  • Add methods to extract measurements from a sample as a block destined for one table. Take care to correctly construct missing samples, especially for histograms. Also implement the FromBlock trait for samples, to ensure we can extract the raw data as well. This is only used in tests, in this commit.
  • Insert fields and measurements from a sample as a data block, using the native interface.
  • Fix serde of UUIDs in native format, which doesn't match the documentation, at least for our current version of ClickHouse.
  • Remove code serializing fields and measurements into JSON.
  • Closes #6884 (Use native ClickHouse protocol to insert samples)
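
A rough sketch of the Block change mentioned in the first bullet, to make the shape concrete. This is a simplified, hypothetical version of the type, not the actual oximeter-db definitions:

// Hypothetical, simplified Block: row and column counts are derived from the
// column data itself instead of being stored (and kept in sync) as separate
// fields.
use std::collections::BTreeMap;

pub enum Column {
    UInt64(Vec<u64>),
    String(Vec<String>),
    // ... other ClickHouse column types elided
}

impl Column {
    fn len(&self) -> usize {
        match self {
            Column::UInt64(values) => values.len(),
            Column::String(values) => values.len(),
        }
    }
}

pub struct Block {
    // The columns themselves are the single source of truth.
    columns: BTreeMap<String, Column>,
}

impl Block {
    /// Number of columns, delegated to the column map.
    pub fn n_columns(&self) -> usize {
        self.columns.len()
    }

    /// Number of rows, delegated to any one column; all columns are required
    /// to have the same length.
    pub fn n_rows(&self) -> usize {
        self.columns.values().next().map(Column::len).unwrap_or(0)
    }
}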

@bnaecker force-pushed the insert-samples-with-native-client branch 2 times, most recently from 9acd3a6 to 0e03d95 on November 8, 2024 00:34

@bnaecker commented Nov 8, 2024

Interesting. The test failure here is the integration test that tries to insert / select various data on a replicated cluster, while killing and restarting both data replicas and keeper nodes. I've run this test locally a few times; it works sometimes, but more often than not it hangs here:

bnaecker@flint : ~/file-cabinet/oxide/omicron/oximeter/db $ cargo nextest run test_cluster --nocapture
info: experimental features enabled: setup-scripts
    Finished `test` profile [unoptimized + debuginfo] target(s) in 0.30s
------------
 Nextest run ID 00ce8fb1-176a-4d65-8d5f-5b5b9e4b0119 with nextest profile: default
    Starting 1 test across 3 binaries (229 tests skipped)
       START             oximeter-db::integration_test test_cluster

running 1 test
log file: /var/folders/67/2tlym22x1r3d2kwbh84j298w0000gn/T/integration_test-f999b8c0e7b2a5bc-test_cluster.9889.0.log
note: configured to log to "/var/folders/67/2tlym22x1r3d2kwbh84j298w0000gn/T/integration_test-f999b8c0e7b2a5bc-test_cluster.9889.0.log"
Deploying keeper: /var/folders/67/2tlym22x1r3d2kwbh84j298w0000gn/T/integration_test-f999b8c0e7b2a5bc-test_cluster.9889.1-oximeter-clickward-test/keeper-1
Deploying keeper: /var/folders/67/2tlym22x1r3d2kwbh84j298w0000gn/T/integration_test-f999b8c0e7b2a5bc-test_cluster.9889.1-oximeter-clickward-test/keeper-3
Deploying keeper: /var/folders/67/2tlym22x1r3d2kwbh84j298w0000gn/T/integration_test-f999b8c0e7b2a5bc-test_cluster.9889.1-oximeter-clickward-test/keeper-2
Deploying clickhouse server: /var/folders/67/2tlym22x1r3d2kwbh84j298w0000gn/T/integration_test-f999b8c0e7b2a5bc-test_cluster.9889.1-oximeter-clickward-test/clickhouse-1
Deploying clickhouse server: /var/folders/67/2tlym22x1r3d2kwbh84j298w0000gn/T/integration_test-f999b8c0e7b2a5bc-test_cluster.9889.1-oximeter-clickward-test/clickhouse-2
Updating config to include new replica: 3
Deploying clickhouse server: /var/folders/67/2tlym22x1r3d2kwbh84j298w0000gn/T/integration_test-f999b8c0e7b2a5bc-test_cluster.9889.1-oximeter-clickward-test/clickhouse-3
Stopping clickhouse server clickhouse-1: pid - 9902, child pid - 9906
Deploying clickhouse server: /var/folders/67/2tlym22x1r3d2kwbh84j298w0000gn/T/integration_test-f999b8c0e7b2a5bc-test_cluster.9889.1-oximeter-clickward-test/clickhouse-1
Updating config to remove keeper: 2
Stopping keeper: /var/folders/67/2tlym22x1r3d2kwbh84j298w0000gn/T/integration_test-f999b8c0e7b2a5bc-test_cluster.9889.1-oximeter-clickward-test/keeper-2 at pid 9899
Stopping keeper: /var/folders/67/2tlym22x1r3d2kwbh84j298w0000gn/T/integration_test-f999b8c0e7b2a5bc-test_cluster.9889.1-oximeter-clickward-test/keeper-1 at pid 9892
        SLOW [> 60.000s] oximeter-db::integration_test test_cluster
test test_cluster has been running for over 60 seconds
        SLOW [>120.000s] oximeter-db::integration_test test_cluster
        SLOW [>180.000s] oximeter-db::integration_test test_cluster
        SLOW [>240.000s] oximeter-db::integration_test test_cluster

That shows we've killed the first keeper successfully, and also killed the second keeper. I can confirm that at least one of the PIDs is indeed gone, but keeper-1 still appears alive (PID 9989):

bnaecker@flint : ~/file-cabinet/oxide/omicron/oximeter/db $ pgrep clickhouse
9895
9905
9907
9944
9945
9989
9990

That process doesn't seem to be doing anything though. The log file has messages from a few minutes ago:

bnaecker@flint : ~/file-cabinet/oxide/omicron/oximeter/db $ tail /var/folders/67/2tlym22x1r3d2kwbh84j298w0000gn/T/integration_test-f999b8c0e7b2a5bc-test_cluster.9889.1-oximeter-clickward-test/keeper-2/logs/clickhouse-keeper.log
2024.11.07 20:05:14.808814 [ 18720048 ] {} <Trace> RaftInstance: commit_cv_ notify (local thread)
2024.11.07 20:05:14.808817 [ 18720048 ] {} <Trace> RaftInstance: cancel existing timer
2024.11.07 20:05:14.808821 [ 18720048 ] {} <Trace> RaftInstance: re-schedule election timer
2024.11.07 20:05:14.808823 [ 18720052 ] {} <Trace> RaftInstance: commit_cv_ wake up
2024.11.07 20:05:14.808826 [ 18720048 ] {} <Trace> RaftInstance: batch size hint: 0 bytes
2024.11.07 20:05:14.808834 [ 18720048 ] {} <Debug> RaftInstance: Response back a append_entries_response message to 3 with Accepted=1, Term=1, NextIndex=626
2024.11.07 20:05:14.808828 [ 18720052 ] {} <Debug> RaftInstance: commit upto 625, current idx 624
2024.11.07 20:05:14.808854 [ 18720052 ] {} <Trace> RaftInstance: commit upto 625, current idx 625
2024.11.07 20:05:14.808863 [ 18720052 ] {} <Debug> RaftInstance: DONE: commit upto 625, current idx 625
2024.11.07 20:05:14.808866 [ 18720052 ] {} <Trace> RaftInstance: commit_cv_ sleep
bnaecker@flint : ~/file-cabinet/oxide/omicron/oximeter/db $ date
Thu Nov  7 20:12:12 PST 2024

What does appear to be moving is the other keeper. I see a spew of logs like this:

bnaecker@flint : ~/file-cabinet/oxide/omicron/oximeter/db $ tail -f /var/folders/67/2tlym22x1r3d2kwbh84j298w0000gn/T/integration_test-f999b8c0e7b2a5bc-test_cluster.9889.1-oximeter-clickward-test/keeper-3/logs/clickhouse-keeper.log
2024.11.07 20:12:41.279210 [ 18720275 ] {} <Error> RaftInstance: 1 nodes (out of 2, 2 including learners) are not responding longer than 10000 ms, at least 2 nodes (including leader) should be alive to proceed commit
2024.11.07 20:12:41.279471 [ 18720275 ] {} <Warning> RaftInstance: we cannot yield the leadership of this small cluster
2024.11.07 20:12:41.279589 [ 18720275 ] {} <Debug> RaftInstance: heartbeat timeout for 1
2024.11.07 20:12:41.279632 [ 18720275 ] {} <Trace> RaftInstance: (update) new target priority: 1
2024.11.07 20:12:41.279708 [ 18720275 ] {} <Debug> RaftInstance: reset RPC client for peer 1
2024.11.07 20:12:41.280211 [ 18720275 ] {} <Trace> RaftInstance: asio client created: 0x14ed8fc18
2024.11.07 20:12:41.280302 [ 18720275 ] {} <Trace> RaftInstance: 0x14ed8fc18 reconnect peer 1
2024.11.07 20:12:41.280368 [ 18720275 ] {} <Trace> RaftInstance: new rpc for peer 1 is created, reset next log idx (685) and matched log idx (0)
2024.11.07 20:12:41.280403 [ 18720275 ] {} <Trace> RaftInstance: send request to 1
2024.11.07 20:12:41.280434 [ 18720275 ] {} <Trace> RaftInstance: last_log_idx: 684, starting_idx: 1, cur_nxt_idx: 685
2024.11.07 20:12:41.280500 [ 18720275 ] {} <Debug> RaftInstance: append_entries for 1 with LastLogIndex=684, LastLogTerm=1, EntriesLength=0, CommitIndex=682, Term=1, peer_last_sent_idx 685
2024.11.07 20:12:41.280534 [ 18720275 ] {} <Trace> RaftInstance: EMPTY PAYLOAD
2024.11.07 20:12:41.280585 [ 18720275 ] {} <Trace> RaftInstance: send req 3 -> 1, type append_entries_request
2024.11.07 20:12:41.280647 [ 18720275 ] {} <Debug> RaftInstance: socket 0x14ed8fc18 to ::1:19101 is not opened yet
2024.11.07 20:12:41.280747 [ 18720275 ] {} <Trace> RaftInstance: sent
2024.11.07 20:12:41.281509 [ 18720267 ] {} <Trace> RaftInstance: resp of req 3 -> 1, type append_entries_request, failed to connect to peer 1, ::1:19101, error 61, Connection refused
2024.11.07 20:12:41.281605 [ 18720267 ] {} <Trace> RaftInstance: asio client destroyed: 0x14ed8fc18
2024.11.07 20:12:41.438169 [ 18720619 ] {} <Information> KeeperDispatcher: Found dead session 5, will try to close it
2024.11.07 20:12:41.438424 [ 18720619 ] {} <Information> KeeperDispatcher: Dead session close request pushed
2024.11.07 20:12:41.438508 [ 18720619 ] {} <Information> KeeperDispatcher: Found dead session 4, will try to close it
2024.11.07 20:12:41.438588 [ 18720619 ] {} <Information> KeeperDispatcher: Dead session close request pushed
2024.11.07 20:12:41.438686 [ 18720619 ] {} <Information> KeeperDispatcher: Found dead session 7, will try to close it
2024.11.07 20:12:41.438741 [ 18720619 ] {} <Information> KeeperDispatcher: Dead session close request pushed
2024.11.07 20:12:41.438785 [ 18720619 ] {} <Information> KeeperDispatcher: Found dead session 6, will try to close it
2024.11.07 20:12:41.438830 [ 18720619 ] {} <Information> KeeperDispatcher: Dead session close request pushed
2024.11.07 20:12:41.438867 [ 18720619 ] {} <Information> KeeperDispatcher: Found dead session 9, will try to close it
2024.11.07 20:12:41.438901 [ 18720619 ] {} <Information> KeeperDispatcher: Dead session close request pushed
<SNIP>

It looks like keeper-3 is still trying to send messages to one of the other keepers, which I think is now dead. In the CI run that failed, these errors start appearing around here. My local run that's stalled seems to be trying to connect to [::1]:19101. It claims the socket is "not opened yet", which I don't really understand. But there is indeed no one listening on that port, so far as I can tell:

bnaecker@flint : ~/file-cabinet/oxide/omicron/oximeter/db $ lsof -i 6tcp@localhost:19101
bnaecker@flint : ~/file-cabinet/oxide/omicron/oximeter/db $

I'm not sure if it's relevant, but the last thing we're attempting to do in the test itself is insert some samples. I don't know why this would have changed when we start inserting data with the native client.

@bnaecker commented Nov 8, 2024

Also potentially interesting is that this wayward keeper doesn't seem to shut down when I interrupt the stalled test locally. It's still trying its best to make a connection to [::1]:19101, which of course will never work.

@bnaecker commented Nov 8, 2024

I know very little about the Raft protocol, but I do see this error in the keeper that refuses to die:

2024.11.07 20:26:30.302443 [ 18720267 ] {} <Error> RaftInstance: 1 nodes (out of 2, 2 including learners) are not responding longer than 10000 ms, at least 2 nodes (including leader) should be alive to proceed commit
2024.11.07 20:26:30.302599 [ 18720267 ] {} <Warning> RaftInstance: we cannot yield the leadership of this small cluster

It seems like we're trying to stop the cluster leader, which might be a problem. I find it unlikely that we've never tried to stop the leader, given there are only 3 nodes, but maybe something about the native connections we're using makes this more apparent.

The second thing I see that's interesting is this doc page from ClickHouse. That describes manually recovering a Raft quorum, which might be needed in some cases. In particular:

> Cluster configuration can be dynamically configured, but there are some limitations. Reconfiguration relies on Raft also so to add/remove a node from the cluster you need to have a quorum. If you lose too many nodes in your cluster at the same time without any chance of starting them again, Raft will stop working and not allow you to reconfigure your cluster using the conventional way.

This seems to be exactly what we're doing (intentionally): killing 2 of the 3 nodes to destroy the quorum. I imagine the "without any chance of starting them again" bit of that doc is important, but it seems possible we're somehow in this dead state.

@bnaecker commented Nov 8, 2024

Ok, well this part probably explains why the failure occurred after changing to the native client for inserting data:

info!(log, "Attempting to insert samples without keeper quorum");
let samples = oximeter_test_utils::generate_test_samples(
    input.n_projects,
    input.n_instances,
    input.n_cpus,
    input.n_samples,
);
// We are expecting a timeout here. Typical inserts take on the order of a
// few hundred milliseconds. To shorten the length of our test, we create a
// new client with a shorter timeout.
let client1_short_timeout = Client::new_with_request_timeout(
    deployment.http_addr(1.into()),
    deployment.native_addr(1.into()),
    log,
    Duration::from_secs(2),
);
// We have lost quorum and should not be able to insert.
client1_short_timeout
    .insert_samples(&samples)
    .await
    .expect_err("insert succeeded without keeper quorum");

We're setting a shorter timeout on the insertion request, but that only applies to the HTTP client itself. The native TCP client does not have any explicit timeouts set at all, so the insertion request should just hang forever, since both sides are alive. I'll see what happens if I set a shorter timeout on the use of the native connection.
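
A minimal sketch of setting such a timeout, written against a generic insert future rather than the actual native-client API; the wrapper and the error type here are made up for illustration:

use std::{future::Future, time::Duration};
use tokio::time::timeout;

// Illustrative error type, not part of the real client.
#[derive(Debug)]
enum TimeoutOrError<E> {
    // The deadline elapsed before the insert finished.
    TimedOut,
    // The insert finished, but with its own error.
    Inner(E),
}

// Apply a hard deadline to any insert future. When the deadline elapses, the
// future is dropped, abandoning whatever work it was doing.
async fn with_request_timeout<T, E>(
    limit: Duration,
    insert: impl Future<Output = Result<T, E>>,
) -> Result<T, TimeoutOrError<E>> {
    match timeout(limit, insert).await {
        Ok(result) => result.map_err(TimeoutOrError::Inner),
        Err(_elapsed) => Err(TimeoutOrError::TimedOut),
    }
}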

@bnaecker commented Nov 8, 2024

Using the shorter request timeout of 2s does indeed cause the tests to pass pretty reliably. I need to think a bit more about whether that's really what we want to do, but some kind of timeout on the connections seems useful. It's pretty heavy-handed to set a tokio::time::timeout() though: I really want to time the request out if no work is being done, and not do so if the request is proceeding but just taking a while. I don't know how to do that in the code we have today, or in general with Tokio.
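
For reference, one pattern for that kind of idle timeout is a resettable deadline that gets pushed out whenever some progress signal arrives. This is only a sketch; nothing in the current code produces such a progress signal, so the channel here is an assumption:

use std::time::Duration;
use tokio::sync::mpsc;
use tokio::time::{sleep, Instant};

// Sketch of an idle timeout: fail only if no progress event arrives within
// `idle`. The `progress` channel is hypothetical; it would need to be fed by
// whatever reads packets off the native connection.
async fn wait_with_idle_timeout(
    mut progress: mpsc::Receiver<()>,
    idle: Duration,
) -> Result<(), &'static str> {
    let deadline = sleep(idle);
    tokio::pin!(deadline);
    loop {
        tokio::select! {
            event = progress.recv() => match event {
                // The sender hung up: treat that as the request completing.
                None => return Ok(()),
                // Some work happened; push the deadline out again.
                Some(()) => deadline.as_mut().reset(Instant::now() + idle),
            },
            // No progress for `idle`: give up on the request.
            () = &mut deadline => return Err("request made no progress before the idle timeout"),
        }
    }
}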

@andrewjstone (Contributor) commented:

> I know very little about the Raft protocol, but I do see this error in the keeper that refuses to die:
>
> 2024.11.07 20:26:30.302443 [ 18720267 ] {} <Error> RaftInstance: 1 nodes (out of 2, 2 including learners) are not responding longer than 10000 ms, at least 2 nodes (including leader) should be alive to proceed commit
> 2024.11.07 20:26:30.302599 [ 18720267 ] {} <Warning> RaftInstance: we cannot yield the leadership of this small cluster
>
> It seems like we're trying to stop the cluster leader, which might be a problem. I find it unlikely that we've never tried to stop the leader, given there are only 3 nodes, but maybe something about the native connections we're using makes this more apparent.

Stopping the leader is totally fine. In this case, though, it looks like the leader is the one that wouldn't stop, and a protocol optimization has determined that the leader doesn't yield in two-node clusters.

> The second thing I see that's interesting is this doc page from ClickHouse. That describes manually recovering a Raft quorum, which might be needed in some cases. In particular:
>
> Cluster configuration can be dynamically configured, but there are some limitations. Reconfiguration relies on Raft also so to add/remove a node from the cluster you need to have a quorum. If you lose too many nodes in your cluster at the same time without any chance of starting them again, Raft will stop working and not allow you to reconfigure your cluster using the conventional way.

We definitely do not need to do this. This is for the case when quorum is permanently gone. Think 2 out of 3 disks catching fire. In that case we essentially promote the remaining node into a single-node cluster. This virtually guarantees data loss, but it's the only thing you can do in such a case.

> This seems to be exactly what we're doing (intentionally): killing 2 of the 3 nodes to destroy the quorum. I imagine the "without any chance of starting them again" bit of that doc is important, but it seems possible we're somehow in this dead state.

We are intentionally losing keeper quorum temporarily to test that we can still perform queries from ClickHouse, as queries don't require keeper operation. We are also checking that we cannot insert when we don't have quorum, as the cluster becomes read-only in this case. Later in the test we bring the keeper in the two-node cluster back up to ensure we can write again.

@bnaecker commented Nov 8, 2024

Thanks @andrewjstone, that's very helpful. It makes sense that the manual process is for a cluster that cannot ever recover on its own, though that wasn't really clear from the docs themselves.

> We are intentionally losing keeper quorum temporarily to test that we can still perform queries from ClickHouse, as queries don't require keeper operation. We are also checking that we cannot insert when we don't have quorum, as the cluster becomes read-only in this case.

Ok, so this test checks that, with no quorum, we can still read data but cannot write it? I'm a little surprised the replica doesn't time out or fail the insertion itself, and that we need to rely on a client-side timeout. I'm a bit loath to do that, since the only way I can see is the blunt instrument that is tokio::time::timeout() around the query future itself. Might there be a setting we can use to ask the replica to fail an insertion if there's no quorum? I remember a lot of settings that seem related, but nothing that is exactly what we're looking for.

@andrewjstone (Contributor) commented:

> Using the shorter request timeout of 2s does indeed cause the tests to pass pretty reliably. I need to think a bit more about whether that's really what we want to do, but some kind of timeout on the connections seems useful. It's pretty heavy-handed to set a tokio::time::timeout() though: I really want to time the request out if no work is being done, and not do so if the request is proceeding but just taking a while. I don't know how to do that in the code we have today, or in general with Tokio.

This is just a short timeout for the test, to make it go faster. We are checking that we fail to insert when we don't have quorum. It's a negative test, in which I was looking for an error. The only way to get an error without quorum is to have the client time out.

If we were directly making a request to the keeper cluster, we'd fail with a timeout or immediately (depending upon the implementation) when there is no leader. But we aren't doing that. We are talking to the ClickHouse servers, which are still up and talk to the keepers in the background. My guess is that they presume errors like this to be transient (due to leader rotation or yield) or leader failure, and don't want to immediately fail client requests because of that by default. There may be a setting we can add to make the request fail immediately in this case. That would be my preference as well, frankly.

@bnaecker commented Nov 8, 2024

I'm looking back at the settings now, to see if there's one we can apply to inserts that occur without a keeper quorum.

@andrewjstone set me straight in chat, and I wanted to record my understanding here. I was under the impression that the keeper was somehow acting differently in the original PR commit (which did not have a query timeout) than it does on main (which uses the HTTP client with a timeout). That's not correct. The keeper is doing exactly what it has always been doing: trying to replicate its log to the other nodes, which don't exist, causing the insertion request to stall too. We just timed that out at 2s before.

That's exactly the same thing that the second commit in this PR does: apply the same timeout to the interactions via the native client too. When that insertion times out, the test code moves on to the next step, which is bringing the quorum back. The keepers are thus happier, and things continue.

@andrewjstone (Contributor) left a review comment:

Looks good!

@bnaecker enabled auto-merge (squash) on November 9, 2024 19:51
@bnaecker merged commit 53b77fd into main on Nov 10, 2024
17 checks passed
@bnaecker deleted the insert-samples-with-native-client branch on November 10, 2024 03:27