Description
When running a two-replica, three-coordinator (keeper) ClickHouse setup on Helios, I came across an issue where the ClickHouse servers were unable to sync with the keepers.
The 3 keepers were healthy and had formed a quorum, but the servers kept timing out. I could not find any issues with the configuration files.
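For anyone trying to reproduce this, one way to confirm that each keeper is up and has a role in the quorum is the ZooKeeper-style four-letter-word commands that ClickHouse Keeper answers on its client port. The following is a minimal sketch, not the exact check used here. It assumes the keepers listen on 127.0.0.1:9181, 9182 and 9183 (the hosts shown in the server log further down) and that `ruok` and `mntr` are allowed by the keeper's four-letter-word whitelist. The helper names `four_letter_word`, `server_state` and `check_keepers` are hypothetical.

```python
import socket
from typing import Dict, Iterable, Optional


def four_letter_word(host: str, port: int, command: str = "ruok",
                     timeout: float = 5.0) -> str:
    """Send a four-letter-word command to a keeper and return its reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def server_state(mntr_reply: str) -> Optional[str]:
    """Pull the zk_server_state value (e.g. leader, follower) out of a mntr reply."""
    for line in mntr_reply.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0] == "zk_server_state":
            return parts[1].strip()
    return None


def check_keepers(host: str, ports: Iterable[int]) -> Dict[int, str]:
    """Return a short status string per keeper port (ports are an assumption)."""
    report = {}
    for port in ports:
        try:
            alive = four_letter_word(host, port, "ruok")
            state = server_state(four_letter_word(host, port, "mntr"))
            report[port] = f"ruok={alive!r} state={state}"
        except OSError as exc:  # refused connection, timeout, etc.
            report[port] = f"unreachable: {exc}"
    return report
```

Calling `check_keepers("127.0.0.1", (9181, 9182, 9183))` should report `imok` for each healthy keeper, with one leader and the rest followers when a quorum has formed.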
To rule out any networking issues, I decided to run the same config with all of the replica and keeper nodes installed on a single server, on both Helios and Linux.
Turns out, the same configuration works with the Linux binary (the one created by the garbage compactor), but errors out with the Helios binary:
On Linux:
coatlicue@pop-os:~/src/ch-test/22.8.9.24-linux/oximeter_cluster/r1$ ./clickhouse client --port 9001
ClickHouse client version 22.8.9.24 (official build).
Connecting to localhost:9001 as user default.
Connected to ClickHouse server version 22.8.9 revision 54460.
oximeter_cluster node 1 :) SHOW CLUSTERS
SHOW CLUSTERS
Query id: 8726842f-c2ce-4f1f-b139-b4119c792e3d
┌─cluster──────────┐
│ oximeter_cluster │
└──────────────────┘
1 row in set. Elapsed: 0.002 sec.
oximeter_cluster node 1 :) CREATE DATABASE IF NOT EXISTS oximeter ON CLUSTER oximeter_cluster;
CREATE DATABASE IF NOT EXISTS oximeter ON CLUSTER oximeter_cluster
Query id: fd1fcc1b-6f25-48f7-8533-884e0acd7f99
┌─host──────┬─port─┬─status─┬─error─┬─num_hosts_remaining─┬─num_hosts_active─┐
│ 127.0.0.1 │ 9002 │      0 │       │                   1 │                0 │
│ 127.0.0.1 │ 9001 │      0 │       │                   0 │                0 │
└───────────┴──────┴────────┴───────┴─────────────────────┴──────────────────┘
2 rows in set. Elapsed: 0.135 sec.
oximeter_cluster node 1 :) exit
Bye.
coatlicue@pop-os:~/src/ch-test/22.8.9.24-linux/oximeter_cluster/r1$ ./clickhouse client --port 9002
ClickHouse client version 22.8.9.24 (official build).
Connecting to localhost:9002 as user default.
Connected to ClickHouse server version 22.8.9 revision 54460.
oximeter_cluster node 2 :) SHOW DATABASES
SHOW DATABASES
Query id: c6f390cf-cd8f-4062-980a-5e920f945216
┌─name───────────────┐
│ INFORMATION_SCHEMA │
│ default            │
│ information_schema │
│ oximeter           │<-- Database just created from the other node
│ system             │
└────────────────────┘
5 rows in set. Elapsed: 0.002 sec.
On Helios:
karen@atrium ~/src/ch-testing/r1 $ ./clickhouse client --port 9001
ClickHouse client version 22.8.9.1.
Connecting to localhost:9001 as user default.
Connected to ClickHouse server version 22.8.9 revision 54460.
oximeter_cluster node 1 :) SHOW CLUSTERS
SHOW CLUSTERS
Query id: 7e2155b8-73d6-497b-af1c-30932e5e6442
┌─cluster──────────┐
│ oximeter_cluster │
└──────────────────┘
1 row in set. Elapsed: 0.002 sec.
oximeter_cluster node 1 :) CREATE DATABASE IF NOT EXISTS oximeter ON CLUSTER oximeter_cluster;
CREATE DATABASE IF NOT EXISTS oximeter ON CLUSTER oximeter_cluster
Query id: af5bde07-3f38-484d-b7e7-1b49ae92e105
0 rows in set. Elapsed: 16.256 sec.
Received exception from server (version 22.8.9):
Code: 999. DB::Exception: Received from localhost:9001. Coordination::Exception. Coordination::Exception: Session expired (Session expired). (KEEPER_EXCEPTION)
From the logs on the Helios machine:
2023.07.28 04:39:35.201191 [ 212 ] {} <Error> ZooKeeperClient: Code: 999. Coordination::Exception: Operation timeout (no response) for request Get for path: /keeper/api_version (Operation timeout). (KEEPER_EXCEPTION), Stack trace (when copying this message, always include the lines below):
0. Poco::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int) @ 14a9c432 in /home/karen/src/ch-testing/r1/clickhouse
1. DB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int, bool) @ d3f0b20 in /home/karen/src/ch-testing/r1/clickhouse
2. Coordination::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, Coordination::Error, int) @ 12d7055a in /home/karen/src/ch-testing/r1/clickhouse
3. Coordination::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, Coordination::Error) @ 12d7076a in /home/karen/src/ch-testing/r1/clickhouse
4. Coordination::ZooKeeper::receiveThread() @ 12da20ee in /home/karen/src/ch-testing/r1/clickhouse
5. void std::__1::__function::__policy_invoker<void ()>::__call_impl<std::__1::__function::__default_alloc_func<ThreadFromGlobalPool::ThreadFromGlobalPool<Coordination::ZooKeeper::ZooKeeper(std::__1::vector<Coordination::ZooKeeper::Node, std::__1::allocator<Coordination::ZooKeeper::Node> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, Poco::Timespan, Poco::Timespan, Poco::Timespan, std::__1::shared_ptr<DB::ZooKeeperLog>)::$_1>(Coordination::ZooKeeper::ZooKeeper(std::__1::vector<Coordination::ZooKeeper::Node, std::__1::allocator<Coordination::ZooKeeper::Node> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, Poco::Timespan, Poco::Timespan, Poco::Timespan, std::__1::shared_ptr<DB::ZooKeeperLog>)::$_1&&)::'lambda'(), void ()> >(std::__1::__function::__policy_storage const*) @ 12da6133 in /home/karen/src/ch-testing/r1/clickhouse
6. ThreadPoolImpl<std::__1::thread>::worker(std::__1::__list_iterator<std::__1::thread, void*>) @ 1468ddd4 in /home/karen/src/ch-testing/r1/clickhouse
7. void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, void ThreadPoolImpl<std::__1::thread>::scheduleImpl<void>(std::__1::function<void ()>, int, std::__1::optional<unsigned long>)::'lambda0'()> >(void*) @ 1468fc35 in /home/karen/src/ch-testing/r1/clickhouse
8. _thrp_setup @ 111da7 in /lib/amd64/libc.so.1
(version 22.8.9.1)
2023.07.28 04:39:35.201493 [ 194 ] {} <Trace> ZooKeeperClient: Failed to get API version
2023.07.28 04:39:35.201593 [ 194 ] {} <Trace> ZooKeeper: Initialized, hosts: 127.0.0.1:9181,127.0.0.1:9182,127.0.0.1:9183
2023.07.28 04:39:35.202108 [ 194 ] {} <Error> virtual bool DB::DDLWorker::initializeMainThread(): Code: 999. Coordination::Exception: Session expired (Session expired). (KEEPER_EXCEPTION), Stack trace (when copying this message, always include the lines below):
0. Poco::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int) @ 14a9c432 in /home/karen/src/ch-testing/r1/clickhouse
1. DB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int, bool) @ d3f0b20 in /home/karen/src/ch-testing/r1/clickhouse
2. Coordination::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, Coordination::Error, int) @ 12d7055a in /home/karen/src/ch-testing/r1/clickhouse
3. Coordination::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, Coordination::Error) @ 12d7076a in /home/karen/src/ch-testing/r1/clickhouse
4. Coordination::ZooKeeper::pushRequest(Coordination::ZooKeeper::RequestInfo&&) @ 12da3917 in /home/karen/src/ch-testing/r1/clickhouse
5. Coordination::ZooKeeper::create(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, bool, bool, std::__1::vector<Coordination::ACL, std::__1::allocator<Coordination::ACL> > const&, std::__1::function<void (Coordination::CreateResponse const&)>) @ 12da428b in /home/karen/src/ch-testing/r1/clickhouse
6. zkutil::ZooKeeper::asyncTryCreateNoThrow(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int) @ 12d769e4 in /home/karen/src/ch-testing/r1/clickhouse
7. zkutil::ZooKeeper::createImpl(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >&) @ 12d7654f in /home/karen/src/ch-testing/r1/clickhouse
8. zkutil::ZooKeeper::createIfNotExists(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) @ 12d76e4a in /home/karen/src/ch-testing/r1/clickhouse
9. zkutil::ZooKeeper::createAncestors(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) @ 12d76f96 in /home/karen/src/ch-testing/r1/clickhouse
10. DB::DDLWorker::initializeMainThread() @ 1162e832 in /home/karen/src/ch-testing/r1/clickhouse
11. DB::DDLWorker::runMainThread() @ 116203c0 in /home/karen/src/ch-testing/r1/clickhouse
12. ThreadFromGlobalPool::ThreadFromGlobalPool<void (DB::DDLWorker::*)(), DB::DDLWorker*>(void (DB::DDLWorker::*&&)(), DB::DDLWorker*&&)::'lambda'()::operator()() @ 116302b4 in /home/karen/src/ch-testing/r1/clickhouse
13. ThreadPoolImpl<std::__1::thread>::worker(std::__1::__list_iterator<std::__1::thread, void*>) @ 1468ddd4 in /home/karen/src/ch-testing/r1/clickhouse
14. void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, void ThreadPoolImpl<std::__1::thread>::scheduleImpl<void>(std::__1::function<void ()>, int, std::__1::optional<unsigned long>)::'lambda0'()> >(void*) @ 1468fc35 in /home/karen/src/ch-testing/r1/clickhouse
15. _thrp_setup @ 111da7 in /lib/amd64/libc.so.1
(version 22.8.9.1)
I have absolutely no idea how to debug/fix the garbage compactor build scripts. I would be happy to sync up with someone who can spare a bit of time to help me, perhaps @citrus-it, @jclulow or @davepacheco?