-
Notifications
You must be signed in to change notification settings - Fork 101
Worker Heartbeat: Plumb metrics and migrate to runtime/namespace/client level #1038
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
* worker heartbeat * Address Spencer's comments * wip use client_identity_override as part of key, added test * Refactor almost complete, need to plumb through telemetry to SharedNamespaceWorker * Verified client replacement works, need to update tests and cleanup * formating * clean up * forgot to remove new() now that using builder pattern * Switch to worker_set_key * Replace client test passes, need to write unit tests in worker_registry * cargo test-lint * limit nexus to 1 poller, add tests for worker_registry for heartbeat * PR comments * new test helper * Return error on multi worker register for same namespace and task queue on same client * cargo fmt * Fix registration order, unique task queue for test worker * Remove TEST_Q variable * Missing quotes * CI lint and docker test fix, rename worker_set_key to worker_grouping_key * clippy bug
…eat data (temporalio#1023) * plumb in memory metrics * simplify worker::new(), fix some heartbeat metrics, new test file * CounterImpl, final_heartbeat, more specific metric label dbg_panic msg, counter_with_in_mem and and_then() * Support in-mem metrics when metrics aren't configured * Move sys_info refresh to dedicated thread, use tuner's existing sys info * Format, AtomicCell * Fix unit test * Set dynamic config for WorkerHeartbeatsEnabled and ListWorkersEnabled, remove stale metric previously added * Should not expect heartbeat nexus worker in metrics for non-heartbeating integ test * recv_timeout instead of thread::sleep, use WorkflowService::list_workers directly, WithLabel API improvement * MetricAttributes::NoOp, add mechanism to ignore dupe workers for testing, more tests * More tests, sticky cache miss, plugins * Formatting, fix skip_client_worker_set_check * Cursor found a bug * Lower sleep time, add print for debugging * more prints * use semaphores for worker_heartbeat_failure_metrics * skip_client_worker_set_check for all integ workers * Can't use tokio semaphore in workflow code * use signal to test workflow_slots.last_interval_failure_tasks * Use Notify instead of semaphores, fix test flake * Use eventually() instead of a manual sleep * max_outstanding_workflow_tasks 2
# Conflicts: # client/src/raw.rs # core-c-bridge/src/client.rs # core/src/lib.rs # core/src/worker/client.rs # core/src/worker/mod.rs # tests/common/mod.rs # tests/integ_tests/polling_tests.rs
Sushisource
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving since this was reviewed separately
Sushisource
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, sorry, one thing I want to double check is fixed in this branch before we merge because I just saw it while testing some metrics changes:
I'm seeing:
temporal_long_request_latency_bucket{namespace="default",operation="PollWorkflowTaskQueue",service_name="temporal-core-sdk",task_queue="integ_tester-4ae9d50f7ae94eb5be295f8086003b03",le="2500"} 1
In metrics output, which is I believe the worker heartbeating task queue name, but, it should not be making any poll workflow task calls. Just want to double-check that's fixed in this branch and we have a test for it.
Sushisource
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nevermind that. That's the sticky queue name and I just forgot that was the convention it followed.
| heartbeat_map, | ||
| namespace, | ||
| cancel, | ||
| }) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bug: Heartbeat Failure Due to Error Suppression
The SharedNamespaceWorker::new function can fail, but its error isn't propagated. This leads to a non-functional heartbeat mechanism and an inconsistent ClientWorkerRegistrator state, where heartbeat capability is indicated without a registered callback.
What was changed
Primarily a combination of #983 and #1023
This also turns on heartbeating by default
Why?
Finish and enable new worker heartbeating feature.
Checklist
Closes
How was this tested:
Added unit and integration tests
Note
Enable runtime-configured worker heartbeating, replace SlotManager with ClientWorkerSet, plumb in‑memory metrics and poller timing, and update client/worker/C-bridge APIs with comprehensive tests.
heartbeat_interval; send periodic heartbeats and final shutdown heartbeat withWorkerStatus.RuntimeOptions{ telemetry_options, heartbeat_interval }and builders; updateCoreRuntime::new*to use it.worker_instance_key(UUID), status, and improvedreplace_client(re-registers, returns Result).max_cached_workflows; remove globalTEST_Qin tests.SlotManagerwithClientWorkerSet(registration, grouping key, shared namespace worker for heartbeats).ClientWorker/SharedNamespaceWorkerTraitand heartbeat callback wiring.WorkerClientAPI:identity()(wasget_identity),workers()returnsClientWorkerSet,worker_grouping_key(),record_worker_heartbeat(namespace, Vec<WorkerHeartbeat>),shutdown_worker(..., final_heartbeat).HeartbeatMetricType) andWorkerHeartbeatMetrics; extendCoreMeterto create instruments with in-memory mirrors.Worker::worker_instance_key(); addPluginInfo; exportuuiddep.TemporalCoreRuntimeOptionsgainsworker_heartbeat_duration_millis.temporal_core_worker_replace_clientreturns error string on failure.ClientWorker,ClientWorkerSet,HeartbeatCallback,SharedNamespaceWorkerTraitand worker-group key accessors.Written by Cursor Bugbot for commit c30e25e. This will update automatically on new commits. Configure here.