@yuandrew
Contributor

@yuandrew yuandrew commented Oct 17, 2025

What was changed

Primarily a combination of #983 and #1023

This also turns on heartbeating by default

Why?

Finish and enable new worker heartbeating feature.

Checklist

  1. Closes

  2. How was this tested:

Added unit and integration tests

  3. Any docs updates needed?

Note

Enable runtime-configured worker heartbeating, replace SlotManager with ClientWorkerSet, plumb in‑memory metrics and poller timing, and update client/worker/C-bridge APIs with comprehensive tests.

  • Core:
    • Introduce worker heartbeating with runtime-level heartbeat_interval; send periodic heartbeats and final shutdown heartbeat with WorkerStatus.
    • Add RuntimeOptions{ telemetry_options, heartbeat_interval } and builders; update CoreRuntime::new* to use it.
    • Track last successful poll times in pollers; expose slot supplier kind; propagate sticky/non-sticky poller info.
    • Worker now has worker_instance_key (UUID), status, and improved replace_client (re-registers, returns Result).
    • Refactor worker initialization: compute sticky queue via max_cached_workflows; remove global TEST_Q in tests.
  • Client:
    • Replace SlotManager with ClientWorkerSet (registration, grouping key, shared namespace worker for heartbeats).
    • New ClientWorker/SharedNamespaceWorkerTrait and heartbeat callback wiring.
    • Update WorkerClient API: identity() (was get_identity), workers() returns ClientWorkerSet, worker_grouping_key(), record_worker_heartbeat(namespace, Vec<WorkerHeartbeat>), shutdown_worker(..., final_heartbeat).
  • Telemetry/Metrics:
    • Add in-memory metric tracking (HeartbeatMetricType) and WorkerHeartbeatMetrics; extend CoreMeter to create instruments with in-memory mirrors.
    • Counters/gauges/histograms used by heartbeat now dual-record to in-memory stores.
  • Core-API:
    • Add Worker::worker_instance_key(); add PluginInfo; export uuid dep.
  • C-bridge:
    • TemporalCoreRuntimeOptions gains worker_heartbeat_duration_millis.
    • temporal_core_worker_replace_client returns error string on failure.
  • Client crate exports:
    • Expose ClientWorker, ClientWorkerSet, HeartbeatCallback, SharedNamespaceWorkerTrait and worker-group key accessors.
  • Tests:
    • Extensive unit/integration tests for heartbeating, metrics, registration conflicts, and client replacement; adapt to API/behavior changes.
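The heartbeat behaviour summarized above (periodic heartbeats at a configured interval, plus one final heartbeat on shutdown) can be sketched with the standard library alone. This is a minimal illustration, not the PR's implementation: `HeartbeatLoop` and `Status` are hypothetical names, the "heartbeats" are only collected in a `Vec`, and a plain thread with `recv_timeout` stands in for whatever the real worker uses.

```rust
use std::sync::mpsc::{self, RecvTimeoutError, Sender};
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Hypothetical stand-in for the worker status carried by a heartbeat.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Running,
    ShuttingDown,
}

/// Handle to a background heartbeat thread (illustrative only).
pub struct HeartbeatLoop {
    shutdown_tx: Sender<()>,
    handle: JoinHandle<Vec<Status>>,
}

impl HeartbeatLoop {
    /// Spawns a thread that records one heartbeat per `interval`.
    pub fn start(interval: Duration) -> Self {
        let (shutdown_tx, shutdown_rx) = mpsc::channel::<()>();
        let handle = thread::spawn(move || {
            let mut sent = Vec::new();
            loop {
                match shutdown_rx.recv_timeout(interval) {
                    // No shutdown signal within the interval: periodic heartbeat.
                    Err(RecvTimeoutError::Timeout) => sent.push(Status::Running),
                    // Shutdown requested or handle dropped: final heartbeat, then stop.
                    Ok(()) | Err(RecvTimeoutError::Disconnected) => {
                        sent.push(Status::ShuttingDown);
                        break;
                    }
                }
            }
            sent
        });
        Self { shutdown_tx, handle }
    }

    /// Signals shutdown and returns every heartbeat the thread recorded.
    pub fn shutdown(self) -> Vec<Status> {
        // A send error would mean the thread already exited; joining still works.
        let _ = self.shutdown_tx.send(());
        self.handle.join().expect("heartbeat thread panicked")
    }
}
```

Waiting on the shutdown channel with a timeout, rather than sleeping, lets the loop react to shutdown immediately while still guaranteeing that the last recorded heartbeat carries the shutdown status.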

Written by Cursor Bugbot for commit c30e25e.

yuandrew and others added 3 commits September 24, 2025 14:24
* worker heartbeat

* Address Spencer's comments

* wip use client_identity_override as part of key, added test

* Refactor almost complete, need to plumb through telemetry to SharedNamespaceWorker

* Verified client replacement works, need to update tests and cleanup

* formatting

* clean up

* forgot to remove new() now that using builder pattern

* Switch to worker_set_key

* Replace client test passes, need to write unit tests in worker_registry

* cargo test-lint

* limit nexus to 1 poller, add tests for worker_registry for heartbeat

* PR comments

* new test helper

* Return error on multi worker register for same namespace and task queue on same client

* cargo fmt

* Fix registration order, unique task queue for test worker

* Remove TEST_Q variable

* Missing quotes

* CI lint and docker test fix, rename worker_set_key to worker_grouping_key

* clippy bug
…eat data (temporalio#1023)

* plumb in memory metrics

* simplify worker::new(), fix some heartbeat metrics, new test file

* CounterImpl, final_heartbeat, more specific metric label dbg_panic msg, counter_with_in_mem and and_then()

* Support in-mem metrics when metrics aren't configured

* Move sys_info refresh to dedicated thread, use tuner's existing sys info

* Format, AtomicCell

* Fix unit test

* Set dynamic config for WorkerHeartbeatsEnabled and ListWorkersEnabled, remove stale metric previously added

* Should not expect heartbeat nexus worker in metrics for non-heartbeating integ test

* recv_timeout instead of thread::sleep, use WorkflowService::list_workers directly, WithLabel API improvement

* MetricAttributes::NoOp, add mechanism to ignore dupe workers for testing, more tests

* More tests, sticky cache miss, plugins

* Formatting, fix skip_client_worker_set_check

* Cursor found a bug

* Lower sleep time, add print for debugging

* more prints

* use semaphores for worker_heartbeat_failure_metrics

* skip_client_worker_set_check for all integ workers

* Can't use tokio semaphore in workflow code

* use signal to test workflow_slots.last_interval_failure_tasks

* Use Notify instead of semaphores, fix test flake

* Use eventually() instead of a manual sleep

* max_outstanding_workflow_tasks 2
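Two of the commit messages above ("plumb in memory metrics" and "Support in-mem metrics when metrics aren't configured") describe counters that record to an in-memory store in addition to any configured exporter. A rough sketch of that idea follows. Every name in it (`Counter`, `ExportedCounter`, `MirroredCounter`) is hypothetical and chosen for illustration; it does not reproduce the `CoreMeter` API.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

/// Hypothetical exporter-facing counter interface.
pub trait Counter: Send + Sync {
    fn add(&self, value: u64);
}

/// Stand-in exporter that simply sums what it receives.
#[derive(Default)]
pub struct ExportedCounter {
    total: AtomicU64,
}

impl ExportedCounter {
    pub fn total(&self) -> u64 {
        self.total.load(Ordering::Relaxed)
    }
}

impl Counter for ExportedCounter {
    fn add(&self, value: u64) {
        self.total.fetch_add(value, Ordering::Relaxed);
    }
}

/// Counter that forwards to an optional exporter and always updates
/// an in-memory mirror that a heartbeat could read later.
pub struct MirroredCounter {
    exporter: Option<Arc<dyn Counter>>,
    in_memory: AtomicU64,
}

impl MirroredCounter {
    /// Records to both the exporter and the in-memory mirror.
    pub fn with_exporter(exporter: Arc<dyn Counter>) -> Self {
        Self { exporter: Some(exporter), in_memory: AtomicU64::new(0) }
    }

    /// Records only in memory, for when no metrics exporter is configured.
    pub fn in_memory_only() -> Self {
        Self { exporter: None, in_memory: AtomicU64::new(0) }
    }

    /// Current value of the in-memory mirror.
    pub fn in_memory_value(&self) -> u64 {
        self.in_memory.load(Ordering::Relaxed)
    }
}

impl Counter for MirroredCounter {
    fn add(&self, value: u64) {
        if let Some(exporter) = &self.exporter {
            exporter.add(value);
        }
        self.in_memory.fetch_add(value, Ordering::Relaxed);
    }
}
```

Keeping the exporter optional means the in-memory value stays available even when no metrics backend is configured, which is the case the second commit message calls out.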
@yuandrew yuandrew requested a review from a team as a code owner October 17, 2025 19:36
# Conflicts:
#	client/src/raw.rs
#	core-c-bridge/src/client.rs
#	core/src/lib.rs
#	core/src/worker/client.rs
#	core/src/worker/mod.rs
#	tests/common/mod.rs
#	tests/integ_tests/polling_tests.rs
Member

@Sushisource Sushisource left a comment


Approving since this was reviewed separately

Member

@Sushisource Sushisource left a comment


Actually, sorry, one thing I want to double check is fixed in this branch before we merge because I just saw it while testing some metrics changes:

I'm seeing:

temporal_long_request_latency_bucket{namespace="default",operation="PollWorkflowTaskQueue",service_name="temporal-core-sdk",task_queue="integ_tester-4ae9d50f7ae94eb5be295f8086003b03",le="2500"} 1

in the metrics output, which I believe is the worker heartbeating task queue name, but it should not be making any poll workflow task calls. I just want to double-check that this is fixed in this branch and that we have a test for it.

Member

@Sushisource Sushisource left a comment


Never mind that. That's the sticky queue name; I just forgot that was the convention it followed.

cursor[bot]

This comment was marked as outdated.

heartbeat_map,
namespace,
cancel,
})

Bug: Heartbeat Failure Due to Error Suppression

The SharedNamespaceWorker::new function can fail, but its error isn't propagated. This leads to a non-functional heartbeat mechanism and an inconsistent ClientWorkerRegistrator state, where heartbeat capability is indicated without a registered callback.
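The failure mode described here, a fallible constructor whose error is dropped so that state claims a capability that was never set up, can be shown with a small sketch. The types below (`SharedWorker`, `Registry`, `HeartbeatSetupError`) are hypothetical and are not the PR's code or its eventual fix; they only illustrate propagating the constructor error with `?` so the registry records heartbeat capability only on success.

```rust
use std::collections::HashMap;
use std::fmt;

/// Hypothetical error returned when heartbeat setup fails.
#[derive(Debug, PartialEq, Eq)]
pub struct HeartbeatSetupError(pub String);

impl fmt::Display for HeartbeatSetupError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "heartbeat setup failed: {}", self.0)
    }
}

impl std::error::Error for HeartbeatSetupError {}

/// Hypothetical shared per-namespace worker whose construction can fail.
pub struct SharedWorker {
    namespace: String,
}

impl SharedWorker {
    pub fn new(namespace: &str) -> Result<Self, HeartbeatSetupError> {
        // An empty namespace stands in for any real setup failure.
        if namespace.is_empty() {
            return Err(HeartbeatSetupError("empty namespace".to_owned()));
        }
        Ok(Self { namespace: namespace.to_owned() })
    }

    pub fn namespace(&self) -> &str {
        &self.namespace
    }
}

/// Hypothetical registry tracking which namespaces have heartbeating enabled.
#[derive(Default)]
pub struct Registry {
    shared: HashMap<String, SharedWorker>,
}

impl Registry {
    /// Enables heartbeating for `namespace`. The `?` propagates a constructor
    /// failure, so nothing is recorded unless the shared worker exists.
    pub fn enable_heartbeat(&mut self, namespace: &str) -> Result<(), HeartbeatSetupError> {
        if !self.shared.contains_key(namespace) {
            let worker = SharedWorker::new(namespace)?;
            self.shared.insert(worker.namespace().to_owned(), worker);
        }
        Ok(())
    }

    /// True only when a shared worker was actually created for `namespace`.
    pub fn heartbeat_enabled(&self, namespace: &str) -> bool {
        self.shared.contains_key(namespace)
    }
}
```

Because the capability check is derived from the stored worker instead of a separate flag, the two cannot disagree after a failed setup.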

Additional Locations (1)


@yuandrew yuandrew merged commit 9e9a461 into temporalio:master Oct 20, 2025
31 of 33 checks passed
@yuandrew yuandrew deleted the worker-heartbeat branch October 20, 2025 18:45