
[Feature Request] Expose SDK metric for worker._count_not_evict_count #875

Open
millerick opened this issue May 20, 2025 · 3 comments
Labels
enhancement New feature or request

Comments

@millerick

Is your feature request related to a problem? Please describe.

We have found that periodically (for reasons we still need to root cause) our workers run into a series of Failed running eviction job for run ID 0196d798-a08b-7a00-9082-353865f449b4, continually retrying eviction. Since eviction could not be processed, this worker may not complete and the slot may remain forever used unless it eventually completes. errors. Then, hours later, when the pod containing the worker is terminated, we see this log: Shutting down workflow worker, but 46 workflow(s) could not be evicted previously, so the shutdown may hang. For this particular worker we run 50 concurrent workflows, which, if I interpret things correctly, means that for several hours the worker was in an infinite loop trying to evict 46 workflows and was only able to process 4 workflow tasks at a time.

We would like to be able to detect and alert on these situations more proactively. Usually we end up finding out about them because the worker set scales up to the maximum number of replicas for an extended period of time.

Describe the solution you'd like

Since the code already keeps track of when it is stuck in its own infinite loop trying to process the eviction, I think it would be useful to expose that information as a metric so that alerting tools can fire when a pod has been in that state for longer than whatever the team monitoring the metric considers "too long".
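
For illustration, here is a rough sketch of the kind of exposure we have in mind, assuming the internal counter named in the issue title (_count_not_evict_count) can be read off the worker object. That attribute is a private SDK detail which may be absent or renamed between releases, and the prometheus_client gauge name here is made up.

```python
# Rough sketch (not a supported API): mirror the worker's internal
# "could not evict" counter into a Prometheus gauge so alerting tools
# can watch it. The attribute name comes from this issue's title and is
# a private SDK detail; it may be absent or renamed in other releases.
import asyncio

from prometheus_client import Gauge

EVICTION_RETRY_GAUGE = Gauge(
    "temporal_worker_could_not_evict",  # hypothetical metric name
    "Workflow runs the worker has repeatedly failed to evict",
)


async def mirror_eviction_counter(worker, interval_seconds: float = 30.0) -> None:
    """Periodically copy the private counter (if present) into the gauge."""
    while True:
        count = getattr(worker, "_count_not_evict_count", None)
        if count is not None:
            EVICTION_RETRY_GAUGE.set(count)
        await asyncio.sleep(interval_seconds)
```

An alert could then fire whenever the gauge stays above zero for longer than whatever threshold the team chooses.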

Additional context

If the team is bold enough, it could also be nice to do one or more of the following:

  1. Provide a setting that forces the worker to shut down if it has been in an eviction loop for too long.
  2. Provide more threads than max_concurrent_workflow_tasks so that the ability to process workflows isn't as likely to be impeded by the infinite eviction loop.
@millerick millerick added the enhancement label on May 20, 2025
@cretz
Member

cretz commented May 27, 2025

for reasons that we still need to root cause

Hrmm, this is a fairly advanced/rare situation and should only occur if something is written improperly or something else unexpected is happening. We do prefer to expose these rare situations as logs instead of metrics. All SDK metrics are Core/Rust-based and we have no pure Python or Python-SDK-specific metrics.

Is it possible to use logs as the exposure mechanism? There are lots of other rare places where problems can occur that we log and do not treat as metrics or worker-shutdown-able situations. I will confer with the team on their thoughts here for this specific situation.
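
As an illustration of the log-based approach, here is a minimal sketch that counts the "Failed running eviction job" message quoted above, assuming the Python SDK emits it through the standard logging module under the "temporalio" logger namespace (the logger name and exact message text are assumptions and may vary by version):

```python
# Minimal sketch of log-based detection: count eviction-failure log
# records so an alerting pipeline can act when the count keeps growing.
import logging


class EvictionFailureCounter(logging.Handler):
    """Counts log records that indicate a repeated eviction failure."""

    def __init__(self) -> None:
        super().__init__()
        self.count = 0

    def emit(self, record: logging.LogRecord) -> None:
        if "Failed running eviction job" in record.getMessage():
            self.count += 1


eviction_counter = EvictionFailureCounter()
logging.getLogger("temporalio").addHandler(eviction_counter)

# eviction_counter.count can then be polled, scraped, or pushed so an
# alert fires when the worker has been retrying evictions for "too long".
```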

I think you may be able to use temporal_worker_task_slots_available and temporal_worker_task_slots_used to determine if slots are never returned. Only kinda though, because metrics do not give the information that logs do.

@millerick
Author

I think you may be able to use temporal_worker_task_slots_available and temporal_worker_task_slots_used to determine if slots are never returned. Only kinda though, because metrics do not give the information that logs do.

Yes, using the temporal_worker_task_slots_used metric is less direct, but it may be usable for this purpose. We have yet to run into a situation where all of the worker task slots are reported as used for legitimate reasons.
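
For reference, a minimal sketch of exposing the Core SDK metrics (including temporal_worker_task_slots_used and temporal_worker_task_slots_available) over a Prometheus scrape endpoint; the bind address and client target below are placeholders:

```python
# Minimal sketch: surface the Core SDK metrics over a Prometheus endpoint
# so an external rule can alert when all workflow task slots stay used.
from temporalio.client import Client
from temporalio.runtime import PrometheusConfig, Runtime, TelemetryConfig


async def connect_with_metrics() -> Client:
    runtime = Runtime(
        telemetry=TelemetryConfig(
            metrics=PrometheusConfig(bind_address="0.0.0.0:9464")  # placeholder
        )
    )
    return await Client.connect("localhost:7233", runtime=runtime)  # placeholder target
```

An alert that fires when temporal_worker_task_slots_used stays at the configured maximum (50 in the case described above) for an extended window would approximate the detection this issue asks for.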

@cretz
Member

cretz commented May 27, 2025

👍 After conferring, this may be a useful metric, but supporting language-specific metrics like this in our Core-based SDKs may be a bit of a heavy lift. We will leave the issue open, but adding this metric may not be a priority for us at this time.
