Skip to content

Conversation

@cretz
Copy link
Member

@cretz cretz commented Aug 5, 2025

What was changed

The main issue is there was a case where, in high-contention situations, a logger's mutex will block for a short little bit. Since we make mutexes durable, we completed the workflow task while the mutex was waiting and never resume the workflow to wake it up. We've made a couple of changes to help here.

Code to replicate
# frozen_string_literal: true

require 'logger'
require 'temporalio/client'
require 'temporalio/testing'
require 'temporalio/worker'

workflow_count = 500

logger = Logger.new($stdout)

class HangingTimersWorkflow < Temporalio::Workflow::Definition
  workflow_query_attr_reader :done

  def execute
    Temporalio::Workflow.logger.info 'Started demo timers workflow'
    5.times.each do |i|
      Temporalio::Workflow.logger.info("Timer fut #{i}")
      timer_fut(seconds: 1).wait
    end
    Temporalio::Workflow.logger.info 'Finished demo timers workflow'
  end

  def timer_fut(seconds:)
    Temporalio::Workflow::Future.new do
      Temporalio::Workflow.logger.info 'Timer starting'
      Temporalio::Workflow.sleep(seconds)
      Temporalio::Workflow.logger.info 'Timer finished'
    end.tap do
      # Temporalio::Workflow.logger.info "Timer created"
    end
  end
end

logger.info('Starting dev server...')
Temporalio::Testing::WorkflowEnvironment.start_local(logger:, ui: true, ui_port: 8233) do |env|
  # Start them all
  logger.info("Starting #{workflow_count} workflows...")
  handles = workflow_count.times.map do |i|
    env.client.start_workflow(HangingTimersWorkflow, id: "wf-#{i}", task_queue: 'default')
  end

  # Run worker
  logger.info('Starting worker...')
  worker = Temporalio::Worker.new(client: env.client, task_queue: 'default', workflows: [HangingTimersWorkflow])
  worker.run(shutdown_signals: ['SIGINT']) do
    # Every couple of seconds, let's describe all handles until they are all done
    logger.info('Now waiting')
    loop do
      logger.info('Tick')
      sleep(2)
      running_count = handles.count { |h| h.describe.close_time.nil? }
      logger.info("Of the original #{workflow_count}, #{running_count} still running")
      break if running_count.zero?
    end
  end
ensure
  logger.info('Stopping, press Enter to exit...')
  STDIN.gets
end

Changes:

  • Use system/default scheduler for logging calls (this is the primary fix for the bug)
  • Make durable_scheduler_disabled imply illegal_call_tracing_disabled and update usages accordingly
  • Add note in README for workflow logic constraints saying that technically puts or non-workflow logger use at high-contention can cause hung workflows
  • Add big sections to README to explain durable scheduler and illegal call tracing and how to work around them
  • Updated JSON converter to skip scheduler during conversion
    • This was actually unrelated to this issue, but as of a recent change at Fix duplicated key warning location ruby/json#832, the what-we-consider-illegal File.expand_path is now invoked at runtime during deprecation warning. So we decided that JSON conversion and any future stdlib changes inside it can skip the illegal call checks

@cretz cretz requested a review from a team as a code owner August 5, 2025 19:21
Copy link
Member

@Sushisource Sushisource left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good find!

Comment on lines 66 to 72
prev = Fiber.current_scheduler
illegal_call_tracing_disabled { Fiber.set_scheduler(nil) }
begin
# Imply illegal call tracing disabled
illegal_call_tracing_disabled do
Fiber.set_scheduler(nil)
yield
ensure
illegal_call_tracing_disabled { Fiber.set_scheduler(prev) }
Fiber.set_scheduler(prev)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did any semantics actually change here or is this just different syntax?

Copy link
Member Author

@cretz cretz Aug 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this changes durable_scheduler_disabled to now perform the user block (i.e. the yield) inside of illegal_call_tracing_disabled instead of before where it was only the fiber scheduler get/set that was in there. It was decided to make durable_scheduler_disabled imply illegal_call_tracing_disabled because if you use the former, you almost always want the latter.

@cretz
Copy link
Member Author

cretz commented Aug 6, 2025

Had to update our JSON converter to skip scheduler when being used, see latest bullet added to description. Feel free to re-review, will merge early next week. There is going to be a soon followup with #310 anyways.

@Sushisource
Copy link
Member

Had to update our JSON converter to skip scheduler when being used, see latest bullet added to description. Feel free to re-review, will merge early next week. There is going to be a soon followup with #310 anyways.

Makes sense to me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants