Skip to content

Commit 21a3f31

Browse files
authored
Fix issue with high-contention mutex causing hung workflows (#309)
1 parent f3b34b4 commit 21a3f31

File tree

8 files changed

+97
-25
lines changed

8 files changed

+97
-25
lines changed

README.md

Lines changed: 48 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -61,6 +61,9 @@ opinions. Please communicate with us on [Slack](https://t.mp/slack) in the `#rub
6161
- [Manual Time Skipping](#manual-time-skipping)
6262
- [Mocking Activities](#mocking-activities)
6363
- [Workflow Replay](#workflow-replay)
64+
- [Advanced Workflow Safety and Escaping](#advanced-workflow-safety-and-escaping)
65+
- [Durable Fiber Scheduler](#durable-fiber-scheduler)
66+
- [Illegal Call Tracing](#illegal-call-tracing)
6467
- [Activities](#activities)
6568
- [Activity Definition](#activity-definition)
6669
- [Activity Context](#activity-context)
@@ -716,6 +719,11 @@ Ruby workflows. This means there are several things workflows cannot do such as:
716719
* Make any random calls
717720
* Make any not-guaranteed-deterministic calls
718721

722+
This means you can't even call `puts` or logger calls outside of `Temporalio::Workflow.logger` because they use mutexes
723+
which may be hit during periods of high-contention, but they are not completely disabled since users may do quick
724+
debugging with them. See the [Advanced Workflow Safety and Escaping](#advanced-workflow-safety-and-escaping) section if
725+
needing to work around this.
726+
719727
#### Workflow Testing
720728

721729
Workflow testing can be done in an integration-test fashion against a real server. However, it is hard to simulate
@@ -918,6 +926,46 @@ end
918926

919927
See the `WorkflowReplayer` API documentation for more details.
920928

929+
#### Advanced Workflow Safety and Escaping
930+
931+
Workflows use a custom fiber scheduler to make things like certain blocking calls and timeouts durable. There is also
932+
call tracing to prevent accidentally making illegal workflow calls. But sometimes in advanced situations, workarounds
933+
may be needed. This section describes advanced situations working with the workflow Fiber scheduler and illegal call
934+
tracer.
935+
936+
##### Durable Fiber Scheduler
937+
938+
The custom fiber scheduler that powers workflows makes otherwise-local, blocking things durable. This is why `sleep` and
939+
`Timeout.timeout` and `Queue` and other things work durably. However, there are cases where it may be desired for these
940+
to work locally inside a workflow such as for logging or `puts` or other side-effecting, known-non-deterministic
941+
aspects.
942+
943+
Users can pass a block to `Temporalio::Workflow::Unsafe.durable_scheduler_disabled` to not use the durable scheduler.
944+
This should be used any time the scheduler needs to be bypassed, e.g. for local stdout. Not doing this can cause
945+
workflows to get hung in high contention situations. For instance, if there is a `puts` or a logger (that isn't the
946+
safe-to-use `Temporalio::Workflow.logger`) in a workflow, _technically_ Ruby surrounds the IO writes with a mutex and
947+
in extreme high contention that mutex may durably block and then the workflow task may complete causing hung workflows
948+
because no event comes to wake the mutex.
949+
950+
Also, by default anything that relies on IO wait that is not inside `durable_scheduler_disabled` will fail. It is
951+
recommended to put things that need this in `durable_scheduler_disabled`, but if the durable scheduler is still needed
952+
but IO wait is also needed, then a block passed to `Temporalio::Workflow::Unsafe.io_enabled` can be used.
953+
954+
Note `durable_scheduler_disabled` implies `illegal_call_tracing_disabled` (see next section). Many use of
955+
`durable_scheduler_disabled`, such as for tracing or logging, often surround themselves in a
956+
`unless Temporalio::Workflow.replaying?` block to make sure they don't duplicate the side effects on replay.
957+
958+
##### Illegal Call Tracing
959+
960+
Ruby workflow threads employ a `TracePoint` to catch illegal calls such as `Time.now` or `Thread.new`. The set of
961+
illegal calls can be configured via the `illegal_workflow_calls` parameter when creating a worker. The default set is at
962+
`Temporalio::Worker.default_illegal_workflow_calls`.
963+
964+
When an illegal call is encountered, an exception is thrown. In advanced cases there may be a need to allow an illegal
965+
call that is known to be used deterministically. This code can be in a block passed to
966+
`Temporalio::Workflow::Unsafe.illegal_call_tracing_disabled`. If this has side-effecting behavior that needs to use the
967+
non-durable scheduler, use `durable_scheduler_disabled` instead (which implies this, see previous section).
968+
921969
### Activities
922970

923971
#### Activity Definition

temporalio/lib/temporalio/contrib/open_telemetry.rb

Lines changed: 9 additions & 13 deletions
Original file line numberDiff line numberDiff line change
@@ -453,19 +453,15 @@ def self.completed_span(
453453
'temporalRunID' => Temporalio::Workflow.info.run_id }.merge(attributes)
454454

455455
time = Temporalio::Workflow.now.dup
456-
# Disable illegal call tracing because OTel asks for full exception message which uses error highlighting and
457-
# such which accesses File#path, and they also use loggers accessing current time
458-
Temporalio::Workflow::Unsafe.illegal_call_tracing_disabled do
459-
# Disable durable scheduler because 1) synchronous/non-batch span processors in OTel use network (though
460-
# could have just used Unafe.io_enabled for this if not for the next point) and 2) OTel uses Ruby Timeout
461-
# which we don't want to use durable timers.
462-
Temporalio::Workflow::Unsafe.durable_scheduler_disabled do
463-
span = root.tracer.start_span(name, attributes:, links:, start_timestamp: time, kind:) # steep:ignore
464-
# Record exception if present
465-
span.record_exception(exception) if exception
466-
# Finish the span (returns self)
467-
span.finish(end_timestamp: time)
468-
end
456+
# Disable durable scheduler because 1) synchronous/non-batch span processors in OTel use network (though could
457+
# have just used Unafe.io_enabled for this if not for the next point) and 2) OTel uses Ruby Timeout which we
458+
# don't want to use durable timers.
459+
Temporalio::Workflow::Unsafe.durable_scheduler_disabled do
460+
span = root.tracer.start_span(name, attributes:, links:, start_timestamp: time, kind:) # steep:ignore
461+
# Record exception if present
462+
span.record_exception(exception) if exception
463+
# Finish the span (returns self)
464+
span.finish(end_timestamp: time)
469465
end
470466
end
471467
end

temporalio/lib/temporalio/converters/payload_converter/json_plain.rb

Lines changed: 22 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -3,6 +3,7 @@
33
require 'json'
44
require 'temporalio/api'
55
require 'temporalio/converters/payload_converter/encoding'
6+
require 'temporalio/workflow'
67

78
module Temporalio
89
module Converters
@@ -28,15 +29,31 @@ def encoding
2829

2930
# (see Encoding.to_payload)
3031
def to_payload(value, hint: nil) # rubocop:disable Lint/UnusedMethodArgument
31-
Api::Common::V1::Payload.new(
32-
metadata: { 'encoding' => ENCODING },
33-
data: JSON.generate(value, @generate_options).b
34-
)
32+
# For generate and parse, if we are in a workflow, we need to do this outside of the durable scheduler since
33+
# some things like the recent https://github.com/ruby/json/pull/832 may make illegal File.expand_path calls.
34+
# And other future things may be slightly illegal in JSON generate/parse and we don't want to break everyone
35+
# when it happens.
36+
data = if Temporalio::Workflow.in_workflow?
37+
Temporalio::Workflow::Unsafe.durable_scheduler_disabled do
38+
JSON.generate(value, @generate_options).b
39+
end
40+
else
41+
JSON.generate(value, @generate_options).b
42+
end
43+
44+
Api::Common::V1::Payload.new(metadata: { 'encoding' => ENCODING }, data:)
3545
end
3646

3747
# (see Encoding.from_payload)
3848
def from_payload(payload, hint: nil) # rubocop:disable Lint/UnusedMethodArgument
39-
JSON.parse(payload.data, @parse_options)
49+
# See comment in to_payload about why we have to do something different in workflow
50+
if Temporalio::Workflow.in_workflow?
51+
Temporalio::Workflow::Unsafe.durable_scheduler_disabled do
52+
JSON.parse(payload.data, @parse_options)
53+
end
54+
else
55+
JSON.parse(payload.data, @parse_options)
56+
end
4057
end
4158
end
4259
end

temporalio/lib/temporalio/internal/worker/workflow_instance/context.rb

Lines changed: 5 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -63,12 +63,13 @@ def deprecate_patch(patch_id)
6363
end
6464

6565
def durable_scheduler_disabled(&)
66-
prev = Fiber.current_scheduler
67-
illegal_call_tracing_disabled { Fiber.set_scheduler(nil) }
68-
begin
66+
prev = Fiber.scheduler
67+
# Imply illegal call tracing disabled
68+
illegal_call_tracing_disabled do
69+
Fiber.set_scheduler(nil)
6970
yield
7071
ensure
71-
illegal_call_tracing_disabled { Fiber.set_scheduler(prev) }
72+
Fiber.set_scheduler(prev)
7273
end
7374
end
7475

temporalio/lib/temporalio/internal/worker/workflow_instance/replay_safe_logger.rb

Lines changed: 5 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -27,8 +27,11 @@ def add(...)
2727
return true
2828
end
2929

30-
# Disable illegal call tracing for the log call
31-
@instance.illegal_call_tracing_disabled { super }
30+
# Disable scheduler since logs technically have local mutexes in them that cannot be done durably or they
31+
# will block workflows
32+
@instance.context.durable_scheduler_disabled do
33+
super
34+
end
3235
end
3336
end
3437
end

temporalio/lib/temporalio/internal/worker/workflow_instance/scheduler.rb

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -141,7 +141,8 @@ def io_wait(io, events, timeout)
141141
unless @instance.io_enabled
142142
raise Workflow::NondeterminismError,
143143
'Cannot perform IO from inside a workflow. If this is known to be safe, ' \
144-
'the code can be run in a Temporalio::Workflow::Unsafe.io_enabled block.'
144+
'the code can be run in a Temporalio::Workflow::Unsafe.durable_scheduler_disabled ' \
145+
'or Temporalio::Workflow::Unsafe.io_enabled block.'
145146
end
146147

147148
# Use regular Ruby behavior of blocking this thread. There is no Ruby implementation of io_wait we can just

temporalio/lib/temporalio/workflow.rb

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -555,6 +555,8 @@ def self.io_enabled(&)
555555
# fiber scheduler and no workflow helpers will be available in the block. This is usually only needed in advanced
556556
# situations where a third party library does something like use "Timeout" in a way that shouldn't be made
557557
# durable.
558+
#
559+
# This implies {illegal_call_tracing_disabled}.
558560
def self.durable_scheduler_disabled(&)
559561
Workflow._current.durable_scheduler_disabled(&)
560562
end

temporalio/test/worker_workflow_test.rb

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -571,6 +571,10 @@ def execute
571571
def test_deadlock
572572
# TODO(cretz): Do we need more tests? This attempts to interrupt the workflow via a raise on the thread, but do we
573573
# need to concern ourselves with what happens if that's accidentally swallowed?
574+
# TODO(cretz): Note that often mutexes and cond vars are not subject to Timeout.timeout which means they can not be
575+
# interrupted by deadlock detection
576+
# TODO(cretz): Note that a thread.join that does get deadlock detected may crash the VM on exit with a
577+
# "[BUG] pthread_mutex_lock: Invalid argument (EINVAL)"
574578
# TODO(cretz): Decrease deadlock detection timeout to make test faster? It is 4s now because shutdown waits on
575579
# second task.
576580
execute_workflow(DeadlockWorkflow) do |handle|

0 commit comments

Comments
 (0)