[CherryPick][Core] Bumping up task failure logs to warnings to make sure failures could be traced in Ray Core logs #43147

Merged
aslonnie merged 1 commit into ray-project:releases/2.9.3 from alexeykudinkin:ak/cp-flr-log-fix
Feb 14, 2024

Conversation

@alexeykudinkin
Contributor

…troubleshooted

Why are these changes needed?

NOTE: This is a cherry-pick of #43111 for 2.9.3

Currently, we observe a lot of failures like the following in our production deployment:

  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/serve/handle.py", line 781, in __anext__
    return await next_obj_ref
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.

However, we can't find any logs in Ray Core corresponding to this failure. After checking around, I realized that all of the relevant log statements are DEBUG logs, so tracing a failure requires switching to DEBUG mode, which would drown our logging infra.

Hence this change bumps the failure logs to at least WARNING so that any failure is traceable in Ray Core logs.
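
To illustrate why a DEBUG-level failure message never reaches the logs at a typical default threshold, here is a minimal sketch of severity-threshold filtering. The `Severity` enum and `ThresholdLogger` class are hypothetical names made up for this illustration; this is not Ray's logging implementation, only the general idea behind the change.

```cpp
#include <sstream>
#include <string>

// Hypothetical severity levels, ordered from least to most severe.
enum class Severity { kDebug = 0, kInfo = 1, kWarning = 2, kError = 3 };

// Hypothetical logger that keeps a message only if its severity is at or
// above the configured threshold.
class ThresholdLogger {
 public:
  explicit ThresholdLogger(Severity threshold) : threshold_(threshold) {}

  // Returns true if the message was written to the sink, false if dropped.
  bool Log(Severity severity, const std::string &message) {
    if (severity < threshold_) {
      return false;  // Below the threshold: silently dropped.
    }
    sink_ << message << '\n';
    return true;
  }

  // Everything that was actually written so far.
  std::string Output() const { return sink_.str(); }

 private:
  Severity threshold_;
  std::ostringstream sink_;
};
```

With the threshold at `Severity::kInfo`, a failure logged at `kDebug` is dropped, while the same message logged at `kWarning` is kept without lowering the threshold for every other DEBUG statement.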

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@rynewang
Contributor

@zhe-thoughts Could you review this cherry pick and approve it? Thanks

@rynewang
Contributor

This does not build. I noticed there's a difference between this and 149e400 in direct_task_transport.cc. Please fix this, @alexeykudinkin.

src/ray/core_worker/transport/direct_task_transport.cc: In member function 'void ray::core::CoreWorkerDirectTaskSubmitter::HandleGetTaskFailureCause(const ray::Status&, bool, const ray::TaskID&, const ray::rpc::WorkerAddress&, const ray::Status&, const ray::rpc::GetTaskFailureCauseReply&)':
2024-02-13 14:45:04 PST | src/ray/core_worker/transport/direct_task_transport.cc:715:56: error: no match for call to '(const ray::NodeID) ()'
2024-02-13 14:45:04 PST |   715 |                      << " node id: " << addr.raylet_id() << " ip: " << addr.ip_address();
2024-02-13 14:45:04 PST |       |                                                        ^
2024-02-13 14:45:04 PST | src/ray/core_worker/transport/direct_task_transport.cc:715:88: error: no match for call to '(const string {aka const std::basic_string<char>}) ()'
2024-02-13 14:45:04 PST |   715 |                      << " node id: " << addr.raylet_id() << " ip: " << addr.ip_address();
2024-02-13 14:45:04 PST |       |
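
The "no match for call to '(const ray::NodeID) ()'" message means that `raylet_id` is an object being invoked like a function. A minimal sketch of that situation follows, assuming the address type on the release branch exposes `raylet_id` and `ip_address` as plain data members. `WorkerAddressSketch` and `DescribeFailure` are hypothetical stand-ins for illustration only, and this is not necessarily the fix that was applied in the PR.

```cpp
#include <sstream>
#include <string>

// Hypothetical stand-in for the address type: the fields are plain data
// members, not accessor methods. std::string stands in for ray::NodeID.
struct WorkerAddressSketch {
  std::string raylet_id;
  std::string ip_address;
};

// Builds the same kind of log suffix as the failing line in the build output.
std::string DescribeFailure(const WorkerAddressSketch &addr) {
  std::ostringstream os;
  // Writing addr.raylet_id() or addr.ip_address() here would not compile,
  // because both are data members rather than functions. Accessing them
  // without parentheses compiles.
  os << " node id: " << addr.raylet_id << " ip: " << addr.ip_address;
  return os.str();
}
```

Under that assumption, dropping the parentheses on both members is enough for the log statement to compile against the older struct layout.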

…troubleshooted

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
@alexeykudinkin alexeykudinkin changed the base branch from releases/2.9.2 to releases/2.9.3 February 13, 2024 23:37
@rynewang
Contributor

LGTM, will pull after premerge done

@alexeykudinkin alexeykudinkin changed the title from "[Cherry-pick][Core] Bumping up task failure logs to warnings to make sure failures could be traced in Ray Core logs" to "[CherryPick][Core] Bumping up task failure logs to warnings to make sure failures could be traced in Ray Core logs" Feb 14, 2024
@aslonnie aslonnie merged commit 2613d7d into ray-project:releases/2.9.3 Feb 14, 2024

4 participants