Skip to content

[Serve] fix _to_object_ref memory leak#43763

Merged
edoakes merged 23 commits intoray-project:masterfrom
GeneDer:fix-to-object-ref-memory-leak
Mar 28, 2024
Merged

[Serve] fix _to_object_ref memory leak#43763
edoakes merged 23 commits intoray-project:masterfrom
GeneDer:fix-to-object-ref-memory-leak

Conversation

@GeneDer
Copy link
Member

@GeneDer GeneDer commented Mar 6, 2024

Why are these changes needed?

_PyObjScanner/ CloudPickler is designed to pin object refs into memory so they don't escape the scope of existing Ray session. This PR allows ObjectRef and ObjectRefGenerator to also escape the Ray memory pin so they won't cause memory leaks.

Related issue number

Closes #43248

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

GeneDer added 19 commits March 6, 2024 15:11
Signed-off-by: Gene Su <e870252314@gmail.com>
Signed-off-by: Gene Su <e870252314@gmail.com>
Signed-off-by: Gene Su <e870252314@gmail.com>
Signed-off-by: Gene Su <e870252314@gmail.com>
Signed-off-by: Gene Su <e870252314@gmail.com>
Signed-off-by: Gene Su <e870252314@gmail.com>
Signed-off-by: Gene Su <e870252314@gmail.com>
Signed-off-by: Gene Su <e870252314@gmail.com>
Signed-off-by: Gene Su <e870252314@gmail.com>
Signed-off-by: Gene Su <e870252314@gmail.com>
Signed-off-by: Gene Su <e870252314@gmail.com>
Signed-off-by: Gene Su <e870252314@gmail.com>
Signed-off-by: Gene Su <e870252314@gmail.com>
Signed-off-by: Gene Su <e870252314@gmail.com>
Signed-off-by: Gene Su <e870252314@gmail.com>
Signed-off-by: Gene Su <e870252314@gmail.com>
Signed-off-by: Gene Su <e870252314@gmail.com>
@GeneDer
Copy link
Member Author

GeneDer commented Mar 27, 2024

Also sanity checked on workspace and seeing the fixed worked 😄

After the fix
image

Before the fix
image

@GeneDer GeneDer marked this pull request as ready for review March 27, 2024 20:47
@GeneDer GeneDer requested a review from a team March 27, 2024 20:47
@shrekris-anyscale
Copy link
Contributor

Nice work tracking this down! What was the root cause of the leak? Why were ObjectRefs getting pinned by the CloudPickler in the first place?

Signed-off-by: Gene Su <e870252314@gmail.com>
@shrekris-anyscale
Copy link
Contributor

Ah never mind, I took a closer look at the initial issue. Looks like the workload involved explicitly passing ObjectRefs around.

@GeneDer
Copy link
Member Author

GeneDer commented Mar 27, 2024

Credit to @edoakes! Couldn't have found the root cause without pairing 😄

Co-authored-by: shrekris-anyscale <92341594+shrekris-anyscale@users.noreply.github.com>
Signed-off-by: Gene Der Su <gdsu@ucdavis.edu>
Co-authored-by: shrekris-anyscale <92341594+shrekris-anyscale@users.noreply.github.com>
Signed-off-by: Gene Der Su <gdsu@ucdavis.edu>
@edoakes
Copy link
Collaborator

edoakes commented Mar 27, 2024

Nice work tracking this down! What was the root cause of the leak? Why were ObjectRefs getting pinned by the CloudPickler in the first place?

It's because when we actually cloudpickle object refs, we want to pin them (because we don't know their lifecycle anymore). The _PyObjScanner uses the cloudpickler so as a side effect it was also pinning 🤦. We avoid this by not calling into the CloudPickler for object refs by adding it to the target type.

Signed-off-by: Gene Su <e870252314@gmail.com>
@GeneDer
Copy link
Member Author

GeneDer commented Mar 28, 2024

@edoakes this is ready for merge

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Serve] Memory leak when using Ray Serve

3 participants