Skip to content

[SPARK-51880][ML][PYTHON][CONNECT] Avoid eager model removal in meta algorithms when collectSubModel is true #50682

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 6 commits into from

Conversation

zhengruifeng
Copy link
Contributor

@zhengruifeng zhengruifeng commented Apr 23, 2025

What changes were proposed in this pull request?

Avoid eager model removal in meta algorithms when collectSubModel is true

No matter on classic mode or connect mode, no matter collectSubModel is true or false, __del__ of models and estimators are always invoked in _parallelFitTasks.

There seems to be an internal copy, I add log in __del__ to print the address of objects id(self)/id(self._java_obj) to be deleted and find that the ids are different from these from final model.subModels.

That is to say, internal copying and removal of model/estimator happen in _parallelFitTasks.

metric = eva.evaluate(model.transform(validation, epm[index]))

trigger

if isinstance(params, dict):
if params:
return self.copy(params)._transform(dataset)
else:
return self._transform(dataset)

so in this evaluate:
1, copy the model (shares the same ref_id)
2, transform and compute the metric;
3, delete the copied model

Why are the changes needed?

for feature parity

Does this PR introduce any user-facing change?

yes, bug-fix

How was this patch tested?

enabled tests

Was this patch authored or co-authored using generative AI tooling?

no

@zhengruifeng zhengruifeng marked this pull request as draft April 23, 2025 14:56
@zhengruifeng zhengruifeng changed the title [SPARK-51880][ML][PYTHON][CONNECT] Avoid eager model removal in meta algorithms when collectSubModel is true [WIP][SPARK-51880][ML][PYTHON][CONNECT] Avoid eager model removal in meta algorithms when collectSubModel is true Apr 23, 2025
@WeichenXu123
Copy link
Contributor

WeichenXu123 commented Apr 25, 2025

I suggest to use reference count in client side to make a thorough fix, pseudo code:

class RefId:
   ref_count = 1
   
   def add_ref(self):
      self.ref_count += 1
   
   def release(self):
      self.ref_count -= 1
      if self.ref_count == 0:
          # request server to delete model 
          ...

class Model:
   def __del__(self):
      self.ref_id.release()

   def copy(self):
      self.ref_id.add_ref()
      return new Model(self.ref_id)

Comment on lines 117 to 121
if collectSubModel and is_remote():
# In remote mode, we need to explicitly disable the __del__ which
# seems to be triggered inside this thread.
tl = train._session.client.thread_local # type: ignore[union-attr, operator]
tl.disable_ml_del = True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is only a fix for CrossValidator.

If user uses model.copy , it also triggers the error .

@zhengruifeng zhengruifeng changed the title [WIP][SPARK-51880][ML][PYTHON][CONNECT] Avoid eager model removal in meta algorithms when collectSubModel is true [SPARK-51880][ML][PYTHON][CONNECT] Avoid eager model removal in meta algorithms when collectSubModel is true Apr 25, 2025
@zhengruifeng zhengruifeng marked this pull request as ready for review April 25, 2025 03:16
@zhengruifeng zhengruifeng marked this pull request as draft April 25, 2025 04:53
@WeichenXu123
Copy link
Contributor

I filed #50707

@zhengruifeng
Copy link
Contributor Author

close in favor of #50707

@zhengruifeng zhengruifeng deleted the test_propagate branch April 25, 2025 06:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants