
[enhancement] Enable Array API in ensemble algos#2201

Merged
icfaust merged 215 commits into uxlfoundation:main from icfaust:dev/new_RF
Dec 8, 2025

Conversation

@icfaust
Contributor

@icfaust icfaust commented Dec 2, 2024

Description

This PR refactors the Ensemble algorithms (RandomForestRegressor, RandomForestClassifier, ExtraTreesRegressor and ExtraTreesClassifier) to follow repository standards and add array API support. This reduced the code by 500+ lines and required the following changes:

  • Remove BaseEstimator inheritance from onedal ensemble estimators
  • Change estimator __init__ signatures to remove sklearn conformant kwargs in onedal ensemble estimators
  • Add inline code comments explaining the function of various parts, for future maintenance
  • Remove random_state use from onedal estimators
  • Add class_count kwarg to fit, since calculating it in Python is a scikit-learn conformance step (oneDAL expects it a priori)
  • Remove input parameter checks from the onedal estimators
  • Generalize return of out-of-bag values from oneDAL for use by Classifiers and Regressors
  • Remove unused _create_model function
  • Centralize predict method
  • Create ForestRegressor and ForestClassifier objects to minimize maintenance
  • Swap from max_samples to observations_per_tree_fraction to follow oneDAL values
  • Modify tests for onedal to use numpy arrays (which can be consumed, where lists cannot)
  • Reorder warnings and errors based on type (e.g. parameter checks vs input checks etc.)
  • Refactor _save_attributes method to be specific to Classifiers vs Regressors
  • Refactor _onedal_fit_ready, _onedal_cpu_supported and _onedal_gpu_supported to reduce code duplication via inheritance and make array API enabled
  • Add enable_array_api decorators to public-facing estimators
  • Place _check_parameters function behind sklearn_check_version for future removal
  • Remove check for min_impurity_split, which was removed in sklearn 1.0 (the release originally planned as 0.25)
  • Add array API-enabled _validate_y_class_weight method designed specifically for sklearnex estimators (missing some functionality which is irrelevant to the sklearnex estimator)
  • Remove check_n_features from sklearnex.utils.validation as it is no longer necessary
  • Enable weighted fitting support for gpu
  • Remove sample_weight checks for sparsity (blocked by _check_sample_weight)
  • Add documentation on the nature of set attributes and array API support
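
Two of the bullets above (passing class_count to fit, and swapping max_samples for observations_per_tree_fraction) describe scikit-learn conformance logic that moves out of the onedal estimators. A minimal, dependency-free sketch of what such a translation could look like is shown below. The helper names `to_observations_fraction` and `count_classes` are hypothetical and not taken from the PR; the int/float handling follows scikit-learn's documented `max_samples` semantics, and the actual sklearnex code may differ.

```python
import numbers


def to_observations_fraction(max_samples, n_samples):
    """Hypothetical helper: map sklearn-style max_samples to a fraction in (0, 1].

    None  -> use every sample (fraction 1.0)
    int   -> that many samples out of n_samples
    float -> already a fraction
    """
    if max_samples is None:
        return 1.0
    if isinstance(max_samples, numbers.Integral):
        if not 1 <= max_samples <= n_samples:
            raise ValueError("integer max_samples must be in [1, n_samples]")
        return max_samples / n_samples
    if isinstance(max_samples, numbers.Real):
        if not 0.0 < max_samples <= 1.0:
            raise ValueError("float max_samples must be in (0, 1]")
        return float(max_samples)
    raise TypeError("max_samples must be None, an int, or a float")


def count_classes(y):
    """Hypothetical helper: number of distinct labels, computed before fit."""
    return len(set(y))


# Values that a wrapper layer could compute and hand to the backend.
fraction = to_observations_fraction(50, n_samples=200)
class_count = count_classes([0, 1, 1, 2])
```

Keeping this translation in the wrapper layer means the backend-facing estimator only ever sees a fraction and an integer class count.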

PR should start as a draft, then move to ready for review state after CI is passed and all applicable checkboxes are closed.
This approach ensures that reviewers don't spend extra time asking for regular requirements.

You can remove a checkbox as not applicable only if it doesn't relate to this PR in any way.
For example, PR with docs update doesn't require checkboxes for performance while PR with any change in actual code should have checkboxes and justify how this code change is expected to affect performance (or justification should be self-evident).

Checklist to comply with before moving PR from draft:

PR completeness and readability

  • I have reviewed my changes thoroughly before submitting this pull request.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have updated the documentation to reflect the changes or created a separate PR with update and provided its number in the description, if necessary.
  • Git commit message contains an appropriate signed-off-by string (see CONTRIBUTING.md for details).
  • I have added a respective label(s) to PR if I have a permission for that.
  • I have resolved any merge conflicts that might occur with the base branch.

Testing

  • I have run it locally and tested the changes extensively.
  • All CI jobs are green or I have provided justification why they aren't.
  • I have extended testing suite if new functionality was introduced in this PR.

@ethanglaser
Contributor

/intelci: run

@icfaust
Contributor Author

icfaust commented Dec 6, 2025

/intelci: run

@icfaust added the enhancement and Array API labels Dec 6, 2025
@icfaust
Contributor Author

icfaust commented Dec 6, 2025

/intelci: run

@icfaust
Contributor Author

icfaust commented Dec 7, 2025

/intelci: run

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.



Comment thread sklearnex/ensemble/_forest.py
Comment thread onedal/ensemble/forest.py
Comment thread onedal/ensemble/forest.py
Comment on lines +692 to +693
for i, v in enumerate(class_weights):
    expanded_class_weight[y_store_unique_indices == i] *= v
Copilot AI Dec 7, 2025

The comment on line 688 warns about O(n*m) complexity. This nested iteration over classes and samples could be a significant performance bottleneck for datasets with many classes. Consider adding a more explicit warning in the docstring or raising a warning at runtime when the number of classes exceeds a threshold (e.g., >100).
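
To make the complexity concern concrete, here is a dependency-free sketch that contrasts the per-class masking pattern from the quoted loop with a single lookup per sample. Plain lists stand in for arrays, and both function names are hypothetical. Whether an index-based gather is usable under the PR's array API constraints is not something this thread establishes.

```python
def expand_by_class_loop(class_weights, class_indices):
    """Scan all samples once per class: O(n_samples * n_classes)."""
    expanded = [1.0] * len(class_indices)
    for i, v in enumerate(class_weights):
        # Equivalent of `expanded[class_indices == i] *= v` on arrays.
        for j, idx in enumerate(class_indices):
            if idx == i:
                expanded[j] *= v
    return expanded


def expand_by_gather(class_weights, class_indices):
    """Look up each sample's class weight directly: O(n_samples)."""
    return [1.0 * class_weights[idx] for idx in class_indices]


weights = [0.5, 2.0, 4.0]   # one weight per class
indices = [0, 2, 1, 2]      # class index of each sample
loop_result = expand_by_class_loop(weights, indices)
gather_result = expand_by_gather(weights, indices)
```

Both forms produce the same per-sample weights; they differ only in how the work scales with the number of classes.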

dtype=[xp.float64, xp.float32],
ensure_all_finite=not sklearn_check_version(
    "1.4"
),  # completed in offload check
Copilot AI Dec 7, 2025

The comment 'completed in offload check' is unclear about where and how the finite check is completed. This should reference the specific location (e.g., line numbers or function name) where the check occurs to aid future maintenance.

Suggested change
- ), # completed in offload check
+ ), # finite check is performed in support_input_format() in onedal._device_offload
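
The expression `ensure_all_finite=not sklearn_check_version("1.4")` in the quoted diff switches the validation-time finiteness check off from scikit-learn 1.4 onward, where, per the inline comment, it is completed in the offload check instead. A small sketch of that gating follows. `version_at_least` is a hypothetical stand-in written for this example and is not the real `sklearn_check_version` implementation.

```python
def version_at_least(installed, required):
    """Hypothetical stand-in for a version check: compare major.minor numerically."""
    def parse(version):
        return tuple(int(part) for part in version.split(".")[:2])
    return parse(installed) >= parse(required)


def finite_check_in_validation(installed_sklearn):
    # Mirrors `ensure_all_finite=not sklearn_check_version("1.4")`:
    # True means input validation itself checks for NaN/inf.
    return not version_at_least(installed_sklearn, "1.4")


old_behaviour = finite_check_in_validation("1.3.2")   # check stays in validation
new_behaviour = finite_check_in_validation("1.4.0")   # check is done elsewhere
```

Comparing parsed integers rather than raw strings matters here, since a string comparison would order "1.10" before "1.4".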

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.



Comment thread sklearnex/ensemble/_forest.py
Comment thread sklearnex/ensemble/_forest.py
Comment thread sklearnex/ensemble/_forest.py
Comment thread sklearnex/ensemble/_forest.py
Comment thread onedal/ensemble/forest.py
Comment thread onedal/ensemble/tests/test_random_forest.py
@icfaust
Contributor Author

icfaust commented Dec 7, 2025

/intelci: run

@icfaust
Contributor Author

icfaust commented Dec 7, 2025

/intelci: run

@icfaust
Contributor Author

icfaust commented Dec 7, 2025

Comment thread deselected_tests.yaml
- tests/test_common.py::test_estimators[ExtraTreesClassifier()-check_sample_weights_invariance(kind=ones)]
- tests/test_common.py::test_estimators[ExtraTreesClassifier()-check_sample_weights_invariance(kind=zeros)]
- tests/test_common.py::test_estimators[ExtraTreesRegressor()-check_sample_weights_invariance(kind=ones)]
- ensemble/tests/test_forest.py::test_min_weight_fraction_leaf
Contributor

CC @Alexandr-Solovev - this test in particular is very straightforward and not expected to fail, yet it does here.
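
For context on the deselected `check_sample_weights_invariance` entries listed above: in scikit-learn's estimator checks, `kind=ones` expects that fitting with unit weights matches fitting with no weights, and `kind=zeros` expects that a zero-weighted sample behaves as if it had been removed. The sketch below illustrates both properties on a toy weighted mean. The `weighted_mean` function is purely illustrative and unrelated to the forest implementation.

```python
def weighted_mean(values, sample_weight=None):
    """Toy 'estimator': a weighted average of the training targets."""
    if sample_weight is None:
        sample_weight = [1.0] * len(values)
    total = sum(sample_weight)
    return sum(v * w for v, w in zip(values, sample_weight)) / total


values = [1.0, 2.0, 6.0]

# kind=ones: unit weights must not change the result.
unweighted = weighted_mean(values)
with_ones = weighted_mean(values, [1.0, 1.0, 1.0])

# kind=zeros: a zero-weighted extra sample must act as if it were absent.
with_zero_weighted_extra = weighted_mean(values + [100.0], [1.0, 1.0, 1.0, 0.0])
```

An estimator that fails these checks typically handles weights on a code path that diverges from the unweighted one.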

@icfaust icfaust merged commit d769d14 into uxlfoundation:main Dec 8, 2025
31 checks passed
david-cortes-intel added a commit to david-cortes-intel/scikit-learn-intelex that referenced this pull request Dec 10, 2025
* add finiteness_checker pybind11 bindings

* added finiteness checker

* Update finiteness_checker.cpp

* Update finiteness_checker.cpp

* Update finiteness_checker.cpp

* Update finiteness_checker.cpp

* Update finiteness_checker.cpp

* Update finiteness_checker.cpp

* Rename finiteness_checker.cpp to finiteness_checker.cpp

* Update finiteness_checker.cpp

* add next step

* follow conventions

* make xtable explicit

* remove comment

* Update validation.py

* Update __init__.py

* Update validation.py

* Update __init__.py

* Update __init__.py

* Update validation.py

* Update _data_conversion.py

* Update _data_conversion.py

* Update policy_common.cpp

* Update policy_common.cpp

* Update _policy.py

* Update policy_common.cpp

* Rename finiteness_checker.cpp to finiteness_checker.cpp

* Create finiteness_checker.py

* Update validation.py

* Update __init__.py

* attempt at fixing circular imports again

* fix isort

* remove __init__ changes

* last move

* Update policy_common.cpp

* Update policy_common.cpp

* Update policy_common.cpp

* Update policy_common.cpp

* Update validation.py

* add testing

* isort

* attempt to fix module error

* add fptype

* fix typo

* Update validation.py

* remove sua_ifcae from to_table

* isort and black

* Update test_memory_usage.py

* format

* Update _data_conversion.py

* Update _data_conversion.py

* Update test_validation.py

* remove unnecessary code

* make reviewer changes

* make dtype check change

* add sparse testing

* try again

* try again

* try again

* temporary commit

* first attempt

* missing change?

* modify DummyEstimator for testing

* generalize DummyEstimator

* switch test

* further testing changes

* add initial validate_data test, will be refactored

* fixes for CI

* Update validation.py

* Update validation.py

* Update test_memory_usage.py

* Update base.py

* Update base.py

* improve tests

* fix logic

* fix logic

* fix logic again

* rename file

* Revert "rename file"

This reverts commit 8d47744.

* remove duplication

* fix imports

* Rename test_finite.py to test_validation.py

* Revert "Rename test_finite.py to test_validation.py"

This reverts commit ee799f6.

* updates

* Update validation.py

* fixes for some test failures

* fix text

* fixes for some failures

* make consistent

* fix bad logic

* fix in string

* attempt tp see if dataframe conversion is causing the issue

* fix iter problem

* fix testing issues

* formatting

* revert change

* fixes for pandas

* there is a slowdown with pandas that needs to be solved

* swap to transpose for speed

* more clarity

* add _check_sample_weight

* add more testing'

* rename

* remove unnecessary imports

* fix test slowness

* focus get_dataframes_and_queues

* put config_context around

* Update test_validation.py

* Update base.py

* Update test_validation.py

* generalize regex

* add fixes for sklearn 1.0 and input_name

* fixes for test failures

* Update validation.py

* Update test_validation.py

* Update validation.py

* formattintg

* make suggested changes

* follow changes made in uxlfoundation#2126

* fix future device problem

* Update validation.py

* finished movement

* fix first error

* next mistake

* remove bad dtypes check

* updates

* remove array

* solve onedal issues

* solve onedal issues

* updates

* updates

* further fixes

* further fixes

* fix issues to see how it goes

* oops

* updates

* add finite checks for predict and predict_proba

* updates

* centralize

* further reduce code

* updates

* remove sklearn conformance from onedal estimator init signature

* remove more

* fixes

* change away from sklearn `max_samples` in onedal estimators

* fix error

* move things

* Update forest.py

* Update forest.py

* Update _forest.py

* further fixes to onedal side

* further fixes to onedal side

* simplifications

* attempt at classifiers support

* further changes

* fix error on onedal side

* fix error on onedal side

* fixes

* fix pandas related error

* remove unnecessary code:

* try to fix issues related to regressor data

* fixes necessary for CI

* fixes for formatting

* updates

* push

* push

* fixes

* remove upon request

* remove upon request

* further fixes

* try to fix classifiers for array API inputs

* try again

* Update array_api.rst

* Update sklearnex/ensemble/_forest.py

Co-authored-by: david-cortes-intel <david.cortes@intel.com>

* Update _forest.py

* Update _forest.py

* Update _forest.py

* Update _forest.py

* Update _forest.py

* Update _forest.py

* Update _forest.py

* Update _forest.py

* Update _forest.py

* Update _forest.py

* Update forest.py

* Update _forest.py

* Update sklearnex/ensemble/_forest.py

Co-authored-by: ethanglaser <42726565+ethanglaser@users.noreply.github.com>

* Update _forest.py

* Update _forest.py

* Update array_api.rst

* Update array_api.rst

* remove sparse checks for sample_weight

* Update deselected_tests.yaml

* Update deselected_tests.yaml

---------

Co-authored-by: david-cortes-intel <david.cortes@intel.com>
Co-authored-by: ethanglaser <42726565+ethanglaser@users.noreply.github.com>

Labels

Array API enhancement New feature or request

6 participants