SLEP 014 Pandas in Pandas out #37

Merged (24 commits) on Nov 29, 2022
173 changes: 173 additions & 0 deletions slep014/proposal.rst
@@ -0,0 +1,173 @@
.. _slep_014:

==============================
SLEP014: Pandas In, Pandas Out
==============================

:Author: Thomas J Fan
:Status: Under Review
[Review comment (Member)] If we go for SLEP000, this would be a Draft.

:Type: Standards Track
:Created: 2020-02-18

Abstract
########

This SLEP proposes using pandas DataFrames for propagating feature names
through ``scikit-learn`` estimators.

Motivation
##########

``scikit-learn`` is generally used as a part of a larger data processing
pipeline. When this pipeline is used to transform data, the result is a
NumPy array, which discards column names. The current workflow for
extracting the feature names requires calling ``get_feature_names`` on the
transformer that created the feature. This interface becomes cumbersome when
used with a pipeline that involves multiple transformers and column sets::

    import pandas as pd
    import numpy as np
    from sklearn.compose import make_column_transformer
    from sklearn.preprocessing import OneHotEncoder, StandardScaler
    from sklearn.pipeline import make_pipeline
    from sklearn.linear_model import LogisticRegression

    X = pd.DataFrame({'letter': ['a', 'b', 'c'],
                      'pet': ['dog', 'snake', 'dog'],
                      'num': [1, 2, 3]})
    y = [0, 0, 1]
    orig_cat_cols, orig_num_cols = ['letter', 'pet'], ['num']

    ct = make_column_transformer(
        (OneHotEncoder(), orig_cat_cols), (StandardScaler(), orig_num_cols))
    pipe = make_pipeline(ct, LogisticRegression()).fit(X, y)

    cat_names = (pipe['columntransformer']
                 .named_transformers_['onehotencoder']
                 .get_feature_names(orig_cat_cols))

    feature_names = np.r_[cat_names, orig_num_cols]

The ``feature_names`` extracted above correspond to the features passed
directly into ``LogisticRegression``. As demonstrated above, extracting
``feature_names`` requires knowing the order of the selected categories in
the ``ColumnTransformer``. Furthermore, if the pipeline includes feature
selection, such as ``SelectKBest``, the ``get_support`` method would be
needed to determine which column names were kept.
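To make the ``SelectKBest`` case concrete, here is a minimal sketch (the data and the ``k=2`` choice are illustrative, not from the proposal) of recovering column names by masking them with ``get_support``:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

X = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0],
                  'b': [0.0, 1.0, 0.0, 1.0],
                  'c': [4.0, 3.0, 2.0, 1.0]})
y = [0, 0, 1, 1]

selector = SelectKBest(f_classif, k=2).fit(X, y)

# get_support() returns a boolean mask over the input columns; the
# caller must combine it with the column names by hand.
selected_names = np.asarray(X.columns)[selector.get_support()]
```

Which columns survive depends on the univariate scores; the point is that the names must be carried alongside the array manually, since the transformed output is a bare ndarray.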

Solution
########

The pandas ``DataFrame`` has been widely adopted by the Python Data ecosystem to
store data with feature names. This SLEP proposes using a ``DataFrame`` to
track the feature names as the data is transformed. With this feature, the
API for extracting feature names would be::

    from sklearn import set_config
    set_config(pandas_inout=True)
[Review comment (Member)] Minor nit, but I think ``pandas_in_out`` might be more readable.


    pipe.fit(X, y)
    X_trans = pipe[:-1].transform(X)

    print(X_trans.columns.tolist())
    ['letter_a', 'letter_b', 'letter_c', 'pet_dog', 'pet_snake', 'num']

Enabling Functionality
######################

The following enhancements are **not** a part of this SLEP. These features are
made possible if this SLEP gets accepted.

1. Allows estimators to treat columns differently based on name or dtype. For
   example, the categorical dtype is useful for tree building algorithms.

   [Review comment (Member)] In what way is this enabled by the present SLEP?
   I assume this means something more expansive: that we will try to retain
   dtype when outputting a DataFrame, e.g. after feature selection. Otherwise
   this pertains to the handling of pandas input, which is done on a
   case-by-case basis already?

2. Storing feature names inside estimators for model inspection::

   from sklearn import set_config
   set_config(store_feature_names_in=True)

   pipe.fit(X, y)

   pipe['logisticregression'].feature_names_in_

3. Allow for extracting the feature names of estimators in meta-estimators::

   from sklearn import set_config
   set_config(store_feature_names_in=True)

   est = BaggingClassifier(LogisticRegression())
   est.fit(X, y)

   # Gets the feature names used by an estimator in the ensemble
   est.estimators_[0].feature_names_in_

[Review comment (Member)] We probably should have the default values of these configs somewhere.

[Author reply] The default of ``pandas_inout`` is now stated twice, in the Solution and Backward compatibility sections. The default value of ``store_feature_names_in`` is stated in its section. I purposefully did not go into too many details in the Enabling Functionality section, since it serves as "what are the possibilities if this SLEP gets accepted".

Considerations
##############

Index alignment
---------------

Operations are index aligned when working with ``DataFrames``. Internally,
``scikit-learn`` will ignore the alignment by operating on the underlying
ndarray, as suggested by `TomAugspurger <https://github.com/scikit-learn/enhancement_proposals/pull/25#issuecomment-573859151>`_::

    def transform(self, X, y=None):
        X, row_labels, input_type = check_array(X)
        # X is a ndarray
        result = ...
        # some hypothetical function that recreates a DataFrame / DataArray,
        # preserving row labels, attaching new feature names.
        return construct_result(result, output_feature_names, row_labels, input_type)
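A brief, self-contained illustration (not from the proposal) of the alignment behavior being bypassed: pandas arithmetic pairs rows by index label, while raw ndarrays pair rows positionally:

```python
import pandas as pd

a = pd.DataFrame({'x': [1.0, 2.0, 3.0]}, index=[0, 1, 2])
b = pd.DataFrame({'x': [10.0, 20.0, 30.0]}, index=[2, 1, 0])

# DataFrame addition aligns rows on the index labels first
aligned = (a + b)['x'].tolist()
# -> [31.0, 22.0, 13.0]

# Operating on the underlying ndarrays pairs rows positionally
positional = (a.to_numpy() + b.to_numpy()).ravel().tolist()
# -> [11.0, 22.0, 33.0]
```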

Memory copies
-------------

As noted in `pandas #27211 <https://github.com/pandas-dev/pandas/issues/27211>`_,
there is no guarantee of a zero-copy round-trip from NumPy to a ``DataFrame``
and back. In other words, the following may lead to a memory copy in a future
version of ``pandas``::

    X = np.array(...)
    X_df = pd.DataFrame(X)
    X_again = np.asarray(X_df)
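Whether the round-trip actually copies can be checked with ``np.shares_memory``; a small sketch (the 2-D float array is an arbitrary example):

```python
import numpy as np
import pandas as pd

X = np.arange(6, dtype=float).reshape(3, 2)
X_df = pd.DataFrame(X)
X_again = np.asarray(X_df)

# The values survive the round-trip either way; whether the memory is
# shared depends on the pandas version and its internal block layout.
assert np.array_equal(X, X_again)
print(np.shares_memory(X, X_again))
```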

This is an issue for ``scikit-learn`` when estimators are placed into a
pipeline. For example, consider the following pipeline::

    set_config(pandas_inout=True)
    pipe = make_pipeline(StandardScaler(), LogisticRegression())
    pipe.fit(X, y)

Internally, ``StandardScaler.fit_transform`` will operate on a ndarray and
wrap the ndarray into a ``DataFrame`` as a return value. This will be
piped into ``LogisticRegression.fit``, which calls ``check_array`` on the
``DataFrame`` and may trigger a memory copy in a future version of
``pandas``. This leads to unnecessary overhead from piping the data from one
estimator to another.
[Review comment (Member)] Personally, I think that some transformers (like StandardScaler) could rather easily work column-wise to avoid such copying overhead. (Of course, that will give a stronger pandas dependence also for the implementation of such transformers.)

[Review comment (Member)] And it could even have the option of being "in-place" :D

[Author reply] We can try to support this. ``check_array`` would need to not run ``asarray`` on the dataframe, and the transformer would need to operate on the dataframe itself.

[Review comment (@amueller, Feb 21, 2020)] Yes, ``check_array`` could get another option! Exciting! ;) [I agree with @jorisvandenbossche though that this would be interesting for the future.]


Backward compatibility
######################

The ``pandas_inout`` global configuration flag will be set to ``False`` by
default to ensure backward compatibility. When this flag is ``False``, the
output of all estimators will be a ndarray.

[Review comment (Member)] Maybe more like ``pandas_output``?

[Author reply] I had something like that initially, but it feels like "We will always output pandas, even if the input is numpy arrays". (Naming is hard.)

Alternatives
############

- :ref:`SLEP012 Custom InputArray Data Structure <slep_012>`

References and Footnotes
------------------------

.. [1] Each SLEP must either be explicitly labeled as placed in the public
domain (see this SLEP as an example) or licensed under the `Open
Publication License`_.

.. _Open Publication License: https://www.opencontent.org/openpub/


Copyright
---------

This document has been placed in the public domain. [1]_
1 change: 1 addition & 0 deletions under_review.rst
Expand Up @@ -11,3 +11,4 @@ SLEPs under review
slep007/proposal
slep012/proposal
slep013/proposal
slep014/proposal