[FIX] Enable preprocessing in reg_cocktails #369
Conversation
Thanks for the PR. I reviewed all the files except the two files related to tabular_feature_validator.py.
Resolved review threads:
- autoPyTorch/pipeline/components/training/data_loader/image_data_loader.py (outdated)
- autoPyTorch/pipeline/components/training/data_loader/feature_data_loader.py
- autoPyTorch/pipeline/components/training/data_loader/base_data_loader.py (outdated, two threads)
- autoPyTorch/pipeline/components/setup/network_backbone/base_network_backbone.py
- autoPyTorch/pipeline/components/setup/network_embedding/base_network_embedding.py (three threads, two outdated)
- autoPyTorch/pipeline/components/setup/early_preprocessor/EarlyPreprocessing.py
I added some comments on tabular_feature_validator.
I checked the tabular_feature_validator
```python
    # columns are shifted to the left
    list(range(len(cat)))
    for cat in encoded_categories
]

# differently to categorical_columns and numerical_columns,
# this saves the index of the column.
```
The lines below would look better like this (`len(enc_columns) > 0` holds for data containing categoricals, right?):

```python
num_numericals, num_categoricals = self.feat_type.count('numerical'), self.feat_type.count('categorical')
if num_numericals + num_categoricals != len(self.feat_type):
    raise ValueError("Elements of feat_type must be either ['numerical', 'categorical']")

self.categorical_columns = list(range(num_categoricals))
self.numerical_columns = list(range(num_categoricals, num_categoricals + num_numericals))
```
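As a sanity check, the suggested count-based indexing can be run standalone. The `feat_type` value here is a hypothetical toy list, not taken from the validator:

```python
# Toy sketch of the count-based index assignment suggested above.
# After encoding, categorical columns are shifted to the left, so the
# column indices can be derived from the counts alone.
feat_type = ['categorical', 'categorical', 'numerical', 'numerical', 'numerical']

num_numericals = feat_type.count('numerical')
num_categoricals = feat_type.count('categorical')
if num_numericals + num_categoricals != len(feat_type):
    raise ValueError("Elements of feat_type must be either ['numerical', 'categorical']")

# categoricals occupy the leading indices, numericals the trailing ones
categorical_columns = list(range(num_categoricals))                                    # [0, 1]
numerical_columns = list(range(num_categoricals, num_categoricals + num_numericals))   # [2, 3, 4]
```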
```python
if self.all_nan_columns is not None and column in self.all_nan_columns:
    continue
```
Why don't we need this anymore?
Force-pushed from 1c1ff8a to 2cdb40a.
Force-pushed from 23d5fd4 to 6f4cf75.
Resolved review threads:
- autoPyTorch/pipeline/components/setup/network_backbone/base_network_backbone.py
- autoPyTorch/pipeline/components/setup/network_embedding/base_network_embedding.py (three threads)
Force-pushed from ceb9f19 to 27848fd.
```python
if len(self.dtypes) != 0:
    # when train data has no object dtype, but test does
    # we prioritise the datatype given in training data
    for column, data_type in zip(X.columns, self.dtypes):
        X[column] = X[column].astype(data_type)
else:
    # Calling for the first time to infer the categories
    X = X.infer_objects()
    for column, data_type in zip(X.columns, X.dtypes):
        if not is_numeric_dtype(data_type):
            X[column] = X[column].astype('category')
```
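For context, the dtype logic under discussion can be sketched on a toy DataFrame. The data here is hypothetical and this is not the validator itself, just the first-call branch in isolation:

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Toy first-call scenario: object columns, no stored training dtypes yet.
X_train = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']}, dtype=object)

# infer_objects soft-converts 'a' to int64; 'b' stays object
X_train = X_train.infer_objects()

# anything that is still non-numeric is treated as categorical
for column, data_type in zip(X_train.columns, X_train.dtypes):
    if not is_numeric_dtype(data_type):
        X_train[column] = X_train[column].astype('category')
```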
Suggested change (replacing the nested `if`/`else` with `elif`/`else`):

```python
elif len(self.dtypes) != 0:  # when train data has no object dtype, but test does
    # we prioritise the datatype given in training data
    for column, data_type in zip(X.columns, self.dtypes):
        X[column] = X[column].astype(data_type)
else:  # Calling for the first time to infer the categories
    X = X.infer_objects()
    for column, data_type in zip(X.columns, X.dtypes):
        if not is_numeric_dtype(data_type):
            X[column] = X[column].astype('category')
```
I think these are just preferences on where to start the comment.
No no, it actually removes an indent level.
Actually, if you look closely, we are also saving the dtypes in self.object_dtype_mapping, which should be done for both of the two conditions you moved back an indent level. So I think it's fine the way it is.
Oh yeah, I did not notice that. But I also did not notice that we still have the same issue in this method (which happens when we have a huge number of features).
Could you fix it?
```python
if hasattr(self, 'object_dtype_mapping'):
    # Mypy does not process the hasattr. This dict is defined below
    try:
        X = X.astype(self.object_dtype_mapping)
    except Exception as e:
        self.logger.warning(f'Casting test data to data type in train data caused the exception {e}')
    return X

if len(self.dtypes) != 0:
    # when train data has no object dtype, but test does. Prioritise the datatype given in training data
    dtype_dict = {col: dtype for col, dtype in zip(X.columns, self.dtypes)}
    X = X.astype(dtype_dict)
else:
    # Calling for the first time to infer the categories
    X = X.infer_objects()
    dtype_dict = {col: 'category' for col, dtype in zip(X.columns, X.dtypes) if not is_numeric_dtype(dtype)}
    X = X.astype(dtype_dict)

# only numerical attributes and categories
self.object_dtype_mapping = {col: dtype for col, dtype in zip(X.columns, X.dtypes)}
```
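A minimal standalone sketch of this dict-based `astype` pattern on hypothetical toy data (the column names and the local `object_dtype_mapping` variable are illustrative only): a single cast with a dtype dict replaces the per-column loop, and the recorded mapping is reused to align test data with the dtypes seen at fit time.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Toy fit-time data: everything starts as object dtype.
X = pd.DataFrame({'num': [1, 2], 'cat': ['a', 'b']}, dtype=object)

X = X.infer_objects()  # 'num' becomes int64, 'cat' stays object

# Build one dtype dict and cast all non-numeric columns in a single call,
# instead of looping column by column.
dtype_dict = {col: 'category' for col, dtype in zip(X.columns, X.dtypes)
              if not is_numeric_dtype(dtype)}
X = X.astype(dtype_dict)

# Record the final dtypes so test data can be cast the same way.
object_dtype_mapping = {col: dtype for col, dtype in zip(X.columns, X.dtypes)}

# Later, align test data with the recorded training dtypes in one astype call.
X_test = pd.DataFrame({'num': [3], 'cat': ['a']}, dtype=object)
X_test = X_test.astype(object_dtype_mapping)
```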
Hey, thanks for your effort.
Hi, thanks for the changes and sorry for the late response.
Resolved review thread (outdated): autoPyTorch/pipeline/components/setup/network_embedding/base_network_embedding.py
Looks good to me now :)
Thanks for the PR :). It looks good to me too now. I only have one last minor question.
* enable preprocessing and remove is_small_preprocess
* address comments from shuhei and fix precommit checks
* fix tests
* fix precommit checks
* add suggestions from shuhei for astype use
* address speed issue when using object_dtype_mapping
* make code more readable
* improve documentation for base network embedding
Types of changes
Note that a Pull Request should only contain one of refactoring, new features or documentation changes.
Please separate these changes and send us individual PRs for each.
For more information on how to create a good pull request, please refer to The anatomy of a perfect pull request.
Checklist:
Description
Following the discussion in last week's meeting, we would like to have preprocessing as part of the pipeline in regularisation cocktails, so this PR makes the necessary changes to achieve that. These include uncommenting the various preprocessor transforms as well as the code for learning embeddings. Moreover, this PR also removes the `is_small_preprocess` attribute from `BaseDataset`, as we will always preprocess early in the `EarlyPreprocessing` node.
Motivation and Context
It is required to allow searching for preprocessing hyperparameters in the reg_cocktails branch. Also, to avoid redundant preprocessing of the dataset for each epoch, we would like to preprocess the whole data all at once.
How has this been tested?
Previous tests for preprocessing and early preprocessing have been enabled. Tests for the embedding module have also been enabled, to check that the embedding is compatible with the preprocessing approach.