[FIX] Enable preprocessing in reg_cocktails #369
Conversation
Thanks for the PR. I reviewed all the files except the two files related to tabular_feature_validator.py.
Resolved review threads:
- autoPyTorch/pipeline/components/training/data_loader/image_data_loader.py (outdated)
- autoPyTorch/pipeline/components/training/data_loader/feature_data_loader.py
- autoPyTorch/pipeline/components/training/data_loader/base_data_loader.py (outdated, two threads)
- autoPyTorch/pipeline/components/setup/network_backbone/base_network_backbone.py
- autoPyTorch/pipeline/components/setup/network_embedding/base_network_embedding.py (three threads, two outdated)
- autoPyTorch/pipeline/components/setup/early_preprocessor/EarlyPreprocessing.py
I added some comments on tabular_feature_validator.
I checked the tabular_feature_validator
```python
    # columns are shifted to the left
    list(range(len(cat)))
    for cat in encoded_categories
]

# differently to categorical_columns and numerical_columns,
# this saves the index of the column.
```
The lines below would look better like this (`len(enc_columns) > 0` holds for data containing categoricals, right?):

```python
num_numericals, num_categoricals = self.feat_type.count('numerical'), self.feat_type.count('categorical')
if num_numericals + num_categoricals != len(self.feat_type):
    raise ValueError("Elements of feat_type must be either ['numerical', 'categorical']")

self.categorical_columns = list(range(num_categoricals))
self.numerical_columns = list(range(num_categoricals, num_categoricals + num_numericals))
```
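As a sanity check, the suggested count-based indexing can be run standalone. The `feat_type` value here is a hypothetical toy list, not taken from the validator:

```python
# Toy sketch of the count-based index assignment suggested above.
# After encoding, categorical columns are shifted to the left, so the
# column indices can be derived from the counts alone.
feat_type = ['categorical', 'categorical', 'numerical', 'numerical', 'numerical']

num_numericals = feat_type.count('numerical')
num_categoricals = feat_type.count('categorical')
if num_numericals + num_categoricals != len(feat_type):
    raise ValueError("Elements of feat_type must be either ['numerical', 'categorical']")

# categoricals occupy the leading indices, numericals the trailing ones
categorical_columns = list(range(num_categoricals))                                    # [0, 1]
numerical_columns = list(range(num_categoricals, num_categoricals + num_numericals))   # [2, 3, 4]
```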
```python
if self.all_nan_columns is not None and column in self.all_nan_columns:
    continue
```
Why don't we need this anymore?
Force-pushed from 1c1ff8a to 2cdb40a.
Force-pushed from 23d5fd4 to 6f4cf75.
Resolved review threads:
- autoPyTorch/pipeline/components/setup/network_backbone/base_network_backbone.py
- autoPyTorch/pipeline/components/setup/network_embedding/base_network_embedding.py (three threads)
Force-pushed from ceb9f19 to 27848fd.
```python
if len(self.dtypes) != 0:
    # when train data has no object dtype, but test does
    # we prioritise the datatype given in training data
    for column, data_type in zip(X.columns, self.dtypes):
        X[column] = X[column].astype(data_type)
else:
    # Calling for the first time to infer the categories
    X = X.infer_objects()
    for column, data_type in zip(X.columns, X.dtypes):
        if not is_numeric_dtype(data_type):
            X[column] = X[column].astype('category')
```
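For context, the dtype logic under discussion can be sketched on a toy DataFrame. The data here is hypothetical and this is not the validator itself, just the first-call branch in isolation:

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Toy first-call scenario: object columns, no stored training dtypes yet.
X_train = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']}, dtype=object)

# infer_objects soft-converts 'a' to int64; 'b' stays object
X_train = X_train.infer_objects()

# anything that is still non-numeric is treated as categorical
for column, data_type in zip(X_train.columns, X_train.dtypes):
    if not is_numeric_dtype(data_type):
        X_train[column] = X_train[column].astype('category')
```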
Suggested change (replacing the nested `if`/`else` with `elif`/`else`):

```python
elif len(self.dtypes) != 0:  # when train data has no object dtype, but test does
    # we prioritise the datatype given in training data
    for column, data_type in zip(X.columns, self.dtypes):
        X[column] = X[column].astype(data_type)
else:  # Calling for the first time to infer the categories
    X = X.infer_objects()
    for column, data_type in zip(X.columns, X.dtypes):
        if not is_numeric_dtype(data_type):
            X[column] = X[column].astype('category')
```
I think these are just preferences on where to start the comment.
No no, it actually removes an indent level.
Actually, if you look closely, we are also saving the dtypes in self.object_dtype_mapping, which should be done for both of the two conditions you moved back an indent level. So I think it's fine the way it is.
Oh yeah, I did not notice that. But I also did not notice that we still have the same issue in this method (which happens when we have a huge number of features).
Could you fix it?
```python
if hasattr(self, 'object_dtype_mapping'):
    # Mypy does not process the hasattr. This dict is defined below
    try:
        X = X.astype(self.object_dtype_mapping)
    except Exception as e:
        self.logger.warning(f'Casting test data to data type in train data caused the exception {e}')
    return X

if len(self.dtypes) != 0:
    # when train data has no object dtype, but test does. Prioritise the datatype given in training data
    dtype_dict = {col: dtype for col, dtype in zip(X.columns, self.dtypes)}
    X = X.astype(dtype_dict)
else:
    # Calling for the first time to infer the categories
    X = X.infer_objects()
    dtype_dict = {col: 'category' for col, dtype in zip(X.columns, X.dtypes) if not is_numeric_dtype(dtype)}
    X = X.astype(dtype_dict)

# only numerical attributes and categories
self.object_dtype_mapping = {col: dtype for col, dtype in zip(X.columns, X.dtypes)}
```
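A minimal standalone sketch of this dict-based `astype` pattern on hypothetical toy data (the column names and the local `object_dtype_mapping` variable are illustrative only): a single cast with a dtype dict replaces the per-column loop, and the recorded mapping is reused to align test data with the dtypes seen at fit time.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Toy fit-time data: everything starts as object dtype.
X = pd.DataFrame({'num': [1, 2], 'cat': ['a', 'b']}, dtype=object)

X = X.infer_objects()  # 'num' becomes int64, 'cat' stays object

# Build one dtype dict and cast all non-numeric columns in a single call,
# instead of looping column by column.
dtype_dict = {col: 'category' for col, dtype in zip(X.columns, X.dtypes)
              if not is_numeric_dtype(dtype)}
X = X.astype(dtype_dict)

# Record the final dtypes so test data can be cast the same way.
object_dtype_mapping = {col: dtype for col, dtype in zip(X.columns, X.dtypes)}

# Later, align test data with the recorded training dtypes in one astype call.
X_test = pd.DataFrame({'num': [3], 'cat': ['a']}, dtype=object)
X_test = X_test.astype(object_dtype_mapping)
```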
Hey, thanks for your effort.
Hi, thanks for the changes and sorry for the late response.
Resolved review thread (outdated): autoPyTorch/pipeline/components/setup/network_embedding/base_network_embedding.py
Looks good to me now :)
Thanks for the PR :). It looks good to me too now. I only have one last minor question.
* enable preprocessing and remove is_small_preprocess
* address comments from shuhei and fix precommit checks
* fix tests
* fix precommit checks
* add suggestions from shuhei for astype use
* address speed issue when using object_dtype_mapping
* make code more readable
* improve documentation for base network embedding
Types of changes
Note that a Pull Request should only contain one of refactoring, new features or documentation changes.
Please separate these changes and send us individual PRs for each.
For more information on how to create a good pull request, please refer to The anatomy of a perfect pull request.
Checklist:
Description
Following the discussion in last week's meeting, we would like to have preprocessing as part of the pipeline in regularisation cocktails, so this PR makes the necessary changes to achieve that. These include uncommenting the various preprocessor transforms as well as the code for learning embeddings. Moreover, this PR also removes the `is_small_preprocess` attribute from `BaseDataset`, as we will always preprocess early in the `EarlyPreprocessing` node.
Motivation and Context
It is required to allow searching for preprocessing hyperparameters in the reg_cocktails branch. Also, to avoid redundant preprocessing of the dataset for each epoch, we would like to preprocess the whole data all at once.
How has this been tested?
Previous tests for preprocessing and early preprocessing have been enabled. Tests for the embedding module have also been enabled, to check that the embedding is compatible with the preprocessing approach.