[ADD] Calculate memory of dataset after one hot encoding (pytorch embedding) #437
Conversation
c2a98c9 to f2f5f72
port=X['logger_port'] if 'logger_port' in X else logging.handlers.DEFAULT_TCP_LOGGING_PORT,
- port=X['logger_port'] if 'logger_port' in X else logging.handlers.DEFAULT_TCP_LOGGING_PORT,
+ port=X.get('logger_port', logging.handlers.DEFAULT_TCP_LOGGING_PORT),
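A minimal sketch of why the suggested form is equivalent, using a hypothetical fit dictionary `X` (here just a plain dict for illustration):

```python
import logging.handlers

# Hypothetical stand-in for the fit dictionary; empty, so the default applies.
X = {}

# Conditional lookup, as in the original line:
port_a = X['logger_port'] if 'logger_port' in X else logging.handlers.DEFAULT_TCP_LOGGING_PORT

# More idiomatic equivalent suggested in the review:
port_b = X.get('logger_port', logging.handlers.DEFAULT_TCP_LOGGING_PORT)

assert port_a == port_b
```

`dict.get` returns its second argument when the key is absent, which collapses the lookup-with-fallback into one call.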
Actually, we don't need this code to be merged; I'll remove it.
else:
    multipliers.append(arr_dtypes[col].itemsize)
What happens in one-hot encoding when num_cat is larger than MIN_CATEGORIES_FOR_EMBEDDING_MAX?
They are not one-hot encoded, but rather sent to the PyTorch embedding module, where the one-hot encoding happens implicitly.
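A sketch of the per-column memory multiplier this implies, assuming a hypothetical threshold value (the real constant lives in autoPyTorch) and float32 data:

```python
import numpy as np

# Hypothetical threshold; the real value comes from autoPyTorch's constants.
MIN_CATEGORIES_FOR_EMBEDDING_MAX = 7

def column_multiplier(num_cat: int, itemsize: int) -> int:
    """Approximate per-row bytes contributed by one categorical column.

    Columns below the threshold are one-hot encoded (num_cat values per row);
    columns at or above it are fed to the PyTorch embedding module as a single
    index column, since the one-hot expansion there is implicit.
    """
    if num_cat < MIN_CATEGORIES_FOR_EMBEDDING_MAX:
        return num_cat * itemsize  # explicit one-hot columns
    return itemsize  # single value per row passed to the embedding

itemsize = np.dtype(np.float32).itemsize  # 4 bytes
assert column_multiplier(3, itemsize) == 12   # one-hot: 3 * 4
assert column_multiplier(100, itemsize) == 4  # embedding: index only
```

This mirrors the branch structure in the diff above: `num_cat * itemsize` for one-hot columns, plain `itemsize` otherwise.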
raise ValueError(err_msg)
for col, num_cat in zip(categorical_columns, n_categories_per_cat_column):
    if num_cat < MIN_CATEGORIES_FOR_EMBEDDING_MAX:
        multipliers.append(num_cat * arr_dtypes[col].itemsize)
Is it already guaranteed that all columns are non-object?
Otherwise, we should check it.
Yes, it is guaranteed that all columns are non-object. Moreover, they are also guaranteed to be NumPy arrays, as this code runs after the data has been transformed by the tabular feature validator.
if len(categorical_columns) > 0:
    if n_categories_per_cat_column is None:
        raise ValueError(err_msg)
    for col, num_cat in zip(categorical_columns, n_categories_per_cat_column):
We could use sum(...) here, same as below. (optional)
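A hedged sketch of that optional rewrite, using stand-in values for the names in the snippet (`arr_dtypes[col].itemsize` simplified to a single itemsize for illustration):

```python
# Hypothetical stand-ins for the names used in the diff.
categorical_columns = [0, 1, 2]
n_categories_per_cat_column = [3, 50, 5]
arr_itemsize = 4  # e.g. float32; the real code uses arr_dtypes[col].itemsize
MIN_CATEGORIES_FOR_EMBEDDING_MAX = 7

# Loop form, as in the diff:
multipliers = []
for col, num_cat in zip(categorical_columns, n_categories_per_cat_column):
    if num_cat < MIN_CATEGORIES_FOR_EMBEDDING_MAX:
        multipliers.append(num_cat * arr_itemsize)  # one-hot expansion
    else:
        multipliers.append(arr_itemsize)  # embedding: single index column
loop_total = sum(multipliers)

# sum(...) over a generator, as the reviewer suggests:
gen_total = sum(
    (num_cat if num_cat < MIN_CATEGORIES_FOR_EMBEDDING_MAX else 1) * arr_itemsize
    for num_cat in n_categories_per_cat_column
)

assert loop_total == gen_total == 36
```

The generator form avoids building the intermediate list when only the total is needed; whether that is clearer here is a style call, hence "(optional)".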
Co-authored-by: nabenabe0928 <[email protected]>
As discussed in the meeting, I reviewed the changes. Everything looks good to me, I'm just adding a minor suggestion as a comment.
autoPyTorch/pipeline/components/preprocessing/tabular_preprocessing/TabularColumnTransformer.py (outdated; resolved)
…ssing/TabularColumnTransformer.py
…edding) (#437)
* add updates for apt1.0+reg_cocktails
* debug loggers for checking data and network memory usage
* add support for pandas, test for data passing, remove debug loggers
* remove unwanted changes
* :
* Adjust formula to account for embedding columns
* Apply suggestions from code review
  Co-authored-by: nabenabe0928 <[email protected]>
* remove unwanted additions
* Update autoPyTorch/pipeline/components/preprocessing/tabular_preprocessing/TabularColumnTransformer.py
  Co-authored-by: nabenabe0928 <[email protected]>
This PR improves the approximation of a dataset's memory usage by considering the dataset after it has been transformed with one-hot encoding. In our experiments (the reg cocktails ablation study), we observed that memory usage explodes when categorical columns with high cardinality are one-hot encoded. Moreover, even with the addition of PyTorch embeddings (which remove the need to one-hot encode all categorical columns), excessive memory is used while building the neural network.
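A sketch of the estimate the PR describes, with hypothetical names and threshold (not autoPyTorch's actual API): numerical columns keep one value per row, low-cardinality categoricals expand to one-hot columns, and high-cardinality ones stay a single index column for the embedding module.

```python
import numpy as np

# Hypothetical threshold mirroring the PR's embedding cutoff.
MIN_CATEGORIES_FOR_EMBEDDING_MAX = 7

def approx_memory_after_encoding(
    n_rows: int,
    n_numerical: int,
    n_categories_per_cat_column: list,
    itemsize: int = np.dtype(np.float32).itemsize,
) -> int:
    """Rough bytes the dataset occupies after encoding (illustrative only)."""
    width = n_numerical  # numerical columns: one value per row
    for num_cat in n_categories_per_cat_column:
        if num_cat < MIN_CATEGORIES_FOR_EMBEDDING_MAX:
            width += num_cat  # explicit one-hot expansion
        else:
            width += 1        # implicit one-hot inside the embedding
    return n_rows * width * itemsize

# 1000 rows, 5 numerical columns, categoricals with 3 and 1000 categories.
# Naive one-hot would give width 5 + 3 + 1000 = 1008; with embeddings it is
# 5 + 3 + 1 = 9, so the estimate is 1000 * 9 * 4 bytes.
assert approx_memory_after_encoding(1000, 5, [3, 1000]) == 36000
```

The example shows why the correction matters: a single 1000-category column shifts the naive estimate by two orders of magnitude even though the embedding input stays one column wide.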