[ADD] Minority Coalescer #242
Hey, thanks for the PR. I have suggested minor changes; once they are resolved, I think this PR can be merged.
Codecov Report
@@              Coverage Diff               @@
##           development      #242       +/-   ##
=================================================
- Coverage        81.62%    29.28%    -52.35%
=================================================
  Files              150       154         +4
  Lines             8625      8794       +169
  Branches          1325      1351        +26
=================================================
- Hits              7040      2575      -4465
- Misses            1108      6219      +5111
+ Partials           477         0       -477
Continue to review full report at Codecov.
Hey, thanks for the PR. This is a great feature to have. I have left just a single comment; other than that, I think we can merge this PR.
Thanks for the PR.
This helps to prevent unexpectedly low performance.
However, some of the code seems to include vulnerabilities, so let's first remove those issues.
if self.minimum_fraction is None:
    return X
The code is a bit complicated, so I rewrote it.
def suggestion(X: Union[sparse.csr_matrix, np.ndarray]) -> Union[sparse.csr_matrix, np.ndarray]:
    n_features = X.shape[1]
    is_sparse = sparse.issparse(X)
    for col in range(n_features):
        if is_sparse:
            indptr_start = X.indptr[col]
            indptr_end = X.indptr[col + 1]
            col_data = X.data[indptr_start:indptr_end]
        else:
            col_data = X[:, col]
        # np.isin does not handle sets reliably, so pass a list
        mask = ~np.isin(col_data, list(do_not_coalesce_))
        col_data[mask] = -2
    return X
Just in case, I also checked that it behaves identically.
size = (400000, 100)
n_vals = 10
do_not_coalesce_ = set([3, 5])
X = np.random.randint(n_vals, size=size)
X_copy = np.copy(X)
n_wrongs = (original(X) != suggestion(X_copy)).sum()
n_wrongs
>>> 0 # It works identically
Plus, I also checked the benchmark speed.
size = (400000, 100)
n_vals = 10
do_not_coalesce_ = set([3, 5])
X = np.random.randint(n_vals, size=size)
%timeit original(X)
>>> 1.24 s ± 25.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
X = np.random.randint(n_vals, size=size)
%timeit suggestion(X)
>>> 549 ms ± 13.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
hist = np.histogram(Y[:, col], bins=np.arange(-2, 7))
np.testing.assert_array_almost_equal(hist[0], [10, 0, 0, 0, 0, 30, 30, 30])
# Assert no copies were made
assert id(X) == id(Y)
If you would like this method to share the same address before and after the transformation, we should not return the array as a result: any change we later make to X would also change Y unexpectedly. It is not safe.
As the name of the method (transform) indicates, it is within expectation that the input argument will be transformed.
Normally the input can be an array of thousands of elements; creating a copy every time we transform is not feasible, so we do expect to transform it in place.
Yes, the copy takes memory, so I would like it to be a destructive method.
Since we potentially change the original (the array before the transformation) and those changes will be reflected in the transformed version as well, we do not have to return the transformed version.
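A minimal sketch of the destructive contract being discussed (illustrative only, not the PR's actual code):

```python
import numpy as np

def transform(X: np.ndarray) -> np.ndarray:
    """Destructive transform: modifies X in place and returns the same object."""
    X[X < 0] = -2  # example in-place modification
    return X

X = np.array([1, -1, 2, -3])
Y = transform(X)
assert Y is X                        # no copy was made
assert X.tolist() == [1, -2, 2, -2]  # the original array changed too
```

Returning the same object (or nothing at all) makes the in-place behavior explicit to callers.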
""" | ||
def __init__(self, minimum_fraction: float, random_state: Optional[Union[np.random.RandomState, int]] = None): | ||
super().__init__() | ||
self.minimum_fraction = minimum_fraction |
minimum_fraction -> min_fraction (convention)
I don't think we have a convention for this, and in this case, having the complete word is clearer. If possible, I would like to preserve it.
PyTorch uses min rather than minimum; when it uses minimum, it is only for the element-wise minimum.
But if you would like to stick to it, that is also fine.
class NoCoalescer(BaseCoalescer):
    """
    Don't perform NoCoalescer on categorical features
Why?
This is a NoCoalescer class. It allows the BO model to enable/disable coalescing: the choice object selects between MinorityCoalescer and NoCoalescer depending on what gives better performance.
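A self-contained sketch of such a pass-through component (the names and the missing base class are our simplification, not the PR's BaseCoalescer-based implementation):

```python
import numpy as np

class NoCoalescer:
    """Pass-through: leaves categorical features untouched.

    An explicit no-op component lets the optimizer treat
    "coalesce or not" as a categorical hyperparameter choice.
    """

    def fit(self, X: np.ndarray, y=None) -> "NoCoalescer":
        return self  # nothing to learn

    def transform(self, X: np.ndarray) -> np.ndarray:
        return X  # return the input unchanged, no copy


X = np.array([[1, 2], [3, 4]])
out = NoCoalescer().fit(X).transform(X)
assert out is X  # identical object, consistent with the in-place contract
```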
I mean, I did not get whether you mean "Do not perform NoCoalescer" or "Do not perform coalescing".
autoPyTorch/utils/implementations.py
if fraction >= self.minimum_fraction:
    do_not_coalesce[-1].add(unique_value)

self.do_not_coalesce_ = do_not_coalesce
Can we stick to self._<private var name> rather than self.<private var name>_?
Yes, I have made the change. This is something we also have not discussed before, as it does not have any special meaning in Python; it is just a matter of convention. Maybe you can sync up with Ravin and have this added to the wiki.
Ok, I will do it, but mostly I just follow PEP 8 rules.
By the way, is the <var name>_ naming from another language?
autoPyTorch/utils/implementations.py
for column in range(X.shape[1]):
    if sparse.issparse(X):
        indptr_start = X.indptr[column]
        indptr_end = X.indptr[column + 1]
        unique = np.unique(X.data[indptr_start:indptr_end])
        for unique_value in unique:
            if unique_value not in self.do_not_coalesce_[column]:
                indptr_start = X.indptr[column]
                indptr_end = X.indptr[column + 1]
                X.data[indptr_start:indptr_end][
                    X.data[indptr_start:indptr_end] == unique_value] = -2
    else:
        unique = np.unique(X[:, column])
        unique_values = [unique_value for unique_value in unique
                         if unique_value not in self.do_not_coalesce_[column]]
        mask = np.isin(X[:, column], unique_values)
        # The imputer uses -1 for unknown categories
        # Then -2 means coalesced categories
        X[mask, column] = -2
Suggested change (replacing the loop above):

n_features = X.shape[1]
is_sparse = sparse.issparse(X)
for col in range(n_features):
    if is_sparse:
        indptr_start = X.indptr[col]
        indptr_end = X.indptr[col + 1]
        col_data = X.data[indptr_start:indptr_end]
    else:
        col_data = X[:, col]
    # The imputer uses -1 for unknown categories
    # Then -2 means coalesced categories
    mask = ~np.isin(col_data, list(self.do_not_coalesce_[col]))
    col_data[mask] = -2
Here is #suggestion01. Let me know if you have any questions.
I think the fit and transform methods in the MinorityCoalescer class can be merged when I refactor them.
Please take a look at the refactored version of both methods.
Just in case, I also put the refactored version of the merged method:
n_instances, n_features = X.shape
is_sparse = sparse.issparse(X)
for col in range(n_features):
    if is_sparse:
        indptr_start = X.indptr[col]
        indptr_end = X.indptr[col + 1]
        col_data = X.data[indptr_start:indptr_end]
    else:
        col_data = X[:, col]
    unique_vals, counts = np.unique(col_data, return_counts=True)
    frac = counts / n_instances
    coalesce = unique_vals[frac <= min_frac]
    # The imputer uses -1 for unknown categories
    # Then -2 means coalesced categories
    mask = np.isin(col_data, coalesce)
    col_data[mask] = -2
for column in range(X.shape[1]):
    do_not_coalesce.append(set())

    if sparse.issparse(X):
        indptr_start = X.indptr[column]
        indptr_end = X.indptr[column + 1]
        unique, counts = np.unique(
            X.data[indptr_start:indptr_end], return_counts=True)
        colsize = indptr_end - indptr_start
    else:
        unique, counts = np.unique(X[:, column], return_counts=True)
        colsize = X.shape[0]

    for unique_value, count in zip(unique, counts):
        fraction = float(count) / colsize
        if fraction >= self.minimum_fraction:
            do_not_coalesce[-1].add(unique_value)
I found some issues and refactored the code.
Issues:
- colsize changes depending on whether the feature matrix is sparse or not (indptr_end - indptr_start is not correct).
- The fraction also changes because of the first issue (in particular, almost no cut-off happens in sparse matrices).
- Unneeded use of a for loop (rewritten with numpy to eliminate it, which I did).
This function can also be written more cleanly.
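The colsize issue can be demonstrated with a tiny sparse matrix: indptr_end - indptr_start counts only the explicitly stored (nonzero) entries of a column, not the number of rows, so fractions computed from it are inflated (variable names here are ours):

```python
import numpy as np
from scipy import sparse

# Column 0 has one nonzero out of 5 rows: the true fraction of value 7 is 1/5
X = sparse.csc_matrix(np.array([[7], [0], [0], [0], [0]]))

col = 0
indptr_start, indptr_end = X.indptr[col], X.indptr[col + 1]
stored = indptr_end - indptr_start  # counts stored (nonzero) entries only
print(stored)      # 1 -> fraction of value 7 would be 1/1 = 1.0 (wrong)
print(X.shape[0])  # 5 -> correct denominator gives 1/5 = 0.2
```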
Suggested change (replacing the loop above):

n_instances, n_features = X.shape
do_not_coalesce = [set() for _ in range(n_features)]
is_sparse = sparse.issparse(X)
for col, no_coal in enumerate(do_not_coalesce):
    if is_sparse:
        indptr_start = X.indptr[col]
        indptr_end = X.indptr[col + 1]
        col_data = X.data[indptr_start:indptr_end]
    else:
        col_data = X[:, col]
    unique_vals, counts = np.unique(col_data, return_counts=True)
    frac = counts / n_instances
    no_coal.update(unique_vals[frac >= self.minimum_fraction])
Yes, but we do not want to merge fit and transform, as we are using the scikit-learn API. You can read more about it here.
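A toy illustration of why the two methods stay separate under the scikit-learn API: fit learns the frequent categories from the training data, and transform applies that fixed decision to any data (the class below is our sketch, not autoPyTorch code):

```python
import numpy as np

class MinorityMarker:
    """Toy fit/transform pair in the scikit-learn style."""

    def __init__(self, minimum_fraction: float):
        self.minimum_fraction = minimum_fraction

    def fit(self, X: np.ndarray) -> "MinorityMarker":
        # Learn, per column, which categories are frequent enough to keep
        n = X.shape[0]
        self._keep = []
        for col in range(X.shape[1]):
            vals, counts = np.unique(X[:, col], return_counts=True)
            self._keep.append(vals[counts / n >= self.minimum_fraction])
        return self

    def transform(self, X: np.ndarray) -> np.ndarray:
        # Apply the decision learned at fit time, in place
        for col in range(X.shape[1]):
            mask = ~np.isin(X[:, col], self._keep[col])
            X[mask, col] = -2  # -2 marks coalesced categories
        return X

X_train = np.array([[0], [0], [0], [1]])
X_test = np.array([[0], [1], [2]])
marker = MinorityMarker(minimum_fraction=0.5).fit(X_train)
out = marker.transform(X_test)
# Categories infrequent in training (1 and 2) are coalesced to -2
```

Merging the two methods would recompute the frequencies on the test data, silently changing which categories get coalesced.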
Ok, then I will respond to the corresponding lines separately.
I will tag them with #suggestion01 and #suggestion02.
Actually, this suggestion is #suggestion01 (the change for getting do_not_coalesce).
…ssing/coalescer/__init__.py Co-authored-by: nabenabe0928 <[email protected]>
I added some comments, mostly small comments, as responses.
We have merged this with #376.
addresses #239