[ADD] Minority Coalescer #242
Hey, thanks for the PR. I have suggested minor changes; once they are resolved, I think this PR can be merged.
Codecov Report
@@              Coverage Diff               @@
##           development      #242       +/-   ##
=================================================
- Coverage        81.62%    29.28%    -52.35%
=================================================
  Files              150       154         +4
  Lines             8625      8794       +169
  Branches          1325      1351        +26
=================================================
- Hits              7040      2575      -4465
- Misses            1108      6219      +5111
+ Partials           477         0       -477
Continue to review full report at Codecov.
Hey, thanks for the PR. This is a great feature to have. I have left just a single comment; other than that, I think we can merge this PR.
Thanks for the PR.
This helps to prevent unexpectedly low performance.
However, some of the code seems to include vulnerabilities, so let's first remove those issues.
if self.minimum_fraction is None:
    return X
The code is a bit complicated, so I rewrote it.
def suggestion(X: Union[sparse.csr_matrix, np.ndarray]) -> Union[sparse.csr_matrix, np.ndarray]:
    n_features = X.shape[1]
    is_sparse = sparse.issparse(X)
    for col in range(n_features):
        if is_sparse:
            indptr_start = X.indptr[col]
            indptr_end = X.indptr[col + 1]
            col_data = X.data[indptr_start:indptr_end]
        else:
            col_data = X[:, col]
        # np.isin does not handle sets reliably, so pass a list
        mask = ~np.isin(col_data, list(do_not_coalesce_))
        col_data[mask] = -2
    return X
Just in case, I also checked that it behaves identically.
size = (400000, 100)
n_vals = 10
do_not_coalesce_ = set([3, 5])
X = np.random.randint(n_vals, size=size)
X_copy = np.copy(X)
n_wrongs = (original(X) != suggestion(X_copy)).sum()
n_wrongs
>>> 0 # It works identically
Plus, I also checked the benchmark speed.
size = (400000, 100)
n_vals = 10
do_not_coalesce_ = set([3, 5])
X = np.random.randint(n_vals, size=size)
%timeit original(X)
>>> 1.24 s ± 25.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
X = np.random.randint(n_vals, size=size)
%timeit suggestion(X)
>>> 549 ms ± 13.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
hist = np.histogram(Y[:, col], bins=np.arange(-2, 7))
np.testing.assert_array_almost_equal(hist[0], [10, 0, 0, 0, 0, 30, 30, 30])
# Assert no copies were made
assert id(X) == id(Y)
If you would like this method to share the same address before and after the transformation, we should not return the array as a result: any change we later make to X would also change Y unexpectedly. It is not safe.
As the name of the method (transform) indicates, it is within expectation that the input argument will be transformed.
Normally the input can be an array of thousands of elements; creating a copy every time we transform is not feasible, so we do expect to transform it in place.
Yes, the copy takes memory, so I would like it to be a destructive method.
Since we potentially change the original (the array before the transformation) and those changes will be reflected in the transformed version as well, we do not have to return the transformed version.
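A minimal sketch of the destructive contract being discussed (illustrative only, not the PR's actual code):

```python
import numpy as np

def transform(X: np.ndarray) -> np.ndarray:
    """Destructive transform: modifies X in place and returns the same object."""
    X[X < 0] = -2  # example in-place modification
    return X

X = np.array([1, -1, 2, -3])
Y = transform(X)
assert Y is X                        # no copy was made
assert X.tolist() == [1, -2, 2, -2]  # the original array changed too
```

Returning the same object (or nothing at all) makes the in-place behavior explicit to callers.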
""" | ||
def __init__(self, minimum_fraction: float, random_state: Optional[Union[np.random.RandomState, int]] = None): | ||
super().__init__() | ||
self.minimum_fraction = minimum_fraction |
minimum_fraction -> min_fraction (convention)
I don't think we have a convention for this, and in this case, having the complete word is clearer. If possible, I would like to preserve it.
PyTorch uses min rather than minimum; when it uses minimum, it is only for the element-wise minimum.
But if you would like to stick to it, that is also fine.
class NoCoalescer(BaseCoalescer):
    """
    Don't perform NoCoalescer on categorical features
Why?
This is a NoCoalescer class. It allows the BO model to enable/disable coalescing: the choice object selects between MinorityCoalescer and NoCoalescer depending on what gives better performance.
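A self-contained sketch of such a pass-through component (the names and the missing base class are our simplification, not the PR's BaseCoalescer-based implementation):

```python
import numpy as np

class NoCoalescer:
    """Pass-through: leaves categorical features untouched.

    An explicit no-op component lets the optimizer treat
    "coalesce or not" as a categorical hyperparameter choice.
    """

    def fit(self, X: np.ndarray, y=None) -> "NoCoalescer":
        return self  # nothing to learn

    def transform(self, X: np.ndarray) -> np.ndarray:
        return X  # return the input unchanged, no copy


X = np.array([[1, 2], [3, 4]])
out = NoCoalescer().fit(X).transform(X)
assert out is X  # identical object, consistent with the in-place contract
```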
I mean, I did not get whether you mean "Do not perform NoCoalescer" or "Do not perform coalescing".
autoPyTorch/utils/implementations.py
if fraction >= self.minimum_fraction:
    do_not_coalesce[-1].add(unique_value)

self.do_not_coalesce_ = do_not_coalesce
Can we stick to self._<private var name> rather than self.<private var name>_?
Yes, I have made the change. This is something we also have not discussed before, as it does not have any special meaning in Python; it is just a matter of convention. Maybe you can sync up with Ravin and have this added to the wiki.
Ok, I will do it, but mostly I just follow PEP 8 rules.
By the way, is the <var name>_ naming from another language?
autoPyTorch/utils/implementations.py
for column in range(X.shape[1]):
    if sparse.issparse(X):
        indptr_start = X.indptr[column]
        indptr_end = X.indptr[column + 1]
        unique = np.unique(X.data[indptr_start:indptr_end])
        for unique_value in unique:
            if unique_value not in self.do_not_coalesce_[column]:
                indptr_start = X.indptr[column]
                indptr_end = X.indptr[column + 1]
                X.data[indptr_start:indptr_end][
                    X.data[indptr_start:indptr_end] == unique_value] = -2
    else:
        unique = np.unique(X[:, column])
        unique_values = [unique_value for unique_value in unique
                         if unique_value not in self.do_not_coalesce_[column]]
        mask = np.isin(X[:, column], unique_values)
        # The imputer uses -1 for unknown categories
        # Then -2 means coalesced categories
        X[mask, column] = -2
Suggested change (replacing the loop above):

n_features = X.shape[1]
is_sparse = sparse.issparse(X)
for col in range(n_features):
    if is_sparse:
        indptr_start = X.indptr[col]
        indptr_end = X.indptr[col + 1]
        col_data = X.data[indptr_start:indptr_end]
    else:
        col_data = X[:, col]
    # The imputer uses -1 for unknown categories
    # Then -2 means coalesced categories
    mask = ~np.isin(col_data, list(self.do_not_coalesce_[col]))
    col_data[mask] = -2
Here is #suggestion01. Let me know if you have any questions.
I think the fit and transform methods in the MinorityCoalescer class can be merged when I refactor them.
Please take a look at the refactored version of both methods.
Just in case, I also put the refactored version of the merged method:
n_instances, n_features = X.shape
is_sparse = sparse.issparse(X)
for col in range(n_features):
    if is_sparse:
        indptr_start = X.indptr[col]
        indptr_end = X.indptr[col + 1]
        col_data = X.data[indptr_start:indptr_end]
    else:
        col_data = X[:, col]
    unique_vals, counts = np.unique(col_data, return_counts=True)
    frac = counts / n_instances
    coalesce = unique_vals[frac <= min_frac]
    # The imputer uses -1 for unknown categories
    # Then -2 means coalesced categories
    mask = np.isin(col_data, coalesce)
    col_data[mask] = -2
for column in range(X.shape[1]):
    do_not_coalesce.append(set())

    if sparse.issparse(X):
        indptr_start = X.indptr[column]
        indptr_end = X.indptr[column + 1]
        unique, counts = np.unique(
            X.data[indptr_start:indptr_end], return_counts=True)
        colsize = indptr_end - indptr_start
    else:
        unique, counts = np.unique(X[:, column], return_counts=True)
        colsize = X.shape[0]

    for unique_value, count in zip(unique, counts):
        fraction = float(count) / colsize
        if fraction >= self.minimum_fraction:
            do_not_coalesce[-1].add(unique_value)
I found some issues and refactored the code.
Issues:
- colsize changes depending on whether the feature matrix is sparse or not (indptr_end - indptr_start is not correct).
- The fraction also changes because of the first issue (in particular, almost no cut-off happens in sparse matrices).
- Unneeded use of a for loop (rewritten with numpy to eliminate it, which I did).
This function can also be written more cleanly.
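The colsize issue can be demonstrated with a tiny sparse matrix: indptr_end - indptr_start counts only the explicitly stored (nonzero) entries of a column, not the number of rows, so fractions computed from it are inflated (variable names here are ours):

```python
import numpy as np
from scipy import sparse

# Column 0 has one nonzero out of 5 rows: the true fraction of value 7 is 1/5
X = sparse.csc_matrix(np.array([[7], [0], [0], [0], [0]]))

col = 0
indptr_start, indptr_end = X.indptr[col], X.indptr[col + 1]
stored = indptr_end - indptr_start  # counts stored (nonzero) entries only
print(stored)      # 1 -> fraction of value 7 would be 1/1 = 1.0 (wrong)
print(X.shape[0])  # 5 -> correct denominator gives 1/5 = 0.2
```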
Suggested change (replacing the loop above):

n_instances, n_features = X.shape
do_not_coalesce = [set() for _ in range(n_features)]
is_sparse = sparse.issparse(X)
for col, no_coal in enumerate(do_not_coalesce):
    if is_sparse:
        indptr_start = X.indptr[col]
        indptr_end = X.indptr[col + 1]
        col_data = X.data[indptr_start:indptr_end]
    else:
        col_data = X[:, col]
    unique_vals, counts = np.unique(col_data, return_counts=True)
    frac = counts / n_instances
    no_coal.update(unique_vals[frac >= self.minimum_fraction])
Yes, but we do not want to merge fit and transform, as we are using the scikit-learn API. You can read more about it here.
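A toy illustration of why the two methods stay separate under the scikit-learn API: fit learns the frequent categories from the training data, and transform applies that fixed decision to any data (the class below is our sketch, not autoPyTorch code):

```python
import numpy as np

class MinorityMarker:
    """Toy fit/transform pair in the scikit-learn style."""

    def __init__(self, minimum_fraction: float):
        self.minimum_fraction = minimum_fraction

    def fit(self, X: np.ndarray) -> "MinorityMarker":
        # Learn, per column, which categories are frequent enough to keep
        n = X.shape[0]
        self._keep = []
        for col in range(X.shape[1]):
            vals, counts = np.unique(X[:, col], return_counts=True)
            self._keep.append(vals[counts / n >= self.minimum_fraction])
        return self

    def transform(self, X: np.ndarray) -> np.ndarray:
        # Apply the decision learned at fit time, in place
        for col in range(X.shape[1]):
            mask = ~np.isin(X[:, col], self._keep[col])
            X[mask, col] = -2  # -2 marks coalesced categories
        return X

X_train = np.array([[0], [0], [0], [1]])
X_test = np.array([[0], [1], [2]])
marker = MinorityMarker(minimum_fraction=0.5).fit(X_train)
out = marker.transform(X_test)
# Categories infrequent in training (1 and 2) are coalesced to -2
```

Merging the two methods would recompute the frequencies on the test data, silently changing which categories get coalesced.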
Ok, then I will respond to the corresponding lines separately.
I will tag them with #suggestion01 and #suggestion02.
Actually, this suggestion is #suggestion01 (the change for getting do_not_coalesce).
…ssing/coalescer/__init__.py Co-authored-by: nabenabe0928 <[email protected]>
I added some comments, mostly small comments, as responses.
We have merged this with #376.
addresses #239