Skip to content

TST/CLN: correctly skip in indexes/common; add test for duplicated #21902

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Aug 10, 2018
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
59 changes: 43 additions & 16 deletions pandas/tests/indexes/common.py
Original file line number Diff line number Diff line change
Expand Up @@ -35,10 +35,6 @@ def verify_pickle(self, indices):
assert indices.equals(unpickled)

def test_pickle_compat_construction(self):
# this is testing for pickle compat
if self._holder is None:
return

# need an object to create with
pytest.raises(TypeError, self._holder)

Expand Down Expand Up @@ -236,7 +232,7 @@ def test_set_name_methods(self, indices):

# don't tests a MultiIndex here (as its tested separated)
if isinstance(indices, MultiIndex):
return
pytest.skip('Skip check for MultiIndex')
original_name = indices.name
new_ind = indices.set_names([new_name])
assert new_ind.name == new_name
Expand Down Expand Up @@ -333,7 +329,8 @@ def test_copy_and_deepcopy(self, indices):
from copy import copy, deepcopy

if isinstance(indices, MultiIndex):
return
pytest.skip('Skip check for MultiIndex')

for func in (copy, deepcopy):
idx_copy = func(indices)
assert idx_copy is not indices
Expand All @@ -342,20 +339,50 @@ def test_copy_and_deepcopy(self, indices):
new_copy = indices.copy(deep=True, name="banana")
assert new_copy.name == "banana"

def test_duplicates(self, indices):
def test_has_duplicates(self, indices):
if type(indices) is not self._holder:
return
pytest.skip('Can only check if we have the correct type')
if not len(indices) or isinstance(indices, MultiIndex):
return
# MultiIndex tested separately in:
# tests/indexes/multi/test_unique_and_duplicates
pytest.skip('Skip check for empty Index and MultiIndex')

idx = self._holder([indices[0]] * 5)
assert not idx.is_unique
assert idx.has_duplicates

@pytest.mark.parametrize('keep', ['first', 'last', False])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this a duplicate these (pun intended) or are not testing indices duplicated currently?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback .duplicated is hardly ever tested directly, only indirectly for stuff like .drop_duplicates. Regardless of the changes to .duplicated in #21645, I think duplicated should be tested separately.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so these are not relevant?

pandas/tests/indexes/common.py:    def test_duplicates(self, indices):
pandas/tests/indexes/test_category.py:    def test_duplicates(self):
pandas/tests/indexes/test_range.py:    def test_duplicates(self):

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback No, test_duplicates (at least for pandas/tests/indexes/common.py, which is what this PR is about) tests .is_unique and .has_duplicates, but not the .duplicated-method itself.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you point to the coverage that shows this is NOT tested?

Copy link
Contributor Author

@h-vetinari h-vetinari Jul 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback I never said that it is not tested, just that it is only tested implicitly. Any call to .drop_duplicates will invoke duplicated, so obviously the coverage works out.

def test_duplicated(self, indices, keep):
if type(indices) is not self._holder:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you use isinstance here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback isinstance of what? I copied the checks from other tests, because I didn't know all the different cases that flow into common.py.

Copy link
Contributor

@jreback jreback Jul 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isinstance of self._holder, is this what we do elsewhere?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not in indexes/common.py, as far as I can see. There's several isinstance of course, but never for self._holder, which can apparently be None as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback If I change

if type(indices) is not self._holder:

to

if not isinstance(indices, self._holder):

I get 9 failures (instead of skips), all of which are from

tests/indexes/test_numeric.TestInt64Index, when run against DatetimeIndex, PeriodIndex or TimedeltaIndex, mainly because of TypeError: Unsafe NumPy casting, you must explicitly cast it seems.

Copy link
Contributor Author

@h-vetinari h-vetinari Jul 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback Seem to have found something: in my normal workbook, I get

pd.__version__
# '0.24.0.dev0+321.g0fe6ded52.dirty'
isinstance(pd.PeriodIndex, pd.Int64Index)
# False

but for some reason, within tests/indexes/common.py, this is the opposite:

    [...output of test run...]
    
    @pytest.mark.parametrize('keep', ['first', 'last', False])
    def test_duplicated(self, indices, keep):
        if not isinstance(indices, self._holder):
            pytest.skip('Can only check if we know the index type')
        if not len(indices) or isinstance(indices, (MultiIndex)):
            # MultiIndex tested separately in:
            # tests/indexes/multi/test_unique_and_duplicates
            pytest.skip('Skip check for empty Index and MultiIndex')
        if isinstance(indices, (PeriodIndex, DatetimeIndex)):
            # this branch should be impossible for Int64Index
            # after the instance-check above!
>           raise ValueError(f'{type(indices).__name__}, {self._holder.__name__}, '
                             f'{isinstance(indices, self._holder)}, {type(indices) is self._holder}')
E           ValueError: PeriodIndex, Int64Index, True, False

Same happens for DatetimeIndex. It does work for the original is not variant, so I'm leaving that as it is for now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opened a follow-up: #22211

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nevermind, I had gotten confused between instances and classes.

So DatetimeIndex, PeriodIndex and TimedeltaIndex are subclasses of Int64Index - but are unsafe to cast to Int64...? I guess this is intentional?

In any case, under these circumstances, I'm even more convinced that it's best to just stay with the if type(indices) is not self._holder: condition.

pytest.skip('Can only check if we know the index type')
if not len(indices) or isinstance(indices, MultiIndex):
# MultiIndex tested separately in:
# tests/indexes/multi/test_unique_and_duplicates
pytest.skip('Skip check for empty Index and MultiIndex')

idx = self._holder(indices)
if idx.has_duplicates:
# We are testing the duplicated-method here, so we need to know
# exactly which indices are duplicate and how (for the result).
# This is not possible if "idx" has duplicates already, which we
# therefore remove. This is seemingly circular, as drop_duplicates
# invokes duplicated, but in the end, it all works out because we
# cross-check with Series.duplicated, which is tested separately.
idx = idx.drop_duplicates()

n, k = len(idx), 10
duplicated_selection = np.random.choice(n, k * n)
expected = pd.Series(duplicated_selection).duplicated(keep=keep).values
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hate to check as a numpy array, much prefer to check the type and use assert_index_equal or assert_series_equal. is this how the other tests are?

Copy link
Contributor Author

@h-vetinari h-vetinari Jul 20, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Um, index.duplicated() (without return_inverse) yields a numpy array - this is the documented signature (I guess because selecting on an Index really only needs an ndarray).

All the manual duplicated-tests actually create their own data, and know what the correct outcome should be. Here, we're feeding tons of different things through that test, so we need to determine - as I'm doing with duplicated_selection what is actually duplicate; self._holder(idx.values[duplicated_selection]) is then a duplicate Index of the correct type, but we know where its duplicates are (from inspecting duplicated_selection), and therefore can validate.

idx = self._holder(idx.values[duplicated_selection])

result = idx.duplicated(keep=keep)
tm.assert_numpy_array_equal(result, expected)

def test_unique(self, indices):
# don't test a MultiIndex here (as its tested separated)
# don't test a CategoricalIndex because categories change (GH 18291)
if isinstance(indices, (MultiIndex, CategoricalIndex)):
return
pytest.skip('Skip check for MultiIndex/CategoricalIndex')

# GH 17896
expected = indices.drop_duplicates()
Expand All @@ -375,7 +402,7 @@ def test_unique_na(self):
def test_get_unique_index(self, indices):
# MultiIndex tested separately
if not len(indices) or isinstance(indices, MultiIndex):
return
pytest.skip('Skip check for empty Index and MultiIndex')

idx = indices[[0] * 5]
idx_unique = indices[[0]]
Expand All @@ -394,7 +421,7 @@ def test_get_unique_index(self, indices):

# nans:
if not indices._can_hold_na:
return
pytest.skip('Skip na-check if index cannot hold na')

if needs_i8_conversion(indices):
vals = indices.asi8[[0] * 5]
Expand Down Expand Up @@ -423,7 +450,7 @@ def test_sort(self, indices):

def test_mutability(self, indices):
if not len(indices):
return
pytest.skip('Skip check for empty Index')
pytest.raises(TypeError, indices.__setitem__, 0, indices[0])

def test_view(self, indices):
Expand Down Expand Up @@ -761,7 +788,7 @@ def test_equals_op(self):
# GH9947, GH10637
index_a = self.create_index()
if isinstance(index_a, PeriodIndex):
return
pytest.skip('Skip check for PeriodIndex')

n = len(index_a)
index_b = index_a[0:-1]
Expand Down Expand Up @@ -989,11 +1016,11 @@ def test_searchsorted_monotonic(self, indices):
# not implemented for tuple searches in MultiIndex
# or Intervals searches in IntervalIndex
if isinstance(indices, (MultiIndex, IntervalIndex)):
return
pytest.skip('Skip check for MultiIndex/IntervalIndex')

# nothing to test if the index is empty
if indices.empty:
return
pytest.skip('Skip check for empty Index')
value = indices[0]

# determine the expected results (handle dupes for 'right')
Expand Down
5 changes: 4 additions & 1 deletion pandas/tests/indexes/test_category.py
Original file line number Diff line number Diff line change
Expand Up @@ -590,12 +590,15 @@ def test_is_unique(self, values, expected):
ci = CategoricalIndex(values)
assert ci.is_unique is expected

def test_duplicates(self):
def test_has_duplicates(self):

idx = CategoricalIndex([0, 0, 0], name='foo')
assert not idx.is_unique
assert idx.has_duplicates

def test_drop_duplicates(self):

idx = CategoricalIndex([0, 0, 0], name='foo')
expected = CategoricalIndex([0], name='foo')
tm.assert_index_equal(idx.drop_duplicates(), expected)
tm.assert_index_equal(idx.unique(), expected)
Expand Down
2 changes: 1 addition & 1 deletion pandas/tests/indexes/test_range.py
Original file line number Diff line number Diff line change
Expand Up @@ -806,7 +806,7 @@ def test_explicit_conversions(self):
result = a - fidx
tm.assert_index_equal(result, expected)

def test_duplicates(self):
def test_has_duplicates(self):
for ind in self.indices:
if not len(ind):
continue
Expand Down