Introduction of RangeIndex #9961

Closed
wants to merge 13 commits into from

@ARF1 ARF1 commented Apr 21, 2015

RangeIndex(1, 10, 2) is a memory-saving alternative to Index(np.arange(1, 10, 2)); cf. #939.

This re-implementation is compatible with the current Index() API and is a drop-in replacement for Int64Index(). It automatically converts to Int64Index() when an operation requires it.

At present the type is preserved only for a small number of operations (e.g. slicing and inner, left and right joins). Most other operations create an equivalent Int64Index (or at least an equivalent numpy array) and fall back to its implementation.

This PR also extends the Index() constructor to allow creation of RangeIndex objects with

Index(20)
Index(2, 20)
Index(0, 20, 2)

in analogy to

range(20)
range(2, 20)
range(0, 20, 2)
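
The memory argument can be illustrated with the standard library alone. The sketch below is not pandas code: it uses `array.array("q", ...)` as a stand-in for a dense 64-bit integer array such as `np.arange(...)`, and the helper names `lazy_size_bytes` and `dense_size_bytes` are made up for this example.

```python
import sys
from array import array


def lazy_size_bytes(r: range) -> int:
    """Size of the range object itself; it stores only start, stop and step."""
    return sys.getsizeof(r)


def dense_size_bytes(r: range) -> int:
    """Payload size of the same values materialized as 64-bit integers."""
    dense = array("q", r)  # "q" = signed 64-bit, comparable to an int64 array
    return len(dense) * dense.itemsize


r = range(1, 1_000_000, 2)
print("range object:", lazy_size_bytes(r), "bytes")
print("dense values:", dense_size_bytes(r), "bytes")
```

The range object stays the same size regardless of how many values it describes, while the dense form grows linearly with the length.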

@jreback
Contributor

jreback commented Apr 21, 2015

almost all testing is now done in test_index.py
not precluding another file but most / all should be there (in a new section) inheriting from the base tester

@ARF1
Author

ARF1 commented Apr 21, 2015

@jreback Thanks. I am currently rewriting the test suite of Int64Index for RangeIndex.

Clearly instantiation will be different. Which of the following cases should be possible?

1. RangeIndex(0.5, 10, 0.1)
2. RangeIndex()  --> empty, might be useful to allow cheap resizing by setting start/stop property of index
3. RangeIndex(1, 1)
4. RangeIndex(np.nan)
5. RangeIndex([1, 2, 3]) --> I.e. Check if list is monotonically increasing and represent compactly if possible, fall-back to `Int64Index` otherwise?
6. RangeIndex(1, dtype=np.int4) --> relevant for `index.values` behaviour. Necessary? Can we restrict to int64?
7. Index(1) --> RangeIndex(1) ?

Should RangeIndex(5).values return np.arange(5) or something memory saving?

@shoyer
Member

shoyer commented Apr 21, 2015

I would not trouble yourself with allowing for non-integer or missing start, step or stop. Also, IMO arguments to RangeIndex should be parsed exactly like the built-in range (except for the optional name keyword argument).

# could *replace* itself on its parent. (tradeoff between instantiation time and
# ability to gc values when they aren't needed anymore)
# TODO: Block setting of start and end
def __new__(cls, left, right=None, step=1, name=None):
Member

I would follow the signature of the builtin range here:

class range(stop, *, name=None)
class range(start, stop, step=1, *, name=None)

You'll need to parse args and kwargs manually to make this work.
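
One way to read this suggestion is sketched below. The function name `parse_range_args` is hypothetical and not part of the PR. It pulls `name` out of the keyword arguments by hand and leaves validation of the positional arguments to the builtin `range`.

```python
def parse_range_args(*args, **kwargs):
    """Parse arguments like the builtin range, plus a keyword-only ``name``.

    Returns a ``(range_object, name)`` tuple.
    """
    name = kwargs.pop("name", None)
    if kwargs:
        raise TypeError("unexpected keyword arguments: %s" % sorted(kwargs))
    # range() checks the argument count, rejects non-integers and step == 0.
    return range(*args), name


print(parse_range_args(20))
print(parse_range_args(2, 20, name="idx"))
print(parse_range_args(0, 20, 2))
```

On Python 3 alone, the signature `def f(*args, name=None)` gives the same keyword-only behaviour without popping from `kwargs`. The manual form also works on Python 2, where `range` returns a list and `xrange` would be the lazy counterpart.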

@shoyer
Member

shoyer commented Apr 21, 2015

For the implementation, I would strongly recommend keeping track of a builtin range object (from Python 3 or a backport) under the hood. range already supports all the bounds checking and slicing logic.
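
A minimal sketch of that idea, assuming Python 3. The class `RangeBacked` is a hypothetical illustration, not the class from the PR. It stores only a `range` and delegates length, membership, indexing and slicing to it.

```python
class RangeBacked:
    """Hypothetical index-like wrapper that stores only a builtin range."""

    def __init__(self, *args):
        if len(args) == 1 and isinstance(args[0], range):
            self._range = args[0]
        else:
            self._range = range(*args)

    def __len__(self):
        return len(self._range)

    def __contains__(self, value):
        return value in self._range

    def __getitem__(self, key):
        # range does the bounds checking (IndexError) and the slice arithmetic.
        result = self._range[key]
        # Slicing a range yields another range, so the compact form is kept.
        return RangeBacked(result) if isinstance(result, range) else result

    def __repr__(self):
        r = self._range
        return "RangeBacked(%d, %d, %d)" % (r.start, r.stop, r.step)


idx = RangeBacked(0, 20, 2)
print(len(idx), idx[3], idx[2:5], 7 in idx)
```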

@ARF1
Author

ARF1 commented Apr 22, 2015

I find that I frequently have to check whether the supplied dtype parameter is equivalent to np.int64. Since dtype can be a string, etc., my current solution is this:

try:
    got_int64_dtype = isinstance(np.dtype(dtype), np.int64)
except TypeError:
    got_int64_dtype = False

Is there a converter function already available for use in the idiom isinstance(some_conv_fn(dtype), np.int64)?

@shoyer Thanks for the helpful advice and references.

@jreback
Contributor

jreback commented Apr 22, 2015

this should use virtually the same __new__ as Int64Index. Use com.is_integer_dtype.
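
For readers following the dtype question: `np.dtype(...)` returns a dtype object, not an `np.int64` scalar, so `isinstance(np.dtype(dtype), np.int64)` is False for every input. The sketch below uses plain NumPy comparisons instead. The helper names `is_int64_dtype` and `is_any_integer_dtype` are made up for this example, and the second one only approximates what an integer-dtype check such as `com.is_integer_dtype` is presumably meant to do.

```python
import numpy as np


def is_int64_dtype(dtype) -> bool:
    """True if ``dtype`` (string, type or np.dtype) normalizes to int64."""
    try:
        return np.dtype(dtype) == np.int64
    except TypeError:
        # np.dtype raises TypeError for inputs it cannot interpret.
        return False


def is_any_integer_dtype(dtype) -> bool:
    """True for any integer dtype, signed or unsigned, of any width."""
    try:
        return np.issubdtype(np.dtype(dtype), np.integer)
    except TypeError:
        return False


for candidate in ("int64", np.int64, "i8", "int32", "float64", "bogus"):
    print(candidate, is_int64_dtype(candidate), is_any_integer_dtype(candidate))
```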

@ARF1
Author

ARF1 commented Apr 22, 2015

Ok, I am slowly making progress: I finally figured out how to do an "analytic" intersection (inner join) between two general, dissimilar RangeIndex of the type range(start, stop, step). Of course this results simply in a new RangeIndex with a new set of parameters.
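
The PR's own intersection code is not shown in this thread. As an illustration of what such an "analytic" intersection can look like, here is one possible approach: solve the pair of congruences with the extended Euclidean algorithm. It is restricted to positive steps, and the function names are hypothetical.

```python
def ext_gcd(a, b):
    """Return (g, s, t) such that a*s + b*t == g == gcd(a, b)."""
    old_r, r = a, b
    old_s, s = 1, 0
    old_t, t = 0, 1
    while r:
        q = old_r // r
        old_r, r = r, old_r - q * r
        old_s, s = s, old_s - q * s
        old_t, t = t, old_t - q * t
    return old_r, old_s, old_t


def range_intersection(a: range, b: range) -> range:
    """Intersect two ranges with positive steps without materializing them."""
    if a.step <= 0 or b.step <= 0:
        raise ValueError("this sketch only handles positive steps")
    if not a or not b:
        return range(0)
    lo = max(a.start, b.start)   # smallest value that could be shared
    hi = min(a[-1], b[-1])       # largest value that could be shared
    if lo > hi:
        return range(0)
    g, s, _ = ext_gcd(a.step, b.step)
    offset = b.start - a.start
    if offset % g:
        return range(0)          # the two progressions never coincide
    lcm = a.step // g * b.step
    # Find k with a.start + a.step*k == b.start (mod b.step).
    k = (s * (offset // g)) % (b.step // g)
    x0 = a.start + a.step * k    # one common value of both progressions
    first = x0 + -((x0 - lo) // lcm) * lcm  # smallest common value >= lo
    return range(first, hi + 1, lcm)


print(list(range_intersection(range(0, 20, 2), range(3, 30, 3))))
```

The result is again a single range whose step is the least common multiple of the two input steps.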

Now I am a bit unsure how to proceed with the union (outer join) of two RangeIndex objects. In general this cannot be represented as a single range(start, stop, step); in the worst case it requires two ranges with different parameters.

One could imagine a CompoundRangeIndex that keeps a list of multiple RangeIndex.

I was planning to avoid converting RangeIndex to Int64Index as much as possible, but at a certain point operations on a numpy array will probably become faster than running through a list of different RangeIndex objects contained in a CompoundRangeIndex...

What does everybody think? Is it worth the effort? Or shall I just convert to an Int64Index when a union is required and be done with it?

@shoyer
Member

shoyer commented Apr 22, 2015

I would just convert to Int64Index when necessary, possibly even for all union operations on RangeIndex objects (for the sake of consistency).
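
To make the representability problem concrete, the sketch below computes a union the expensive way and reports whether the result happens to be evenly spaced. The function `union_or_fallback` is hypothetical. It materializes both inputs, so it only illustrates when a single range suffices, and a plain sorted list stands in for the Int64Index fallback.

```python
def union_or_fallback(a: range, b: range):
    """Return a range if the union is evenly spaced, else a sorted list."""
    merged = sorted(set(a) | set(b))
    if not merged:
        return range(0)
    if len(merged) == 1:
        return range(merged[0], merged[0] + 1)
    step = merged[1] - merged[0]
    if all(y - x == step for x, y in zip(merged, merged[1:])):
        return range(merged[0], merged[-1] + 1, step)
    return merged  # stand-in for converting to a dense integer index


print(union_or_fallback(range(0, 10, 2), range(1, 10, 2)))
print(union_or_fallback(range(0, 10, 2), range(0, 10, 3)))
```

The first call interleaves evens and odds into one contiguous range. The second mixes steps 2 and 3 and has no single-range form, which is the case where falling back to a dense index is the simple option.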

ARF added 5 commits April 23, 2015 20:14
RangeIndex(1, 10, 2) is a memory saving alternative to Index(np.arange(1, 10,2)).
This implementation is compatible with the current Index() api and is a drop-in
replacement for Int64Index(). It automatically converts to Int64Index() when
required by operations.
@ARF1
Author

ARF1 commented Apr 24, 2015

I am closing this in favor of #9977 to deal with merging issues.

@ARF1 ARF1 closed this Apr 24, 2015
@ARF1 ARF1 deleted the range_index branch April 24, 2015 11:22
4 participants