Dataframe MVP #14

datapythonista · 2020-06-15T11:36:40Z

We've already got several useful discussions open, on different topics. To give the conversations a bit of structure, I propose we start with an initial MVP (minimum viable product) and build on it iteratively.

This is a draft of the topics that we may want to discuss, and a possible order to discuss them:

Dataframe class name: #17
Get number of rows and columns: #20
Get and set column names: #21
Selecting/accessing columns (df[col], df[col1, col2]), and calling methods in 1 vs N columns
Filter data
Indexing, row labels: Implicit alignment in operations #12
Sorting data
Missing data: #9
Constructor and loading/dumping data
Map operations (abs, isin, clip, str.lower,...)
Map operations with Python operators (+, *,...)
Reductions (sum, mean,...): #11
Aggregating data and window functions
Joining dataframes
Reshaping data (pivot, stack, get dummies...)
Setting data (mutability, adding new columns): Mutability #10
Displaying data, visualization, plotting: API for viewing the frame #15
Time series operations
Sparse data

The idea would be to discuss and decide on each topic incrementally, and keep defining an API that can be used end to end (with very limited functionality at the beginning). So, with the focus on being able to write code with the API, we should identify for each topic the questions that need to be answered to construct the API, and then add the API definition to the RFC based on the agreements.

Below is a very simple example of dataframe usage, followed by the questions that need to be answered to define a basic API for it.

>>> from whatever import dataframe

>>> data = {'a': [1, 2], 'b': [3, 4], 'c': [5, 6]}

>>> df = dataframe.load(data, format='dict')
>>> df
a b c
-----
1 3 5
2 4 6

>>> len(df)
2
>>> len(df.columns)
3

>>> df.dtypes
[int, int, int]

>>> df.columns
['a', 'b', 'c']

>>> df.columns = 'x', 'y', 'z'
>>> df.columns
['x', 'y', 'z']

>>> df
x y z
-----
1 3 5
2 4 6

>>> df['q'] = [7, 8]
>>> df
x y z q
-------
1 3 5 7
2 4 6 8

>>> df['y']
y
-
3
4

>>> df['z', 'x']
z x
---
5 1
6 2

>>> df.dump(format='dict')
{'x': [1, 2], 'y': [3, 4], 'z': [5, 6], 'q': [7, 8]}
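
To make the session above concrete, here is a minimal pure-Python sketch of a class that would support those calls. The class name `Frame` and every implementation detail are assumptions made only for illustration; nothing here has been agreed on, and the naive `dtypes` in particular is a placeholder.

```python
class Frame:
    """Toy column-oriented container (hypothetical sketch, not a proposed API)."""

    def __init__(self, data):
        # Copy each column into its own list; dicts keep insertion order.
        self._data = {name: list(values) for name, values in data.items()}
        if len({len(values) for values in self._data.values()}) > 1:
            raise ValueError("all columns must have the same length")

    @classmethod
    def load(cls, data, format="dict"):
        if format != "dict":
            raise NotImplementedError(format)
        return cls(data)

    def dump(self, format="dict"):
        if format != "dict":
            raise NotImplementedError(format)
        return {name: list(values) for name, values in self._data.items()}

    def __len__(self):
        # Number of rows, matching len(df) in the session above.
        return len(next(iter(self._data.values()), []))

    @property
    def columns(self):
        return list(self._data)

    @columns.setter
    def columns(self, names):
        names = list(names)
        if len(names) != len(self._data) or len(set(names)) != len(names):
            raise ValueError("expected one unique name per existing column")
        self._data = dict(zip(names, self._data.values()))

    @property
    def dtypes(self):
        # Naive placeholder: the Python type of the first value in each column.
        return [type(v[0]) if v else None for v in self._data.values()]

    def __getitem__(self, key):
        # df['y'] selects one column; df['z', 'x'] arrives here as a tuple.
        names = key if isinstance(key, tuple) else (key,)
        return Frame({name: self._data[name] for name in names})

    def __setitem__(self, name, values):
        values = list(values)
        if self._data and len(values) != len(self):
            raise ValueError("new column has the wrong length")
        self._data[name] = values


df = Frame.load({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]}, format='dict')
df.columns = 'x', 'y', 'z'
df['q'] = [7, 8]
```

In this sketch `df['y']` returns a one-column `Frame` rather than a separate column type, which is one of the open design questions rather than a settled choice.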

The simpler questions that need to be answered to define this MVP API are:

Name of the dataframe class. I can think of two main options (feel free to propose more):
- DataFrame or Dataframe, to be consistent with Python class capitalization
- dataframe, using Python type capitalization (as in int, bool, datetime.datetime...)
How to obtain the size of the dataframe?
- Properties (num_columns, num_rows)
- Using Python len: len(df), len(df.columns)
- shape (it allows for N dimensions, which for a dataframe is not needed, since it's always 2D)
How to obtain the dtypes (is a dtypes property enough?)
Setting and getting column names
- Is using a Python property enough?
- What should be the name? columns, column_names...
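
The three size options listed above can be compared side by side in a small sketch. The class name `SizeDemo` is hypothetical, and the property names are simply the candidates mentioned in the list; this is only meant to show how each spelling would read in user code.

```python
class SizeDemo:
    """Hypothetical container exposing all three candidate size spellings."""

    def __init__(self, data):
        self._data = {name: list(values) for name, values in data.items()}

    # Option 1: explicit properties.
    @property
    def num_columns(self):
        return len(self._data)

    @property
    def num_rows(self):
        return len(next(iter(self._data.values()), []))

    # Option 2: Python's len protocol, with len(df) meaning rows
    # and len(df.columns) meaning columns.
    def __len__(self):
        return self.num_rows

    @property
    def columns(self):
        return list(self._data)

    # Option 3: a NumPy-style shape tuple, here always of length 2.
    @property
    def shape(self):
        return (self.num_rows, self.num_columns)


sizes = SizeDemo({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})
by_properties = (sizes.num_rows, sizes.num_columns)
by_len = (len(sizes), len(sizes.columns))
by_shape = sizes.shape
```

One trade-off this makes visible: with `len(df)` the reader has to know that it counts rows and not columns, while the property names are self-describing.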

The next two questions are also needed, but they are more complex, so I'll be creating separate issues for them:

Loading and exporting data
- Should the dataframe class provide a constructor? If it does, should it support different formats (like pandas)?
- Should we have different syntax (as in pandas) for loading data from disk (pandas.read_csv...) and for loading data from memory (DataFrame.from_dict)? Or is a single standard way for all loading/exporting preferred?
How to access and set columns in a dataframe
- With __getitem__ directly (df[col] / df[col] = foo)
- With __getitem__ over a property (df.col[col] / df.col[col] = foo)
- With methods (df.get(col) / df.set(col=foo))
- Is more than one way needed/preferred?
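
As a rough illustration of the three column-access styles listed above, the following sketch implements all of them on one toy class. The names `AccessDemo` and `_ColumnAccessor` are hypothetical, and columns are returned as plain lists only to keep the example short.

```python
class _ColumnAccessor:
    """Hypothetical helper backing the df.col[...] spelling."""

    def __init__(self, data):
        self._data = data  # shares the frame's underlying dict

    def __getitem__(self, name):
        return self._data[name]

    def __setitem__(self, name, values):
        self._data[name] = list(values)


class AccessDemo:
    def __init__(self, data):
        self._data = {name: list(values) for name, values in data.items()}
        # Option 2: __getitem__ over a property-like attribute.
        self.col = _ColumnAccessor(self._data)

    # Option 1: __getitem__ / __setitem__ directly on the frame.
    def __getitem__(self, name):
        return self._data[name]

    def __setitem__(self, name, values):
        self._data[name] = list(values)

    # Option 3: explicit methods.
    def get(self, name):
        return self._data[name]

    def set(self, **columns):
        for name, values in columns.items():
            self._data[name] = list(values)


frame = AccessDemo({'x': [1, 2]})
frame['y'] = [3, 4]        # option 1
frame.col['z'] = [5, 6]    # option 2
frame.set(q=[7, 8])        # option 3
```

Note that the keyword form `set(q=...)`, when written with literal keywords, only works for column names that are valid Python identifiers, which may matter when deciding whether more than one way is needed.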


TomAugspurger · 2020-06-15T12:59:24Z

meta-question (which is probably appropriate for the Thursday call): why do this in a GitHub issue rather than the RFC document? IMO this document is a better home for it so that we can comment inline.

tdimitri · 2020-06-15T16:20:59Z

I hope we can term it something else besides "DataFrame". To me dataframe implies the pandas version of this concept, and I prefer another term to help distinguish. For example, Matlab has Datasets and Tables, R has Tables and xtabs.

Examples: Table, Grid, Dataset, DataGrid, DataMatrix, DataSheet, DataPage, RowCol, Screen, Slate, Panel, Lattice, Board, DataBoard, etc.

I don't care much, as long as we don't overload the term DataFrame.
It will get confusing when pandas APIs for its DataFrames are different than this group's recommended APIs for DataFrames.

TomAugspurger · 2020-06-15T18:21:16Z

The name "Data Frame" goes back at least to R, so it's not specific to pandas (though the exact spelling `DataFrame` might be).

tdimitri · 2020-06-16T01:00:13Z

Yeah I see R with data.frame vs data.table. For many python data analytics users, the word dataframe is intertwined with pandas dataframe. Feels like this is the "pandas" club, and I hope we can break out from its orbit and declare independence. Is this group's goal to

1. Clean up pandas APIs, consolidate them, and get rid of duplicate methods
2. Come up with a better model for the common tasks at hand

If no. 1, we should just state that is what this group is doing. At which point I'm disappointed and feeling hoodwinked.

If no. 2, we should then indicate the most common tasks a user wants to perform - including two step tasks that can be consolidated into one. For instance, setindex followed by pivot can be made into one task and indicates some of the problems of following the pandas API model.

One step to breaking free from the pandas APIs is to change the name. Further, having the same name and methods that do things differently is more confusing for future users.

TomAugspurger · 2020-06-16T19:32:14Z

I don't care especially strongly about the name, but vaex, modin, dask, cudf, mars, and pyspark all use DataFrame.

tdimitri · 2020-06-16T23:43:09Z

Yes, the impression I get from most of those is of a conscious effort to be similar enough to the pandas dataframe so that existing code could be ported over more easily, and common methods could be used in the same way.

In riptide we called it a Dataset, and the name difference worked out well. Our examples use "ds" instead of "df". If we called it DataTable or DataGrid we could use "dt" or "dg". Then it would be easier to explain to users why we have different APIs with different kwargs.

Is the goal to be ~80% like the pandas Dataframe API or to be a new class with similar but different methods? Different group members may have different opinions.

Perhaps we can take a vote on this because I do care about the name. From my experience, I think calling it Dataframe is bound to lead to similar methods. Maybe that is what most people want, but it is not what I signed up for.

jack-pappas · 2020-06-17T15:45:49Z

I think it's best to just vote on it -- we can either do that at tomorrow's meeting, or we can decide to pick a working name for now then vote on a final name later on (when we're finalizing the spec).

DataFrame is more recognizable to end users (due to its widespread use in existing projects), but re-using that name could also lead to some confusion on their part if the specification we produce has non-trivial differences from those existing implementations. I assume that's why R's data.table project chose a different name -- its API is still recognizable as a "DataFrame" in spirit but diverges enough from the built-in R data.frame that calling it the same thing would have been confusing to end users.

Some additional data points:

datatable calls its data structure Frame.
pyarrow uses the name Table
Cassandra (the database) uses the term column family.

Note: I don't have a preference towards any particular name. I do feel like the Data prefix is somewhat implied though, so I'd lean towards a simpler name like Frame or Table.

jack-pappas · 2020-06-17T15:49:57Z

@datapythonista The naming discussion has more-or-less taken over this issue, but I think it's important we address the other points you brought up as well. Maybe we just rename / simplify this issue to be about the naming only and copy the rest of your original post to another issue so we can discuss those points?

datapythonista · 2020-06-17T16:58:58Z

My main idea here, more than making decisions on the specific points, was about the methodology to move forward. We've got several issues now, with very interesting discussions. But I felt that instead of ending up with even more open discussions, finding the points for a minimal API, and deciding on those, could help get started.

Then we could work as follows:

We decide the initial points to discuss (e.g. class name, get/set columns...)
We open issues for those, and we try to reach decisions
Once there is agreement, we write the minimal API in the intended formats (RFC, not sure if we want a Python definition too, or something else)
Then, for the rest of the issues, once there seems to be consensus in one of them, we open a PR to the RFC... with the outcome, to finalize the discussion.

I think this should help a bit keep focused. Being able to see the progress on what has already been agreed, and somehow limiting the number of open discussions at a time. But that's just a personal preference, to work in a more structured way. Surely other people will have ideas on how to work efficiently.

We can discuss if this makes sense in the call tomorrow. Also, if we want to take a subset of pandas as a starting point or not, as it was asked here. And what are the points we want to start with, if people like this approach.

rgommers · 2020-06-17T17:27:53Z

> We can discuss if this makes sense in the call tomorrow.

Agreed, methodology for constructing the APIs will be the main topic for tomorrow's call. Starting with arrays, where we're a little further along with tooling, and obtaining data and making decisions is generally a bit easier. And then for dataframes we should consider what parts of that methodology are applicable, and how to deal with dataframe-specific pain points.

tdimitri · 2020-06-17T18:32:37Z

When designing an API I often request good use case examples. I am not sure we have those yet (perhaps we do and I missed it). Different industries have different use cases, which can help us determine the APIs. For instance, if I want to select all rows with last name 'Dimitri' and first name 'Thomas', what are the ways to do that? (That is just a simple one, I am hoping for more sophisticated ones.)

Therefore I think one step in the methodology for constructing the APIs involves good use cases from different industries.
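
For the simple row-selection use case mentioned above, two possible shapes of an answer can be sketched in plain Python over column-oriented data. The helper names `filter_by_mask` and `filter_by_predicate`, the column names, and the sample rows are all made up for illustration; no library or agreed API is implied.

```python
def filter_by_mask(columns, mask):
    """Keep the rows where the boolean mask is true (column-wise style)."""
    return {
        name: [value for value, keep in zip(values, mask) if keep]
        for name, values in columns.items()
    }


def filter_by_predicate(columns, predicate):
    """Keep the rows for which predicate(row_dict) is true (row-wise style)."""
    names = list(columns)
    rows = zip(*columns.values())
    kept = [row for row in rows if predicate(dict(zip(names, row)))]
    return {name: [row[i] for row in kept] for i, name in enumerate(names)}


# Made-up sample data, only for this example.
people = {
    "first": ["Thomas", "Anna", "Thomas"],
    "last": ["Dimitri", "Dimitri", "Smith"],
    "id": [1, 2, 3],
}

# Way 1: build a mask from whole columns, then apply it.
mask = [
    first == "Thomas" and last == "Dimitri"
    for first, last in zip(people["first"], people["last"])
]
selected_by_mask = filter_by_mask(people, mask)

# Way 2: pass a function that inspects one row at a time.
selected_by_predicate = filter_by_predicate(
    people, lambda row: row["first"] == "Thomas" and row["last"] == "Dimitri"
)
```

Both calls select the same single row here; the difference is whether the condition is expressed over columns or over rows, which is the kind of choice that concrete use cases could help settle.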

datapythonista · 2020-08-17T23:53:47Z

This has been superseded by #25, closing.

datapythonista mentioned this issue Jun 23, 2020

Dataframe class name #17

Open

datapythonista mentioned this issue Aug 7, 2020

Dataframe interchange protocol #25

Closed

datapythonista closed this as completed Aug 17, 2020
