Dataframe MVP #14
Comments
meta-question (which is probably appropriate for the Thursday call): why do this in a GitHub issue rather than the RFC document? IMO this document is a better home for it so that we can comment inline. |
I hope we can call it something other than "DataFrame". To me, dataframe implies the pandas version of this concept, and I prefer another term to help distinguish. For example, Matlab has Datasets and Tables, R has Tables and xtabs. Examples: Table, Grid, Dataset, DataGrid, DataMatrix, DataSheet, DataPage, RowCol, Screen, Slate, Panel, Lattice, Board, DataBoard, etc. I don't care much, as long as we don't overload the term DataFrame. |
The name "Data Frame" goes back at least to R, so it's not specific to
pandas (though the exact spelling `DataFrame` might be).
On Mon, Jun 15, 2020 at 11:21 AM tdimitri ***@***.***> wrote:
I hope we can term it something else besides "DataFrame". To me dataframe
implies pandas version of this concept and I prefer another term to help
distinguish. For example, Matlab has Datasets and Tables, R has Tables and
xtabs.
Examples: Table, Grid, Dataset, DataGrid, DataMatrix, DataSheet, DataPage,
RowCol, Screen, Slate, Panel, Lattice, Board, DataBoard, etc.
I don't care much, as long as we don't overload the term DataFrame.
It will get confusing when pandas APIs for its DataFrames are different
than this group's recommended APIs for DataFrames.
|
Yeah, I see R has data.frame vs data.table. For many Python data analytics users, the word dataframe is intertwined with the pandas DataFrame. This feels like the "pandas" club, and I hope we can break out from its orbit and declare independence. Is this group's goal to
1. Clean up pandas APIs, consolidate them, and get rid of duplicate methods
2. Come up with a better model for the common tasks at hand
If no. 1, we should just state that is what this group is doing. At which point I'm disappointed and feeling hoodwinked. If no. 2, we should then indicate the most common tasks a user wants to perform, including two-step tasks that can be consolidated into one. For instance, set_index followed by pivot can be made into one task, which indicates some of the problems of following the pandas API model. One step to breaking free from pandas APIs is to change the name. Further, having the same name and methods that do things differently is more confusing for future users. |
I don't care especially strongly about the name, but vaex, modin, dask,
cudf, mars, and pyspark all use DataFrame.
On Mon, Jun 15, 2020 at 8:00 PM tdimitri ***@***.***> wrote:
Yeah I see R with data.frame vs data.table. For many python data analytics
users, the word dataframe is intertwined with pandas dataframe. Feels like
this is the "pandas" club, and I hope we can break out from its orbit and
declare independence. Is this group's goal to
1. Clean up pandas APIs, consolidate them, and get rid of duplicate
methods
2. Come up with a better model for the common tasks at hand
If no. 1, we should just state that is what this group is doing. At which
point I'm disappointed and feeling hoodwinked.
If no. 2, we should then indicate the most common tasks a user wants to
perform - including two step tasks that can be consolidated into one. For
instance, set_index followed by pivot can be made into one task and
indicates some of the problems of following the pandas API model.
One step to breaking free from pandas APIs is to change the name. Further,
having the same name and methods that do things differently is more
confusing for future users.
|
Yes, the impression I get from most of those is that there was a conscious effort to be similar enough to the pandas DataFrame so that existing code could be ported over more easily, and common methods could be used in the same way. In riptide we called it a Dataset, and the name difference worked out well. Our examples use "ds" instead of "df". If we called it DataTable or DataGrid we could use "dt" or "dg". Then it would be easier to explain to users why we have different APIs with different kwargs. Is the goal to be ~80% like the pandas DataFrame API or to be a new class with similar but different methods? Different group members may have different opinions. Perhaps we can take a vote on this because I do care about the name. From my experience, I think calling it Dataframe is bound to lead to similar methods. Maybe that is what most people want, but it is not what I signed up for. |
I think it's best to just vote on it -- we can either do that at tomorrow's meeting, or we can decide to pick a working name for now then vote on a final name later on (when we're finalizing the spec).
Some additional data points:
Note: I don't have a preference towards any particular name. I do feel like the |
@datapythonista The naming discussion has more or less taken over this issue, but I think it's important we address the other points you brought up as well. Maybe we just rename / simplify this issue to be about the naming only and copy the rest of your original post to another issue so we can discuss those points? |
My main idea here, more than making decisions on the specific points, was about the methodology to move forward. We've got several issues now, with very interesting discussions. But I felt that instead of ending up with even more open discussions, finding the points for a minimal API, and deciding on those, could help get started. Then we could work as follows:
I think this should help us stay focused: we would be able to see progress on what has already been agreed, and limit the number of open discussions at a time. But that's just a personal preference, to work in a more structured way. Surely other people will have ideas on how to work efficiently. We can discuss whether this makes sense in the call tomorrow, as well as whether we want to take a subset of pandas as a starting point, as was asked here, and which points we want to start with, if people like this approach. |
Agreed, methodology for constructing the APIs will be the main topic for tomorrow's call. Starting with arrays, where we're a little further along with tooling, and obtaining data and making decisions is generally a bit easier. And then for dataframes we should consider what parts of that methodology are applicable, and how to deal with dataframe-specific pain points. |
When designing an API I often request good use case examples. I am not sure we have those yet (perhaps we do and I missed it). Different industries have different use cases, which can help us determine the APIs. For instance, if I want to select all rows with last name 'Dimitri' and first name 'Thomas', what are the ways to do that? (That is just a simple one; I am hoping for more sophisticated ones.) Therefore I think one step in the methodology for constructing the APIs involves good use cases from different industries. |
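As a concrete starting point for that use case, here is a minimal sketch of how the selection is commonly spelled in pandas today. The column names `first_name` / `last_name` and the sample rows are made up for illustration; this only shows existing pandas behaviour and is not a proposal for the API under discussion.

```python
import pandas as pd

# Made-up data, only for illustration.
people = pd.DataFrame(
    {
        "first_name": ["Thomas", "Anna", "Thomas"],
        "last_name": ["Dimitri", "Dimitri", "Smith"],
    }
)

# 1. Boolean mask passed to __getitem__.
mask = (people["last_name"] == "Dimitri") & (people["first_name"] == "Thomas")
by_mask = people[mask]

# 2. The same mask passed to .loc.
by_loc = people.loc[mask]

# 3. An expression string evaluated by DataFrame.query.
by_query = people.query("last_name == 'Dimitri' and first_name == 'Thomas'")
```

All three spellings select the same rows, which is one small instance of the duplication raised earlier in the thread.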
This has been superseded by #25, closing. |
We've already got several useful discussions open, on different topics. To try to give a bit of structure to the conversations, I propose we start with an initial MVP (minimum viable product) and build on it iteratively.
This is a draft of the topics that we may want to discuss, and a possible order to discuss them:
(`df[col]`, `df[col1, col2]`), and calling methods in 1 vs N columns
The idea would be to discuss and decide about each topic incrementally, and keep defining an API that can be used end to end (with very limited functionality at the beginning). So, focusing on being able to write code with the API, we should be identifying for each topic, the questions that need to be answered to construct the API. And then add to the RFC the API definition based on the agreements.
Next there is a very simple example of dataframe usage. And the questions that need to be answered to define a basic API for them.
The simpler questions that need to be answered to define this MVP API are:
- `DataFrame` or `Dataframe`, to be consistent with Python class capitalization
- `dataframe`, using Python type capitalization (as in `int`, `bool`, `datetime.datetime`...)
- `num_columns`, `num_rows`
- `len`: `len(df)`, `len(df.columns)`
- `shape` (it allows for N dimensions, which for a dataframe is not needed, since it's always 2D)
- `dtypes` property enough?
- `columns`, `column_names`...
The next two questions are also needed, but they are more complex, so I'll be creating separate issues for them:
Loading and exporting data
- (`pandas.read_csv`...) and for loading data from memory (`DataFrame.from_dict`)? Or a standard way for all loading/exporting is preferred?
How to access and set columns in a dataframe
- `__getitem__` directly (`df[col]` / `df[col] = foo`)
- `__getitem__` over a property (`df.col[col]` / `df.col[col] = foo`)
- `df.get(col)` / `df.set(col=foo)`
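To make the candidate spellings easier to compare, here is a minimal sketch that puts them side by side on a toy class. `ToyFrame` and `_ColumnAccessor` are hypothetical names invented for this illustration, backed by a plain dict of lists; they are not part of pandas or of any agreed API, and the choice that `len(df)` counts rows is an assumption.

```python
class _ColumnAccessor:
    """Hypothetical helper for the `df.col[name]` spelling."""

    def __init__(self, data):
        self._data = data

    def __getitem__(self, name):
        return self._data[name]

    def __setitem__(self, name, values):
        self._data[name] = list(values)


class ToyFrame:
    """Hypothetical column store: a dict mapping column name -> list of values."""

    def __init__(self, data):
        self._data = {name: list(values) for name, values in data.items()}

    # Size and metadata spellings from the list of simpler questions.
    @property
    def columns(self):
        return list(self._data)

    @property
    def num_columns(self):
        return len(self._data)

    @property
    def num_rows(self):
        # Assumes every column has the same length; nothing is validated here.
        return len(next(iter(self._data.values()), []))

    def __len__(self):
        # Assumption: len(df) counts rows, so len(df.columns) counts columns.
        return self.num_rows

    @property
    def shape(self):
        return (self.num_rows, self.num_columns)

    # Option 1: __getitem__ / __setitem__ directly on the frame.
    def __getitem__(self, name):
        return self._data[name]

    def __setitem__(self, name, values):
        self._data[name] = list(values)

    # Option 2: __getitem__ / __setitem__ behind a property.
    @property
    def col(self):
        return _ColumnAccessor(self._data)

    # Option 3: plain methods.
    def get(self, name):
        return self._data[name]

    def set(self, **columns):
        for name, values in columns.items():
            self._data[name] = list(values)


df = ToyFrame({"a": [1, 2, 3]})
df["b"] = [4, 5, 6]       # option 1
df.col["c"] = [7, 8, 9]   # option 2
df.set(d=[10, 11, 12])    # option 3
```

One practical difference the sketch surfaces: the keyword form `df.set(col=foo)` can only be written directly for column names that are valid Python identifiers, while the two `__getitem__`-based spellings accept any hashable name.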